Friday, June 13, 2014

Handling OP_READ_BLOCK and File Not Found error in Cloudera Search

While using Cloudera Search in a Near Real-Time (NRT) scenario, I occasionally see the following error in the Solr logs during periods of heavy writes (index updates):

Failed to connect to /0.0.0.0:50010 for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc for block BP-658256793-10.130.36.84-1390532185717:blk_-197867398637806450_45226458:java.io.IOException: Got error for OP_READ_BLOCK, self=/0.0.0.0:42978, remote=/0.0.0.0:50010, for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc, for pool BP-658256793-10.130.36.84-1390532185717 block -197867398637806450_45226458

DFS chooseDataNode: got # 1 IOException, will wait for 1064.4631017325812 msec.

java.io.FileNotFoundException: File does not exist: /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc

On troubleshooting further, the logs on the DataNode side show the following READ_BLOCK errors:
DatanodeRegistration(<serverip>, storageID=DS-1520167466-0.0.0.0-50010-1389140294030, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster25;nsid=1124238169;c=0):Got exception while serving BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487 to /ip:45293
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-10.130.71.219-1389140286721:blk_5701049520037157281_7691487
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:662)
node1:50010:DataXceiver error processing READ_BLOCK operation  src: /0.0.0.0:45293 dest: /0.0.0.0:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:662)
This issue is related to the bug tracked in SOLR-5693 (https://issues.apache.org/jira/browse/SOLR-5693), where HDFS file merging does not work correctly with NRT search, leading to search failures with file-not-found exceptions. How often the error occurs depends on the velocity of the data changes. The fix has been backported to Cloudera Search and is expected to ship with the next release of Search.

In versions 1.2 and below, a workaround that minimizes the occurrence of this failure is to commit more frequently (i.e., shorten the autoCommit interval).
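For reference, the commit interval is controlled by the autoCommit block in solrconfig.xml. The excerpt below is only an illustrative sketch; the 15-second and 10,000-document thresholds are example values, not recommendations, and should be tuned to your indexing rate:

<!-- solrconfig.xml: illustrative autoCommit settings only -->
<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit at most every 15 seconds (example value) -->
  <maxDocs>10000</maxDocs>           <!-- or after 10,000 added documents, whichever comes first (example value) -->
  <openSearcher>false</openSearcher> <!-- keep hard commits cheap; visibility comes from soft commits -->
</autoCommit>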

In the event of a catastrophic failure, when the replica goes down because of this issue and does NOT recover, you might have to manually copy the index files on HDFS from another replica that is up and running. The error log specifies the actual index files that are missing because of the merge.

The index data files are usually stored in HDFS under /solr/{collection-name}/{core_node#}/data
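A quick way to check this from the command line (assuming the default /solr root shown above; substitute your own collection and core names):

# list the data directory of a core to see which index folders exist
hadoop fs -ls /solr/collectionName/core_node1/data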

There are two files in this directory which are important in identifying the state of the index:

  • index.properties - this file shows the currently active index folder, something like this:

#index.properties
#Wed Apr 30 23:57:38 PDT 2014
index=index.20140430235104131

  • replication.properties - this file shows the replication state of the core:

#Replication details
#Fri May 02 15:57:16 PDT 2014
previousCycleTimeInSeconds=486
indexReplicatedAtList=1399071436498,1399052241128,1398927458820,1398833578758,1398833453263,1398799104838
indexReplicatedAt=1399071436498
timesIndexReplicated=22
lastCycleBytesDownloaded=6012236791
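Both files can also be viewed straight from HDFS on the command line (the collection and core names below are placeholders):

# inspect the current index folder and replication state for a core
hadoop fs -cat /solr/collectionName/core_node1/data/index.properties
hadoop fs -cat /solr/collectionName/core_node1/data/replication.properties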

I have used the "File Browser" application inside "Hue" to browse the files to get a sense of where the files are located and which ones are missing. Below are the steps to fix the issue:

  1. Identify the core (e.g. core_node1) and the missing file(s) from the logs.
  2. Stop the Solr process on the node with the failure. You can use Cloudera Manager to do this.
  3. Copy the index folder (index.xxxxx) from the replica that is working to the failed node, as sketched in the example below. The copy can be done by either:

  • using the "hadoop fs -cp" command, or
  • using the Hue File Browser application.

  4. Copy the index.properties and replication.properties files.
  5. Restart the service.

In either case, the copy operation has to be done as a user (e.g. hdfs) that has write permissions on the HDFS location.
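Here is a minimal shell sketch of steps 3-5, assuming the failed core is core_node1, a healthy copy lives under core_node2, the default /solr layout shown above, and the example index folder name from index.properties; adjust all paths to match what the error logs and index.properties actually report:

# run after stopping Solr on the failed node (step 2), as a user with write access to /solr
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/index.20140430235104131 /solr/collectionName/core_node1/data/
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/index.properties /solr/collectionName/core_node1/data/
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/replication.properties /solr/collectionName/core_node1/data/
# if the destination already contains stale copies of these files, remove or move them aside first
# then restart the Solr service on the failed node (step 5), e.g. from Cloudera Manager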
