While using Cloudera Search in a Near Real-Time (NRT) scenario, I occasionally see the following error in the Solr logs during periods of heavy writes (index updates):
Failed to connect to /0.0.0.0:50010 for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc for block BP-658256793-10.130.36.84-1390532185717:blk_-197867398637806450_45226458:java.io.IOException: Got error for OP_READ_BLOCK, self=/0.0.0.0:42978, remote=/0.0.0.0:50010, for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc, for pool BP-658256793-10.130.36.84-1390532185717 block -197867398637806450_45226458
DFS chooseDataNode: got # 1 IOException, will wait for 1064.4631017325812 msec.
java.io.FileNotFoundException: File does not exist: /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc
On troubleshooting further, the logs on the datanode side show the following READ_BLOCK errors:

DatanodeRegistration(<serverip>, storageID=DS-1520167466-0.0.0.0-50010-1389140294030, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster25;nsid=1124238169;c=0): Got exception while serving BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487 to /ip:45293
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-10.130.71.219-1389140286721:blk_5701049520037157281_7691487
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:662)

node1:50010:DataXceiver error processing READ_BLOCK operation src: /0.0.0.0:45293 dest: /0.0.0.0:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:662)

This issue is related to the bug SOLR-5693 (https://issues.apache.org/jira/browse/SOLR-5693), where HDFS file merging does not work correctly with NRT search, which leads to search failures with file-not-found exceptions. Again, how often the error occurs depends on the velocity of the data changes. The fix has been backported to Cloudera Search and is expected to ship with the next release of Search.
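Before assuming the file is truly gone, it can help to confirm what HDFS itself reports for the path and its blocks. A minimal sketch using the standard HDFS tools, reusing the path from the error above (substitute your own collection and core names):

# Check whether the index file still exists in HDFS
hadoop fs -ls /solr/collectionName/core_node1/data/index.20140528010609412/

# Report block-level health for the core's data directory
# (run as a user with HDFS superuser privileges, e.g. hdfs)
sudo -u hdfs hdfs fsck /solr/collectionName/core_node1/data -files -blocks -locations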
In Cloudera Search versions 1.2 and below, a workaround to minimize the occurrence of this failure is to commit more frequently (reduce the autoCommit interval).
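For reference, the commit cadence is controlled from solrconfig.xml. A minimal sketch of what a more aggressive hard-commit setting could look like; the interval values here are purely illustrative, not a Cloudera recommendation:

<!-- solrconfig.xml: commit more often so fewer uncommitted segments accumulate -->
<autoCommit>
  <maxTime>15000</maxTime>            <!-- hard commit every 15 s (illustrative value) -->
  <openSearcher>false</openSearcher>  <!-- keep hard commits cheap -->
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>             <!-- soft commit for NRT visibility -->
</autoSoftCommit>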
In the event of a catastrophic failure, where a replica goes down because of this issue and does NOT recover, you might have to manually copy the index files on HDFS from another replica that is up and running. The error log specifies the actual index files that are missing because of the merge.
The index data files are usually stored in HDFS under /solr/{collection-name}/{core_node#}/data. There are two files in this directory that are important for identifying the state of the index:
- index.properties - this file points to the currently active index folder, for example:
#index.properties
#Wed Apr 30 23:57:38 PDT 2014
index=index.20140430235104131
- replication.properties - this file shows the replication state of the core, for example:
#Replication details
#Fri May 02 15:57:16 PDT 2014
previousCycleTimeInSeconds=486
indexReplicatedAtList=1399071436498,1399052241128,1398927458820,1398833578758,1398833453263,1398799104838
indexReplicatedAt=1399071436498
timesIndexReplicated=22
lastCycleBytesDownloaded=6012236791
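You can also inspect these files without going through Hue; the standard HDFS shell works just as well. A quick sketch, reusing the collection and core names from the logs above (substitute your own):

# List the data directory for the core
hadoop fs -ls /solr/collectionName/core_node1/data

# Check which index folder is currently active
hadoop fs -cat /solr/collectionName/core_node1/data/index.properties

# Check the replication state of the core
hadoop fs -cat /solr/collectionName/core_node1/data/replication.properties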
I have used the "File Browser" application in Hue to get a sense of where the files are located and which ones are missing. Below are the steps to fix the issue:
1. Identify the core (e.g. core_node1) and the missing file(s) from the logs.
2. Stop the Solr process on the node with the failure. You can use Cloudera Manager to do this.
3. Copy the index folder (index.xxxxx) from a replica that is working to the failed node. The copy can be done by either:
   - using the "hadoop fs -cp" command, as in the sketch below, or
   - using the Hue File Browser application.
4. Restart the service.
In both cases, the copy operation has to be done as a user (e.g. hdfs) that has write permissions on the HDFS location.
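A minimal sketch of the copy with the HDFS shell, assuming core_node1 is the failed core, core_node2 holds a healthy copy, and index.20140528010609412 is the index folder named in the error (all of these names are placeholders; substitute the ones from your own logs):

# Run as the hdfs user (or another user with write access to /solr)
sudo -u hdfs hadoop fs -cp \
  /solr/collectionName/core_node2/data/index.20140528010609412 \
  /solr/collectionName/core_node1/data/

# Confirm the files arrived before restarting Solr on the failed node
sudo -u hdfs hadoop fs -ls /solr/collectionName/core_node1/data/index.20140528010609412

# If needed, hand ownership back to the user the Solr process runs as
# (often "solr" in Cloudera Search deployments; check your own setup)
sudo -u hdfs hadoop fs -chown -R solr:solr /solr/collectionName/core_node1/data/index.20140528010609412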