In this post, I will explain some of the key factors to consider when
optimizing performance (for both querying and indexing) on implementations using
Cloudera Search. Since Cloudera Search is built on top of
SolrCloud, most of the performance considerations that apply to Solr applications also apply to Cloudera Search. Please refer to this
link for an overview of the main factors that affect Solr performance in general.
Below are some of the main factors that I found useful while optimizing performance on
Cloudera Search.
Block Caching
Cloudera Search stores its indexes on the HDFS filesystem. To improve performance, an HDFS block cache option is available. The block cache works by caching the HDFS index blocks in JVM direct memory. It can be enabled in solrconfig.xml. Below are the key parameters involved in tuning the block cache:
1.
Enable Block Cache - by setting
solr.hdfs.blockcache.enabled to be
true. Once block cache is enabled, the read and write caches can be enabled/disabled separately through the following settings:
- solr.hdfs.blockcache.read.enabled
- solr.hdfs.blockcache.write.enabled
There is a
known issue with enabling block cache writing, which may lead to irrecoverably corrupted indexes. It is therefore very important to disable write caching by setting solr.hdfs.blockcache.write.enabled to false.
2.
Configure Memory Slab Count - The slab count determines the number of memory slabs to allocate, where each slab is 128 MB. Set the slab count high enough to cover what the specific application needs (based on its schema and query access patterns).
This is done by setting the
solr.hdfs.blockcache.slab.count parameter.
3.
Enable Global Block Cache - Enabling the global block cache allows multiple Solr cores on the same node to share a single HDFS block cache. This is done by setting
solr.hdfs.blockcache.global to
true.
4.
NRTCachingDirectory - In a Near Real-Time (NRT) setup, it also helps to enable the NRTCachingDirectory (solr.hdfs.nrtcachingdirectory.enable) and to tune its maximum cache size (solr.hdfs.nrtcachingdirectory.maxcachedmb) and maximum merge size (solr.hdfs.nrtcachingdirectory.maxmergesizemb) thresholds.
Below is a sample XML section (in solrconfig.xml) with most of the above parameters:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <int name="solr.hdfs.blockcache.slab.count">100</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
</directoryFactory>
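The sample above leaves out the global block cache switch described in item 3, and it does not spell out how much memory a given slab count implies. Here is a minimal sketch of both, assuming a hypothetical slab count of 8 and assuming the Solr JVM's direct memory limit (the standard -XX:MaxDirectMemorySize JVM option) has been raised enough to cover it. The values are illustrative only, not recommendations.

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- Master switch for the HDFS block cache -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- Share one block cache across all Solr cores on this node (item 3) -->
  <bool name="solr.hdfs.blockcache.global">true</bool>
  <!-- Cache reads, but keep write caching off because of the known index corruption issue -->
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <!-- Hypothetical value: 8 slabs x 128 MB per slab = 1024 MB of JVM direct memory -->
  <int name="solr.hdfs.blockcache.slab.count">8</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
</directoryFactory>
```

By the same arithmetic, the slab count of 100 used in the earlier sample corresponds to 100 x 128 MB = 12800 MB (12.5 GB) of direct memory, so the node needs at least that much headroom outside the Java heap.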
SolrCloud Caching
The caching-related settings play a major role in query performance and response time. Since Solr caches are tied to index searchers, cached entries remain available only as long as the searcher that owns them is valid. Solr supports FastLRUCache, which offers faster reads and can be selected by setting the cache class to solr.FastLRUCache in the solrconfig.xml file. There are three types of caches available in Solr:
FilterCache - This cache stores unordered sets of document IDs matching the queries. It stores the results of any filter queries (fq) by executing and caching each filter separately. Setting the size of this cache high enough helps keep commonly used filter queries cached.
QueryResultCache - This cache stores the top N results of a query. Since it stores only the document IDs returned by the query, its memory usage is lower than that of the filterCache.
DocumentCache - This cache stores the Lucene Document objects that have been fetched from disk. The size of the documentCache memory is dependent on the number and type of fields stored in the document.
Below is a sample configuration setting for the above three cache types:
<filterCache class="solr.FastLRUCache"
size="20000"
initialSize="5000"
autowarmCount="1000"/>
<queryResultCache class="solr.FastLRUCache"
size="10000"
initialSize="5000"
autowarmCount="1000"/>
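<!-- autowarmCount is 0 here: the documentCache cannot be autowarmed, because internal Lucene document IDs change when the index changes -->
<documentCache class="solr.FastLRUCache"
size="5000"
initialSize="2000"
autowarmCount="0"/>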
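How many results the queryResultCache keeps per entry (the "top N" mentioned earlier) is also influenced by two further solrconfig.xml settings, queryResultWindowSize and queryResultMaxDocsCached, which sit in the same query section as the cache definitions. A brief sketch follows; the values are assumptions chosen for illustration, not tuned recommendations.

```xml
<query>
  <!-- Cache a window of results around the requested page:
       with this value, a request for results 0-9 collects and caches IDs 0-19 -->
  <queryResultWindowSize>20</queryResultWindowSize>
  <!-- Upper bound on the number of document IDs cached for any single queryResultCache entry -->
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
</query>
```

A larger window can make paging through results cheaper, at the cost of more memory per cached query.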
Disable Swapping
As with other Hadoop systems, it is recommended to disable Linux swapping on all Solr nodes.
To get more details on tuning Cloudera Search, please refer to this
link