In this post, I will explain some of the key factors to consider while optimizing performance (for both querying and indexing) on implementations using Cloudera Search. Since Cloudera Search is built on top of SolrCloud, most of the performance considerations that apply to Solr applications also apply to Cloudera Search. Please refer to this link for an overview of the main factors that affect Solr performance in general.
Below are some of the main factors that I found useful while optimizing performance on Cloudera Search:
Block Caching
As we know, Cloudera Search uses the HDFS filesystem to store its indexes. To optimize performance, an HDFS block cache option is available. The block cache works by caching the HDFS index blocks in JVM direct memory. The block cache can be enabled in solrconfig.xml. Below are the key parameters involved in tuning the block cache:
1. Enable Block Cache - Set solr.hdfs.blockcache.enabled to true. Once the block cache is enabled, the read and write caches can be enabled or disabled separately through the following settings:
- solr.hdfs.blockcache.read.enabled
- solr.hdfs.blockcache.write.enabled
There is a known issue with block cache writing that may lead to irrecoverably corrupt indexes. So, it is very important to disable write caching by setting solr.hdfs.blockcache.write.enabled to false.
2. Configure Memory Slab Count - The slab count determines the number of memory slabs to allocate, where each slab is 128 MB. Set the slab count high enough for the specific application (based on schema and query access patterns).
This is done by setting the solr.hdfs.blockcache.slab.count parameter.
3. Enable Global Block Cache - Enabling the global block cache allows multiple Solr cores on the same node to share a single HDFS block cache. It is done by setting solr.hdfs.blockcache.global to true.
4. NRTCachingDirectory - If using a Near Real-Time (NRT) setup, enabling the NRTCachingDirectory (solr.hdfs.nrtcachingdirectory.enable) and tuning the maximum cached size (solr.hdfs.nrtcachingdirectory.maxcachedmb) and maximum merge size (solr.hdfs.nrtcachingdirectory.maxmergesizemb) thresholds also helps.
Below is the XML section (in solrconfig.xml) with the above parameters:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <int name="solr.hdfs.blockcache.slab.count">100</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
</directoryFactory>
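To get a feel for how much direct memory a given slab count implies, here is a minimal Python sketch that parses a directoryFactory section, flags a write cache that is not explicitly disabled, and estimates the block cache size. It is only an illustration: the helper names (parse_settings, find_problems, block_cache_bytes) are my own, and the 8 KB block size is an assumption chosen so that 16384 blocks per bank matches the 128 MB slab size mentioned above.

```python
import xml.etree.ElementTree as ET

# Assumption: cache blocks are 8 KB each, so 16384 blocks per bank = one 128 MB slab.
BLOCK_SIZE_BYTES = 8 * 1024

# Trimmed copy of the block cache settings from the sample configuration.
SAMPLE_CONFIG = """
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <int name="solr.hdfs.blockcache.slab.count">100</int>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
</directoryFactory>
"""


def parse_settings(xml_text):
    """Return the <bool>/<int> children of a directoryFactory as a dict."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for child in root:
        name = child.get("name")
        text = (child.text or "").strip()
        if child.tag == "bool":
            settings[name] = text.lower() == "true"
        elif child.tag == "int":
            settings[name] = int(text)
        else:
            settings[name] = text
    return settings


def find_problems(settings):
    """Flag a block cache whose write cache is not explicitly set to false."""
    problems = []
    cache_on = settings.get("solr.hdfs.blockcache.enabled", False)
    write_flag = settings.get("solr.hdfs.blockcache.write.enabled")
    if cache_on and write_flag is not False:
        problems.append("solr.hdfs.blockcache.write.enabled is not explicitly false")
    return problems


def block_cache_bytes(settings):
    """Estimate block cache size as slabs * blocks per bank * block size."""
    slabs = settings["solr.hdfs.blockcache.slab.count"]  # must be set explicitly
    blocks = settings.get("solr.hdfs.blockcache.blocksperbank", 16384)
    return slabs * blocks * BLOCK_SIZE_BYTES


if __name__ == "__main__":
    settings = parse_settings(SAMPLE_CONFIG)
    for problem in find_problems(settings):
        print("WARNING:", problem)
    gib = block_cache_bytes(settings) / 2**30
    print(f"Block cache needs about {gib:.1f} GiB of direct memory")
```

With the sample values (100 slabs), the estimate comes to about 12.5 GiB, so the JVM's direct memory limit (for example, the -XX:MaxDirectMemorySize option) has to be large enough to accommodate it.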
SolrCloud Caching
The caching-related settings play a major role in query performance and response time. Since Solr caches are tied to index searchers, cached entries are available only as long as the searcher that owns them is valid. Solr supports FastLRUCache, which offers faster reads and can be selected by specifying solr.FastLRUCache as the cache class in the solrconfig.xml file. There are three types of caches available in Solr:
FilterCache - This cache stores unordered sets of document IDs matching the queries. It stores the results of filter queries (fq) by executing and caching each filter separately. Sizing this cache large enough helps in caching commonly used filter queries.
QueryResultCache - This cache stores the top N results of a query. Since it stores only the document IDs returned by the query, its memory usage is lower than that of the filterCache.
DocumentCache - This cache stores the Lucene Document objects that have been fetched from disk. The size of the documentCache memory is dependent on the number and type of fields stored in the document.
Below is a sample configuration setting for the above three cache types:
<filterCache class="solr.FastLRUCache" size="20000" initialSize="5000" autowarmCount="1000"/>
<queryResultCache class="solr.FastLRUCache" size="10000" initialSize="5000" autowarmCount="1000"/>
<documentCache class="solr.FastLRUCache" size="5000" initialSize="2000" autowarmCount="1000"/>
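When picking a filterCache size, it helps to check how much heap a full cache could take. The Python sketch below is a rough upper bound only: it assumes every cached filter is held as a bitset with one bit per document in the core, whereas Solr can store small result sets more compactly, so real usage is usually lower. The function name and the 10 million document count are hypothetical values for illustration.

```python
def filter_cache_worst_case_bytes(max_doc, cache_size):
    """Upper bound: every entry is a bitset with one bit per document."""
    bytes_per_entry = (max_doc + 7) // 8  # round up to whole bytes
    return bytes_per_entry * cache_size


if __name__ == "__main__":
    max_doc = 10_000_000  # hypothetical number of documents in the core
    cache_size = 20_000   # filterCache size from the sample configuration
    worst_case = filter_cache_worst_case_bytes(max_doc, cache_size)
    print(f"Worst case: {worst_case / 2**30:.1f} GiB of heap")
```

If the worst case is far beyond the available heap, that is a hint to lower the cache size or to check how many distinct filters the application really uses.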
Disable Swapping
As with other Hadoop systems, it is recommended to disable Linux swapping on all Solr nodes.
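A quick way to check a node is to read the kernel's vm.swappiness value. The following Python sketch is one possible check, assuming a Linux host that exposes /proc/sys/vm/swappiness. The threshold of 1 is my own assumption for "effectively disabled" and the function names are hypothetical, so adjust them to whatever your cluster guidelines say.

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"  # standard Linux location


def read_swappiness(path=SWAPPINESS_PATH):
    """Read the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def swappiness_ok(value, max_allowed=1):
    """Assumption: values at or below max_allowed count as swapping disabled."""
    return value <= max_allowed


if __name__ == "__main__":
    try:
        value = read_swappiness()
    except FileNotFoundError:
        print("vm.swappiness is not available on this system")
    else:
        status = "OK" if swappiness_ok(value) else "too high for a Solr node"
        print(f"vm.swappiness = {value} ({status})")
```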
For more details on tuning Cloudera Search, please refer to this link.