Monday, June 30, 2014

Spark Installation Steps on CDH4 using Cloudera Manager

In this post, I will explain the steps to follow for installing Apache Spark on a CDH4 cluster using Cloudera Manager. Apache Spark is a fast, general-purpose cluster computing system with an advanced DAG execution engine that supports in-memory computation.


APACHE SPARK VERSION

As of now, the version of Spark packaged with CDH4 and CDH5 is 0.9. Spark 1.0 will be packaged in the CDH 5.1 release, which is expected soon.


INSTALLATION USING PARCELS

When using CDH, it is recommended to use "Parcels" for deploying and installing packages. Parcels are an alternative binary distribution format supported in Cloudera Manager that makes downloading, distributing, deploying and maintaining packages much simpler. To learn more about parcels, click here.

In CDH5, Spark is included within the CDH parcel. In CDH4, however, you have to install CDH and Spark using separate parcels. Follow the steps below to configure, download and distribute the parcel required for Spark:
  • In the Cloudera Manager Admin Console, select Administration -> Settings.
  • Click the Parcels category.
  • Find the Remote Parcel Repository URLs property and add the location of the parcel repository (http://archive.cloudera.com/spark/parcels/latest).
  • Save Changes and click "Check for New Parcels".
  • The parcel for Spark (e.g: SPARK 0.9.0-1.cdh4.6.0.p0.98) should appear. Now, you can download, distribute, and activate the parcel across the hosts in the cluster using Cloudera Manager.


MASTER NODE SETUP

Log on to the node that will run as the Spark Master role and perform the following configuration as the root user from the command line.
  • Edit the /etc/spark/conf/spark-env.sh file:
    • Set the environment variable STANDALONE_SPARK_MASTER_HOST to the fully qualified domain name (eg: masternode.abc.com) of the master host.
    • Uncomment and set the environment variable DEFAULT_HADOOP_HOME to the Hadoop installation path (/opt/cloudera/parcels/CDH/lib/hadoop).
    • A few other key environment variables that can optionally be changed are listed below (a sample spark-env.sh reflecting these settings is shown at the end of this section):
      • SPARK_MASTER_IP - bind the master to a specific IP address
      • SPARK_MASTER_PORT - start the master on a different port (default: 7077)
      • SPARK_MASTER_WEBUI_PORT - Spark master web UI port (default: 8080)
      • SPARK_MASTER_OPTS - configuration properties that apply only to the master
  • Edit the file /etc/spark/conf/slaves and enter the fully-qualified domain names of all Spark worker hosts:
# A Spark Worker will be started on each of the machines listed below
worker1.abc.com
worker2.abc.com
worker3.abc.com
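
For reference, a minimal spark-env.sh sketch reflecting the settings above might look like this (the hostname and the optional port values are placeholders; adjust them for your cluster):

# /etc/spark/conf/spark-env.sh (excerpt) - hostname below is a placeholder
export STANDALONE_SPARK_MASTER_HOST=masternode.abc.com
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
# optional overrides (defaults shown in the list above)
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=18080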

WORKER NODES SETUP

Copy the contents of the /etc/spark/conf/ folder from the master node to all the worker nodes.
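
One way to do this is with scp from the master; a minimal sketch, assuming root SSH access and the hypothetical worker hostnames used earlier:

# run on the master node; worker hostnames are placeholders
for host in worker1.abc.com worker2.abc.com worker3.abc.com; do
  scp -r /etc/spark/conf/* root@$host:/etc/spark/conf/
done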


Starting/Stopping Services

Master Node - Start the Master role on the Spark Master host using the following:
/opt/cloudera/parcels/SPARK/lib/spark/sbin/start-master.sh

Worker Nodes - There is a start-slaves.sh script that can be run from the master to start all the worker nodes. However, it requires password-less SSH to be configured for root, which I wouldn't recommend. Instead, you can start the worker process on each worker node by running the following command:
nohup /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<<master>>:7077 &

Note that the Spark master host above must be specified appropriately - for me, it only worked with the short hostname (without the domain name).

Once the master and worker processes are started, you can verify that they are running correctly by going to the Spark Master Web UI at http://<master-node>:18080, which shows the master status and all the running worker processes. You can also type "spark-shell" on the command line of any worker node to start using the built-in Spark shell.
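
If you want the shell to run against the standalone master rather than locally, Spark 0.9 picks the master up from the MASTER environment variable; a sketch, using the short hostname as noted above:

# master hostname is a placeholder; 7077 is the default standalone port
MASTER=spark://masternode:7077 /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-shell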

If you need to stop the master, run the following command:


/opt/cloudera/parcels/SPARK/lib/spark/sbin/stop-master.sh
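
Workers that were started manually with spark-class (as above) can be stopped by killing the Worker process on each node; a rough sketch:

# run on each worker node to stop the standalone Worker JVM
pkill -f org.apache.spark.deploy.worker.Worker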

Happy Sparking!

Sunday, June 15, 2014

Performance Tuning and Optimization for Cloudera Search


In this post, I will explain some of the key factors that should be considered while optimizing performance (for querying and indexing) on implementations using Cloudera Search. Since Cloudera Search is built on top of SolrCloud, most of the performance considerations that are applicable for SOLR applications are still applicable for Cloudera Search. Please refer to this link for an overview of the main factors that affect SOLR performance (in general).

Below are some of the main factors that I found useful while optimizing performance on Cloudera Search

Block Caching
As we know, Cloudera Search uses the HDFS filesystem to store its indexes. To optimize performance, an HDFS block cache option is available; it works by caching HDFS index blocks in JVM direct memory. The block cache is enabled in solrconfig.xml. Below are the key parameters involved in tuning it:

1. Enable Block Cache - set solr.hdfs.blockcache.enabled to true. Once the block cache is enabled, the read and write caches can be enabled/disabled separately through the following settings:
  • solr.hdfs.blockcache.read.enabled
  • solr.hdfs.blockcache.write.enabled
There is a known issue with block cache writing that may lead to irrecoverably corrupt indexes, so it is very important that write caching is disabled by setting solr.hdfs.blockcache.write.enabled to false.

2. Configure Memory Slab Count - The slab count determines the number of memory slabs to allocate, where each slab is 128 MB. Set the slab count (solr.hdfs.blockcache.slab.count) high enough for your specific application, based on the schema and query access patterns.

3. Enable Global Block Cache - Enabling the global block cache allows multiple Solr cores on the same node to share a single HDFS block cache. It is enabled by setting solr.hdfs.blockcache.global to true.

4. NRTCachingDirectory - If using a Near Real-Time (NRT) setup, enabling the NRTCachingDirectory (solr.hdfs.nrtcachingdirectory.enable) and tuning the cache size (solr.hdfs.nrtcachingdirectory.maxcachedmb) and merge size (solr.hdfs.nrtcachingdirectory.maxmergesizemb) thresholds also helps.

Below is the XML section (in solrconfig.xml) with the above parameters:


<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool> 
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <int name="solr.hdfs.blockcache.slab.count">100</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
</directoryFactory>
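
Note that the block cache lives in JVM direct (off-heap) memory, so the Solr JVM must be allowed enough of it: with the settings above, 100 slabs x 128 MB is roughly 12.5 GB. The direct memory limit is a standard JVM flag; how you pass it depends on how Solr is launched in your deployment. A sketch, assuming Solr runs under Tomcat and using 16g as a placeholder value sized above the cache:

# placeholder value; size it above slab.count x 128 MB (100 x 128 MB ~= 12.5 GB)
export CATALINA_OPTS="$CATALINA_OPTS -XX:MaxDirectMemorySize=16g"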


SolrCloud Caching
The caching-related settings play a major role in query performance and response time. Since Solr caches are associated with index searchers, a cache remains valid as long as its searcher is valid. Solr supports FastLRUCache, which offers faster reads and can be configured via solr.FastLRUCache in the solrconfig.xml file. There are three types of caches available in Solr:

FilterCache - This cache stores unordered sets of document IDs matching the queries. It stores the results of any filter queries (fq) by executing and caching each filter separately. Setting this cache to a sufficiently high value helps in caching commonly used filter queries.

QueryResultCache - This cache stores the top N results of a query. Since it stores only the document IDs returned by the query, its memory usage is lower than that of the filterCache.

DocumentCache - This cache stores the Lucene Document objects that have been fetched from disk. The size of the documentCache memory is dependent on the number and type of fields stored in the document.

Below is a sample configuration setting for the above three cache types:

<filterCache class="solr.FastLRUCache"
                 size="20000"
                 initialSize="5000"
                 autowarmCount="1000"/>

<queryResultCache class="solr.FastLRUCache"
                     size="10000"
                     initialSize="5000"
                     autowarmCount="1000"/>

<documentCache class="solr.FastLRUCache"
                   size="5000"
                   initialSize="2000"
                   autowarmCount="1000"/>

Disable Swapping
As with other Hadoop systems, it is recommended to disable Linux swapping on all Solr nodes.
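
A minimal sketch of one common way to do this on Linux (the exact value and persistence mechanism may vary by distribution):

# discourage the kernel from swapping
sudo sysctl -w vm.swappiness=0
# persist the setting across reboots
echo "vm.swappiness=0" | sudo tee -a /etc/sysctl.conf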

To get more details on tuning Cloudera Search, please refer to this link

Friday, June 13, 2014

Handling OP_READ_BLOCK and File Not Found error in Cloudera Search

While using Cloudera Search in a Near Real-Time (NRT) scenario, I occasionally see the following errors in the Solr logs during periods of heavy writes (index updates):

Failed to connect to /0.0.0.0:50010 for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc for block BP-658256793-10.130.36.84-1390532185717:blk_-197867398637806450_45226458:java.io.IOException: Got error for OP_READ_BLOCK, self=/0.0.0.0:42978, remote=/0.0.0.0:50010, for file /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc, for pool BP-658256793-10.130.36.84-1390532185717 block -197867398637806450_45226458

DFS chooseDataNode: got # 1 IOException, will wait for 1064.4631017325812 msec.

java.io.FileNotFoundException: File does not exist: /solr/collectionName/core_node1/data/index.20140528010609412/_zsae_Lucene41_0.doc

On troubleshooting further, the logs on the datanode side show the following READ_BLOCK errors:
DatanodeRegistration(<serverip>, storageID=DS-1520167466-0.0.0.0-50010-1389140294030, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster25;nsid=1124238169;c=0):Got exception while serving BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487 to /ip:45293
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-10.130.71.219-1389140286721:blk_5701049520037157281_7691487
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:662)
node1:50010:DataXceiver error processing READ_BLOCK operation  src: /0.0.0.0:45293 dest: /0.0.0.0:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1661432518-0.0.0.0-1389140286721:blk_5701049520037157281_7691487
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:382)
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:193)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:662)
This issue is related to this bug (https://issues.apache.org/jira/browse/SOLR-5693), where HDFS file merging does not work correctly with NRT search, leading to search failures with file-not-found exceptions. How often the error occurs depends on the velocity of the data changes. The fix has been backported to Cloudera Search and is expected to ship with the next release of Search.

In versions 1.2 and below, a workaround that minimizes the occurrence of this failure is to commit more frequently (i.e., shorten the autocommit interval).

In the event of a catastrophic failure, when the replica goes down because of this issue and does NOT recover, you might have to manually copy over the index files on HDFS from another replica which is up and running. The error log specifies the actual index files that are missing because of the merge.

The index data files are usually stored in HDFS under /solr/{collection-name}/{core_node#}/data
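
For example, you can list a core's data directory from the command line (the collection and core names below are placeholders):

# inspect the index folders and properties files of a core
hadoop fs -ls /solr/collectionName/core_node1/data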

There are two files in this directory which are important in identifying the state of the index:

  • index.properties - this file shows the current active index folder something like this:

#index.properties
#Wed Apr 30 23:57:38 PDT 2014
index=index.20140430235104131

  • replication.properties - this file shows replication state of the core

#Replication details
#Fri May 02 15:57:16 PDT 2014
previousCycleTimeInSeconds=486
indexReplicatedAtList=1399071436498,1399052241128,1398927458820,1398833578758,1398833453263,1398799104838
indexReplicatedAt=1399071436498
timesIndexReplicated=22
lastCycleBytesDownloaded=6012236791

I have used the "File Browser" application inside "Hue" to browse the files to get a  sense of where the files are located and which ones are missing. Below are the steps to fix the issue:

  1. Identify the core (eg: core_node1) and the missing file(s) from the logs.
  2. Stop the solr process on the node with the failure. You can use Cloudera Manager to do this.
  3. Copy the index folder (index.xxxxx) from a replica that is working to the failed node. The copy can be done either:
    • using the "hadoop fs -cp" command, or
    • using the Hue application.
  4. Copy the index.properties and replication.properties files.
  5. Restart the service.

In both cases, the copy operation has to be done as a user (eg: hdfs) that has write permissions on the HDFS location.
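
As a sketch, steps 3 and 4 done with "hadoop fs" might look like the following; the collection, core and index folder names are hypothetical, so substitute the ones from your logs and index.properties (here core_node2 is the healthy replica and core_node1 the failed one):

# remove the stale properties files on the failed core first
sudo -u hdfs hadoop fs -rm /solr/collectionName/core_node1/data/index.properties \
    /solr/collectionName/core_node1/data/replication.properties
# copy the active index folder and the properties files from the healthy replica
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/index.20140430235104131 \
    /solr/collectionName/core_node1/data/
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/index.properties \
    /solr/collectionName/core_node1/data/
sudo -u hdfs hadoop fs -cp /solr/collectionName/core_node2/data/replication.properties \
    /solr/collectionName/core_node1/data/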

Thursday, June 12, 2014

Real-time Clickstream Analytics using Flume, Avro, Kite Morphlines and Impala


In this post, I will explain how to capture clickstream logs from webservers and ingest them into Hive/Impala in real time for clickstream analytics. We will ingest the clickstream logs in real time using Flume, transform them using Morphlines, and store them as Avro records in HDFS, where they can be queried for analytics in Hive and Impala.

Apache Flume
Apache Flume is a high-performance data collection system that is ideal for real-time log ingestion. It is horizontally scalable and very extensible, as we will soon see. Flume supports a wide variety of sources to pull data from - I have used the spooling directory source, which expects immutable files to be placed in a particular directory from where they are processed. The spooling source guarantees reliable delivery of events as long as the files are "immutable and uniquely named".

In my case, the clickstream logs are written on the webserver as csv files. The csv files are rotated every minute and copied to the spooling directory location by a cron job. The current version of Flume (1.4) does not support recursive polling of directories for the spool source, so I wrote a shell script that recursively finds .csv files under the source location and flattens them into the destination folder (using the file path as the file name).

Below is a sample shell script that does the job:

#!/bin/sh

# timestamp used both as a marker file name and as the cutoff for "older" files
TS=`date +%Y%m%d%H%M.%S`

SOURCEDIR=/xx/source
DESTDIR=/xx/destination

# create a marker file with the current timestamp
touch -t $TS $SOURCEDIR/$TS

# find older logs, flatten the path into the file name and move to destination
for FILE in `find $SOURCEDIR -name "*.csv" ! -newer $SOURCEDIR/$TS`
do
  NEWFILE=`echo $FILE | tr '/' '-'`
  mv $FILE $DESTDIR/$NEWFILE
done

# clean up the marker file
rm $SOURCEDIR/$TS
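
To keep the spooling directory fed, the script can be scheduled from cron on the webserver; for example (the script and log paths are hypothetical):

# crontab entry: flatten and move new csv logs every minute
* * * * * /opt/scripts/flatten_clickstream_logs.sh >> /var/log/flatten_clickstream.log 2>&1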

Avro Format and Schema
Avro defines a data format designed to support data-intensive applications - it provides both an RPC and a serialization framework. Avro allows for schema evolution in a scalable manner, and Hive and Impala both support Avro-formatted records. We first need to create an Avro schema file that describes the records we are going to persist. Below is a sample schema for reference:

{
 "namespace": "company.schema",
 "type": "record",
 "name": "Clickstream",
 "fields": [
  {"name": "timestamp", "type": "long"},
  {"name": "client_ip", "type": "string"},
  {"name": "request_type", "type": ["null", "string"]},
  {"name": "request_url", "type": ["null", "string"]}
 ]
}

Once the Avro schema file is created, copy the schema to the Flume agent host:
sudo mkdir /etc/flume-ng/schemas
sudo cp clickstream.avsc /etc/flume-ng/schemas/clickstream.avsc

Kite SDK
Kite is a set of libraries and tools (contributed by Cloudera) that aims to make building systems on top of Hadoop easier. One of the modules it provides is the "Dataset" module, which offers logical abstractions on top of the persistence layer. The Kite Maven plugin also provides Maven goals for packaging, deploying, and running distributed applications. Please refer here for the Kite installation steps. Once installed, you can use the Kite Maven plugin to create the dataset on HDFS:

mvn kite:create-dataset \
  -Dkite.rootDirectory=/data/clickstream \
  -Dkite.datasetName=clickstream \
  -Dkite.avroSchemaFile=/etc/flume-ng/schemas/clickstream.avsc
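
To confirm the dataset was created, you can list the target directory on HDFS (the exact layout of the Kite metadata can vary by version):

# the dataset root should now contain a .metadata folder with the schema
hadoop fs -ls -R /data/clickstream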

CSV Logs to Avro Records
In order to transform the events captured from the csv files into a format that can be used for further processing, we can use Flume Interceptors.

Flume interceptors provide on-the-fly inspection and modification of events, letting us create a customized data flow for our applications. Once an interceptor is attached to a source, it is invoked on every event as it travels between the source and a channel. In addition, multiple interceptors can be chained together in a sequence.



In this case, we can use an interceptor to transform the csv records into Avro-formatted records. The interceptor is implemented using Morphlines, a powerful open source framework for doing ETL transformations in Hadoop applications. A morphline is an in-memory container for transformation commands that are chained together. To transform the records into Avro, we attach a Flume morphline interceptor to the source.

A morphline is defined through a configuration file that contains the commands. Below is a morphline that converts the csv records to Avro:

morphlines: [
  {
    id: convertClickStreamLogsToAvro
    importCommands: ["com.cloudera.**", "org.kitesdk.**"]

    commands: [
      { tryRules {
          catchExceptions : false
          throwExceptionIfAllRulesFailed : true
          rules : [
            # first rule: parse the csv line, convert to Avro and serialize
            {
              commands : [
                { readCSV {
                    separator : ","
                    columns : [timestamp,client_ip,request_type,request_url]
                    trim : true
                    charset : UTF-8
                    quoteChar : "\""
                  }
                }
                { toAvro {
                    schemaFile : /etc/flume-ng/schemas/clickstream.avsc
                  }
                }
                { writeAvroToByteArray {
                    format : containerlessBinary
                  }
                }
              ]
            }
            # next rule: drop any record that failed the rule above
            {
              commands : [
                { dropRecord {} }
              ]
            }
          ]
        }
      }

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Now, copy the morphline file to the flume agent:
sudo cp morphline.conf /etc/flume-ng/conf/morphline.conf

Flume Configuration
The events are forwarded from the source to an HDFS sink through a memory channel. The HDFS sink writes the events to a file whose path is specified in the configuration. The flume properties file has to be set up to specify the source, channel and sink.

clickstreamAgent.channels = mem-channel-hdfs
clickstreamAgent.sources = spoolSource
clickstreamAgent.sinks = clickstream-dataset

Channels Setup
The below section shows the memory channel setup
clickstreamAgent.channels.mem-channel-hdfs.type = memory
clickstreamAgent.channels.mem-channel-hdfs.capacity = 10000000
clickstreamAgent.channels.mem-channel-hdfs.transactionCapacity = 1000

Source and Interceptor Setup
The below section shows the source and the two interceptors that are attached to the source:

clickstreamAgent.sources.spoolSource.type = spooldir
clickstreamAgent.sources.spoolSource.channels = mem-channel-hdfs
clickstreamAgent.sources.spoolSource.spoolDir = /xxx/folder
clickstreamAgent.sources.spoolSource.deletePolicy=immediate

clickstreamAgent.sources.spoolSource.interceptors = attach-schema morphline

# add the schema for our record sink
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.type = static
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.key = flume.avro.schema.url
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.value = file:/etc/flume-ng/schemas/clickstream.avsc

# morphline interceptor config
clickstreamAgent.sources.spoolSource.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
clickstreamAgent.sources.spoolSource.interceptors.morphline.morphlineFile = /etc/flume-ng/conf/morphline.conf
clickstreamAgent.sources.spoolSource.interceptors.morphline.morphlineId = convertClickStreamLogsToAvro


Sink Configuration
The below section shows the HDFS sink setup
# store the clickstream in the avro Dataset
clickstreamAgent.sinks.clickstream-dataset.type = hdfs
clickstreamAgent.sinks.clickstream-dataset.channel = mem-channel-hdfs
clickstreamAgent.sinks.clickstream-dataset.hdfs.path = /data/clickstream
clickstreamAgent.sinks.clickstream-dataset.hdfs.batchSize = 100
clickstreamAgent.sinks.clickstream-dataset.hdfs.fileType = DataStream
clickstreamAgent.sinks.clickstream-dataset.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

Below is the complete configuration file (flume.properties)

clickstreamAgent.channels = mem-channel-hdfs
clickstreamAgent.sources = spoolSource
clickstreamAgent.sinks = clickstream-dataset

clickstreamAgent.channels.mem-channel-hdfs.type = memory
clickstreamAgent.channels.mem-channel-hdfs.capacity = 10000000
clickstreamAgent.channels.mem-channel-hdfs.transactionCapacity = 1000

clickstreamAgent.sources.spoolSource.type = spooldir
clickstreamAgent.sources.spoolSource.channels = mem-channel-hdfs
clickstreamAgent.sources.spoolSource.spoolDir = /xxx/folder
clickstreamAgent.sources.spoolSource.deletePolicy=immediate

clickstreamAgent.sources.spoolSource.interceptors = attach-schema morphline

# add the schema for our record sink
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.type = static
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.key = flume.avro.schema.url
clickstreamAgent.sources.spoolSource.interceptors.attach-schema.value = file:/etc/flume-ng/schemas/clickstream.avsc

# morphline interceptor config
clickstreamAgent.sources.spoolSource.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
clickstreamAgent.sources.spoolSource.interceptors.morphline.morphlineFile = /etc/flume-ng/conf/morphline.conf
clickstreamAgent.sources.spoolSource.interceptors.morphline.morphlineId = convertClickStreamLogsToAvro

# store the clickstream in the avro Dataset
clickstreamAgent.sinks.clickstream-dataset.type = hdfs
clickstreamAgent.sinks.clickstream-dataset.channel = mem-channel-hdfs
clickstreamAgent.sinks.clickstream-dataset.hdfs.path = /data/clickstream
clickstreamAgent.sinks.clickstream-dataset.hdfs.batchSize = 100
clickstreamAgent.sinks.clickstream-dataset.hdfs.fileType = DataStream
clickstreamAgent.sinks.clickstream-dataset.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

The configuration can be updated by copying flume.properties to /etc/flume-ng/conf or through Cloudera Manager (Configuration -> Default Agent).

Testing
Once all of the above setup is done, you can start the flume agent and monitor its logs to see whether events are being processed. After sending some sample logs, you should see files being written to HDFS (/data/clickstream).
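
If you are starting the agent manually rather than through Cloudera Manager, the command looks roughly like this (the agent name must match the one used in flume.properties; the log location is the CDH default and may differ on your setup):

flume-ng agent --name clickstreamAgent \
    --conf /etc/flume-ng/conf \
    --conf-file /etc/flume-ng/conf/flume.properties
# watch the agent log for events being picked up from the spool directory
tail -f /var/log/flume-ng/flume*.log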

In addition, the data can be queried on Hive and Impala.

[xxx:xxx:21000] > invalidate metadata;
Query: invalidate metadata
Query finished, fetching results ...


[xxx:xxx:21000]> show tables;
Query: show tables
Query finished, fetching results ...
+-------------+
| name        |
+-------------+
| clickstream |
+-------------+
[xxx:xxx:21000]> select client_ip from clickstream;
Query: select client_ip from clickstream

Query finished, fetching results ...
+----------------------+
| client_ip            |
+----------------------+
| 127.0.0.1            |
| 0.0.0.0              |
+----------------------+