Monday, June 30, 2014

Spark Installation Steps on CDH4 using Cloudera Manager

In this post, I will explain the steps to follow for installing Apache Spark on a CDH4 cluster using Cloudera Manager. Apache Spark is a fast, general-purpose cluster computing system with an advanced DAG execution engine that supports in-memory computation.


APACHE SPARK VERSION

As of now, the version of Spark packaged with CDH4 and CDH5 is 0.9. Spark 1.0 will be packaged in the CDH 5.1 release, which is expected soon.


INSTALLATION USING PARCELS

When using CDH, it is recommended to use "Parcels" for deploying and installing packages. Parcels are an alternative binary distribution format supported in Cloudera Manager that makes downloading, distributing, deploying, and maintaining packages much simpler. To learn more about parcels, see the Cloudera Manager documentation.

In CDH5, Spark is included within the CDH parcel. In CDH4, however, you have to install CDH and Spark using separate parcels. Follow the steps below to configure, download, and distribute the parcel required for Spark:
  • In the Cloudera Manager Admin Console, select Administration -> Settings.
  • Click the Parcels category.
  • Find the Remote Parcel Repository URLs property and add the location of the parcel repository (http://archive.cloudera.com/spark/parcels/latest).
  • Save the changes and click "Check for New Parcels".
  • The parcel for Spark (e.g., SPARK 0.9.0-1.cdh4.6.0.p0.98) should appear. Now you can download, distribute, and activate the parcel across the hosts in the cluster using Cloudera Manager.
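
Once the parcel is activated, it is unpacked on every host under the parcel directory. As a quick sanity check from the command line of any host (assuming the default parcel directory):

# Verify that the Spark parcel is activated (default parcel directory assumed)
ls -l /opt/cloudera/parcels/
# An activated parcel shows up as a SPARK symlink pointing to the versioned
# directory, e.g. SPARK -> SPARK-0.9.0-1.cdh4.6.0.p0.98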


MASTER NODE SETUP

Log on to the node that will run the Spark Master role and perform the following configuration as the root user from the command line.
  • Edit the /etc/spark/conf/spark-env.sh file:
    • Set the environment variable STANDALONE_SPARK_MASTER_HOST to the fully qualified domain name of the master host (e.g., masternode.abc.com).
    • Uncomment and set the environment variable DEFAULT_HADOOP_HOME to the Hadoop installation path (/opt/cloudera/parcels/CDH/lib/hadoop).
    • A few other key environment variables that can optionally be changed are listed below (a sample spark-env.sh snippet is shown after this list):
      • SPARK_MASTER_IP: bind the master to a specific IP address
      • SPARK_MASTER_PORT: start the master on a different port (default: 7077)
      • SPARK_MASTER_WEBUI_PORT: port for the Spark master web UI (default: 8080)
      • SPARK_MASTER_OPTS: configuration properties that apply only to the master
  • Edit the file /etc/spark/conf/slaves and enter the fully qualified domain names of all Spark worker hosts:
# A Spark Worker will be started on each of the machines listed below
worker1.abc.com
worker2.abc.com
worker3.abc.com
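
For reference, here is a minimal sketch of the relevant lines in /etc/spark/conf/spark-env.sh after these edits, using the hypothetical master host name from above:

# spark-env.sh (sketch; host name is an example)
export STANDALONE_SPARK_MASTER_HOST=masternode.abc.com
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
# Optional overrides (defaults shown, commented out):
# export SPARK_MASTER_PORT=7077
# export SPARK_MASTER_WEBUI_PORT=8080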

WORKER NODES SETUP

Copy the contents of the /etc/spark/conf/ directory from the master node to all the worker nodes.
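
One way to do this is with scp; a sketch, assuming root SSH access and the hypothetical worker host names from the slaves file above:

# Push the Spark configuration from the master to each worker node
for host in worker1.abc.com worker2.abc.com worker3.abc.com; do
  scp -r /etc/spark/conf/* root@$host:/etc/spark/conf/
done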


STARTING/STOPPING SERVICES

Master Node - Start the Master role on the Spark Master host using the following:
/opt/cloudera/parcels/SPARK/lib/spark/sbin/start-master.sh
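
To confirm the Master role came up, you can check for the running JVM and tail its log; the log location below is an assumption based on the standalone scripts' default of writing under the logs/ directory in the Spark home:

# Check that the Master process is running (jps is part of the JDK)
sudo jps | grep Master
# Tail the master log written by start-master.sh (default log directory assumed)
tail /opt/cloudera/parcels/SPARK/lib/spark/logs/*Master*.out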

Worker Nodes - There is a start-slaves.sh script that can be run from the master to start all the worker nodes. However, this requires passwordless SSH to be configured for root, which I wouldn't recommend. Instead, you can start the worker process on each worker node by running the following command:
nohup /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<<master>>:7077 &

Note that the Spark master host in the command above must be specified appropriately; for me, it only worked without the domain name (i.e., the short host name).
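
Also, nohup appends output to nohup.out in the current directory by default; if you prefer an explicit log file, a sketch (the log path is just an example, and the short host name is assumed to work per the note above):

nohup /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://masternode:7077 > /var/log/spark-worker.log 2>&1 &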

Once the master and worker processes are started, you can verify that they are running correctly by going to the Spark Master web UI at http://<master-node>:18080, where you can see the master status and all the running worker processes. You can also type "spark-shell" on the command line of any worker node to start using the built-in Spark shell.
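
Note that spark-shell runs in local mode unless you point it at the cluster; in Spark 0.9 this is done through the MASTER environment variable. A quick smoke test, using the example master host from above:

# Launch the shell against the standalone master (host name is an example)
MASTER=spark://masternode:7077 spark-shell
# Inside the shell, a trivial job exercises the workers:
# scala> sc.parallelize(1 to 1000).count()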

If you need to stop the master, run the following command:


/opt/cloudera/parcels/SPARK/lib/spark/sbin/stop-master.sh
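
Workers started directly with spark-class (as above) are not covered by the stop scripts, so stop them on each worker node; one way is to match the worker's main class with pkill:

# On each worker node, stop the worker process started earlier
pkill -f org.apache.spark.deploy.worker.Worker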

Happy Sparking!

6 comments:

  1. Thanks for this. However, I suspect many like me would like to use the latest version of Spark, 1.0.1 as of this moment. Please tell me: How do I install that on CDH4 (we are at Cloudera Standard 4.7.3)?

  2. Thanks Matthew - on the CDH4 series, the latest version available through parcels is still 0.9. If you need the latest version, it has to be installed manually, outside of parcels.

  3. I want to use CDH4.6.0 with Spark. Does CDH4.6.0 have Spark? I couldn't find the corresponding docs.

  4. Spark is available as a separate parcel for the CDH4 series, which you can use to install it. I haven't tried it on CDH4.6 though.
