This article describes how to provision an EC2 cluster with Spark and Hadoop. By the end, you should be able to run Spark applications that use the HDFS file system.
Quick and dirty
For the impatient reader, here is the quick and dirty way: run the following script. Remember to replace the access key and key pair with your own.
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
--region=eu-west-1 \
--instance-type=t2.micro \
-s 20 \
--hadoop-major-version=2 \
launch spark-cluster
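Before launching, a quick pre-flight check can save a failed run. The sketch below is a hypothetical helper (not part of spark-ec2) that verifies both AWS credential variables are exported and that the key pair file exists with owner-only permissions:

```shell
# Pre-flight check before running spark-ec2 (a sketch; pass your own
# key pair file path as the argument).
check_prereqs() {
  local key_file="$1"
  # Both credentials must be exported for spark-ec2 to talk to AWS.
  [ -n "$AWS_ACCESS_KEY_ID" ]     || { echo "AWS_ACCESS_KEY_ID not set";     return 1; }
  [ -n "$AWS_SECRET_ACCESS_KEY" ] || { echo "AWS_SECRET_ACCESS_KEY not set"; return 1; }
  # The key pair file must exist and be readable only by the owner,
  # otherwise ssh refuses to use it.
  [ -f "$key_file" ] || { echo "key file $key_file missing"; return 1; }
  [ "$(stat -c %a "$key_file" 2>/dev/null || stat -f %Lp "$key_file")" = "400" ] \
    || { echo "run: chmod 400 $key_file"; return 1; }
  echo "ok"
}
```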
Setup the Spark cluster on EC2
- Amazon AWS account

  Apparently, there needs to be an Amazon AWS account in order to use EC2 services.

- Access key

  An access key allows applications to communicate with the EC2 servers.

  - From the AWS front page, click your username in the top right corner, choose My Security Credentials, then Access keys, and click Create access key.
  - Store the access key ID and the secret access key in a file.
- Key pair

  A key pair essentially authenticates your applications/scripts with the EC2 servers.

  - From the front page, choose the EC2 service in the top left corner, click Key Pairs, then choose Create Key Pair.
  - Name the key pair following some pattern, e.g. username+region, so that key pairs from different regions do not get mixed up.
  - Download the file and save it as a .pem file.
  - Change the permissions so that only the user can read it:

        chmod 400 keypairfile.pem
- Export access key

  Export the access key with the following commands, using the key ID and secret key generated in the previous steps:

      export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
      export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
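Since these exports are needed in every new shell, one option is to keep them in the credentials file from the earlier step and source it. This is a sketch; the file name ~/.aws_credentials.sh is an assumption, and any path works:

```shell
# Store the keys once in a credentials file, then source it whenever
# a new shell needs them. Keys below are the example values from this
# article -- replace with your own.
cat > ~/.aws_credentials.sh <<'EOF'
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
EOF
chmod 600 ~/.aws_credentials.sh   # keep the secret private to your user
. ~/.aws_credentials.sh           # exports both variables into this shell
```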
Setup a Spark cluster using spark-ec2
- The spark-ec2 package is no longer part of the Spark distribution, so we need to download it from its GitHub repository.

- Set up a Spark cluster with the following command, which names the cluster spark-cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
      --region=eu-west-1 \
      --instance-type=t2.micro \
      -s 5 \
      --hadoop-major-version=2 \
      launch spark-cluster

  - Specify the key pair file via option -i.
  - Specify the key name via option -k.
  - Give the number of Spark slave nodes via option -s.
  - Specify the Hadoop version via option --hadoop-major-version.

- The Spark version can also be specified via additional options to spark-ec2. Unfortunately, I haven't figured out a good way to automatically build the Spark package on the master node.

      --spark-version=a2c7b2133cfee7fa9abfaa2bfbfb637155466783 \
      --spark-git-repo=https://github.com/apache/spark \
Other cluster operations
- Log in to the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster

- Stop the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 stop spark-cluster

- Restart the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 start spark-cluster

- Destroy the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 destroy spark-cluster
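The four commands above differ only in the action word, so a small wrapper function can cut the repetition. This is a sketch; the variable names are assumptions, and spark-ec2 itself is unchanged:

```shell
# Hypothetical convenience wrapper around spark-ec2: the key name, key
# file, and region are fixed once, and the action is passed as the
# single argument. Override the defaults for your own setup.
SPARK_EC2=${SPARK_EC2:-/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2}
KEY_NAME=${KEY_NAME:-g1euwest}
KEY_FILE=${KEY_FILE:-g1euwest.pem}
REGION=${REGION:-eu-west-1}

cluster() {
  # $1 is one of: login, stop, start, destroy
  "$SPARK_EC2" -k "$KEY_NAME" -i "$KEY_FILE" --region="$REGION" "$1" spark-cluster
}
```

With this in place, `cluster stop` and `cluster start` replace the two long commands above.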
UI
- The Spark UI is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:8080
- The cluster (Ganglia) UI is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:5080/ganglia/
- After the history server has been started as described in the next section, the event history server is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:18080
Some useful settings on the Spark cluster
- Set up the event history log

  - Make a directory for event logs:

        cd ~
        mkdir /tmp/spark-events

  - Start the Spark event history server and restart the Spark engine:

        cd ~/spark/sbin
        ./start-history-server.sh
        ./stop-all.sh; ./start-all.sh

  - Run a Spark application with the event log enabled, e.g. pySpark:

        spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
Run a Spark application on the Spark cluster
Now I try to run a basic pySpark word count example on the works of Shakespeare.
- Log in to the master node of the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster

  or

      ssh -i "g1euwest.pem" root@ec2-54-246-255-51.eu-west-1.compute.amazonaws.com

- Download the data and preprocess it:

      cd ~
      mkdir tmp; cd tmp
      wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
      cat t8.shakespeare.txt | sed 's/ /\n/g' > data.txt

- Move the data to HDFS:

      ephemeral-hdfs/bin/hadoop dfs -mkdir /data/
      ephemeral-hdfs/bin/hadoop dfs -put ~/tmp/data.txt /data/

- Run the pySpark word count example with event history logging enabled:

      spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
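As a quick sanity check without Spark, the same split-and-count logic can be reproduced locally with plain shell tools on a toy string. This is an illustration only, not part of the cluster workflow:

```shell
# Split on spaces exactly as the sed preprocessing step above does,
# then count occurrences of each token -- a local stand-in for the
# Spark word count job, on toy input.
printf 'to be or not to be' \
  | sed 's/ /\n/g' \
  | sort | uniq -c | sort -rn
# "to" and "be" each appear twice; "or" and "not" once.
```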