This article describes how to provision an EC2 cluster with Spark and Hadoop. By the end, you should be able to run Spark applications that use the HDFS file system.
Quick and dirty
For the impatient reader, here is the quick and dirty way: run the following script. Remember to replace the access key and key pair with your own.
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
--region=eu-west-1 \
--instance-type=t2.micro \
-s 20 \
--hadoop-major-version=2 \
launch spark-cluster
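Before launching, a quick pre-flight check can save a failed run. The sketch below is a hypothetical helper (not part of spark-ec2) that verifies both AWS credential variables are exported and that the key pair file exists with owner-only permissions:

```shell
# Pre-flight check before running spark-ec2 (a sketch; pass your own
# key pair file path as the argument).
check_prereqs() {
  local key_file="$1"
  # Both credentials must be exported for spark-ec2 to talk to AWS.
  [ -n "$AWS_ACCESS_KEY_ID" ]     || { echo "AWS_ACCESS_KEY_ID not set";     return 1; }
  [ -n "$AWS_SECRET_ACCESS_KEY" ] || { echo "AWS_SECRET_ACCESS_KEY not set"; return 1; }
  # The key pair file must exist and be readable only by the owner,
  # otherwise ssh refuses to use it.
  [ -f "$key_file" ] || { echo "key file $key_file missing"; return 1; }
  [ "$(stat -c %a "$key_file" 2>/dev/null || stat -f %Lp "$key_file")" = "400" ] \
    || { echo "run: chmod 400 $key_file"; return 1; }
  echo "ok"
}
```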
Setup the Spark cluster on EC2
- Amazon AWS account

  Apparently, there needs to be an Amazon AWS account in order to use EC2 services.

- Access key

  An access key allows applications to communicate with the EC2 servers.

  - From the AWS front page, click your username in the top right corner, choose My Security Credentials, then Access keys, and click Create access key.
  - Store the access key ID and the secret access key in a file.
- Key pair

  A key pair essentially authenticates your applications/scripts with the EC2 servers.

  - From the front page, choose the EC2 service in the top left corner, click Key Pairs, then choose Create Key Pair.
  - Name the key pair following some pattern, e.g. username+region, so that key pairs from different regions do not get mixed up.
  - Download the file and save it as a .pem file.
  - Change the permissions so that only the user can read it:

        chmod 400 keypairfile.pem
- Export access key

  Export the access key with the following commands, using the key ID and secret key generated in the previous steps:

      export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
      export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
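Since these exports are needed in every new shell, one option is to keep them in the credentials file from the earlier step and source it. This is a sketch; the file name ~/.aws_credentials.sh is an assumption, and any path works:

```shell
# Store the keys once in a credentials file, then source it whenever
# a new shell needs them. Keys below are the example values from this
# article -- replace with your own.
cat > ~/.aws_credentials.sh <<'EOF'
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
EOF
chmod 600 ~/.aws_credentials.sh   # keep the secret private to your user
. ~/.aws_credentials.sh           # exports both variables into this shell
```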
Setup a Spark cluster using spark-ec2
- The spark-ec2 package is no longer part of the Spark distribution, so we need to download it from its GitHub repository.

- Set up a Spark cluster with the following command, which names the cluster spark-cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
      --region=eu-west-1 \
      --instance-type=t2.micro \
      -s 5 \
      --hadoop-major-version=2 \
      launch spark-cluster

  - Specify the key pair file via option -i.
  - Specify the key name via option -k.
  - Give the number of Spark slave nodes via option -s.
  - Specify the Hadoop version via option --hadoop-major-version.

- The Spark version can also be specified via additional options to spark-ec2. Unfortunately, I haven't figured out a good way to automatically build the Spark package on the master node.

      --spark-version=a2c7b2133cfee7fa9abfaa2bfbfb637155466783 \
      --spark-git-repo=https://github.com/apache/spark \
Other cluster operations
- Log in to the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster

- Stop the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 stop spark-cluster

- Restart the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 start spark-cluster

- Destroy the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 destroy spark-cluster
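The four commands above differ only in the action word, so a small wrapper function can cut the repetition. This is a sketch; the variable names are assumptions, and spark-ec2 itself is unchanged:

```shell
# Hypothetical convenience wrapper around spark-ec2: the key name, key
# file, and region are fixed once, and the action is passed as the
# single argument. Override the defaults for your own setup.
SPARK_EC2=${SPARK_EC2:-/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2}
KEY_NAME=${KEY_NAME:-g1euwest}
KEY_FILE=${KEY_FILE:-g1euwest.pem}
REGION=${REGION:-eu-west-1}

cluster() {
  # $1 is one of: login, stop, start, destroy
  "$SPARK_EC2" -k "$KEY_NAME" -i "$KEY_FILE" --region="$REGION" "$1" spark-cluster
}
```

With this in place, `cluster stop` and `cluster start` replace the two long commands above.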
UI
- The Spark UI is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:8080
- The cluster (Ganglia) UI is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:5080/ganglia/
- After the history server has been started as described in the next section, the event history server is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:18080
Some useful settings on the Spark cluster
- Set up the event history log

  - Make a directory for event logs:

        cd ~
        mkdir /tmp/spark-events

  - Start the Spark event history server and restart the Spark engine:

        cd ~/spark/sbin
        ./start-history-server.sh
        ./stop-all.sh; ./start-all.sh

  - Run a Spark application with the event log enabled, e.g. pySpark:

        spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
Run a Spark application on the Spark cluster
Now I try to run a basic pySpark word count example on the works of Shakespeare.
- Log in to the master node of the Spark cluster:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster

  or

      ssh -i "g1euwest.pem" root@ec2-54-246-255-51.eu-west-1.compute.amazonaws.com

- Download the data and preprocess it:

      cd ~
      mkdir tmp; cd tmp
      wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
      cat t8.shakespeare.txt | sed 's/ /\n/g' > data.txt

- Move the data to HDFS:

      ephemeral-hdfs/bin/hadoop dfs -mkdir /data/
      ephemeral-hdfs/bin/hadoop dfs -put ~/tmp/data.txt /data/

- Run the pySpark word count example with event history logging enabled:

      spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
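As a quick sanity check without Spark, the same split-and-count logic can be reproduced locally with plain shell tools on a toy string. This is an illustration only, not part of the cluster workflow:

```shell
# Split on spaces exactly as the sed preprocessing step above does,
# then count occurrences of each token -- a local stand-in for the
# Spark word count job, on toy input.
printf 'to be or not to be' \
  | sed 's/ /\n/g' \
  | sort | uniq -c | sort -rn
# "to" and "be" each appear twice; "or" and "not" once.
```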