Spark with Python: configuration and a simple Python script

##Configuration

For now, let’s assume Spark is already running in the background. Configuring a Spark-Python script to run on the cluster seems to be straightforward. For example, I would like to run a logistic regression from the machine learning toolbox with the Python API. The first thing to do is to import some Python packages

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext
from pyspark import SparkConf

Then I have to configure my script with the following

APP_NAME = 'spark-python-lregression'
conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster('spark://ukko178:7077')
sc = SparkContext(conf=conf)
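If needed, cluster resources for the job can also be tuned through the same SparkConf object. The settings below are a minimal sketch with illustrative values; they are not part of the original script.

# Optional resource settings (illustrative values, assumed for this sketch)
conf = conf.set('spark.executor.memory', '512m')  # memory per executor
conf = conf.set('spark.cores.max', '80')          # cap on total cores used by this application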

The code above is the key to running a Spark-Python script in parallel, which is a bit different from running a Spark-Scala script. After the configuration, the only thing I have to do is use the machine learning Python API to perform the logistic regression on some data. The rest of the script is

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("../spark-1.3.0-bin-hadoop2.4/data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

##Submit the script to Spark

As I already mentioned, running this script is very simple. In practice, this can be done with the following command

../spark-1.3.0-bin-hadoop2.4/bin/spark-submit spark-python-lregression.py
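Alternatively, the master URL and cluster resources can be passed to spark-submit directly instead of being hard-coded in the script; the flag values below are only illustrative

../spark-1.3.0-bin-hadoop2.4/bin/spark-submit \
  --master spark://ukko178:7077 \
  --executor-memory 512M \
  --total-executor-cores 80 \
  spark-python-lregression.py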

To make sure that the submitted Spark-Python script is running on the cluster in parallel rather than on the local machine, I check the information of the submitted job on the Spark administration web UI. In practice, this can be done with the command-line browser lynx

lynx http://localhost:8080

As expected, the submitted job completes and appears as follows

Completed Applications

| Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
|---|---|---|---|---|---|---|---|
| app-20150512225125-0013 | spark-python-svm | 80 | 512.0 MB | 2015/05/12 22:51:25 | su | FINISHED | 29 s |

For more information, I would recommend the documentation of the Spark machine learning API.

##Gradient methods in Spark MLlib Python API

The optimization problems introduced in MLlib are mostly solved by gradient-based methods. I will briefly present several gradient-based methods as follows
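As a rough common template (my own summary, not taken from the MLlib documentation), all of these methods iterate updates of the form

$$ w_{t+1} = w_t - \gamma_t \, d_t $$

where $\gamma_t$ is a step size and $d_t$ is a search direction: the (mini-batch) gradient of the loss for stochastic gradient descent, or a gradient scaled by curvature information for the (quasi-)Newton family.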

###Newton method

###Gauss-Newton method

###Quasi-Newton method

###Stochastic gradient descent (SGD)

###BFGS

###Limited-memory BFGS (L-BFGS)
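To connect this back to the script above: besides LogisticRegressionWithSGD, the MLlib Python API also provides an L-BFGS based trainer for logistic regression. A minimal sketch, assuming the same parsedData RDD as before (the iteration count is an arbitrary choice):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Sketch (assumption): same data as above, trained with the L-BFGS optimizer instead of SGD
lbfgsModel = LogisticRegressionWithLBFGS.train(parsedData, iterations=100)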

Hongyu Su 12 May 2015