Spark with Python: configuration and a simple Python script
For now, let’s assume we have Spark running in the background. Configuring a Spark Python script so that it runs on the cluster seems to be straightforward. For example, I would like to run a logistic regression from the machine learning toolbox with the Python API. The first thing I should do is to include some Python packages
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext
from pyspark import SparkConf
Then I have to configure my script with the following
APP_NAME = 'spark-python-lregression'
conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster('spark://ukko178:7077')
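# Optional: further settings can be chained onto the same SparkConf object,
# e.g. the executor memory; the value below is only an illustrative example.
conf = conf.set('spark.executor.memory', '512m')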
sc = SparkContext(conf=conf)
The code above is the key to running a Spark Python script in parallel, which is a bit different from running a Spark Scala script. After the configuration, the only thing I have to do is to use the machine learning Python API to perform the logistic regression on some data. The rest of the script is
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("../spark-1.3.0-bin-hadoop2.4/data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
##Submit the script to Spark
As I already mentioned, running this script is very simple. In practice this can be done with the following command
../spark-1.3.0-bin-hadoop2.4/bin/spark-submit spark-python-lregression.py
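Alternatively, the master URL can be supplied at submission time instead of being hard-coded in the script (in that case the setMaster call would be dropped); a variant of the command above using the same cluster URL:
../spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master spark://ukko178:7077 spark-python-lregression.py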
To make sure that the submitted Spark Python script is running on the cluster in parallel rather than on the local machine, I would check the information of the submitted job in the Spark web UI.
In practice, this can be done with the command line browser lynx
lynx http://localhost:8080
As expected, the submitted job has completed and appears as
Completed Applications:

| Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| app-20150512225125-0013 | spark-python-svm | 80 | 512.0 MB | 2015/05/12 22:51:25 | su | FINISHED | 29 s |
For more information, I would recommend the documentation of the Spark machine learning API.
##Gradient methods in Spark MLlib Python API
The optimization problems that arise in MLlib are mostly solved by gradient-based methods. I will briefly present several gradient-based methods as follows.
Gradient-based methods can be used, for example, to minimize a sum of squared function values, i.e., the non-linear least squares problem
\[f(x) = \sum_{i=1}^{n}f_i(x)^2\]
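For this objective, the gradient that drives such methods follows directly from the chain rule (assuming each \(f_i\) is differentiable)
\[\nabla f(x) = 2\sum_{i=1}^{n}f_i(x)\nabla f_i(x)\]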
###Searching for the root
Newton's method aims to find a root of a function \(f(x)\) by the iterative update
\[x_{n+1} = x_{n} - [J_f(x_n)]^{-1}f(x_n)\]
Methods that replace the exact Jacobian matrix \(J_f(x_n)\) with an approximation are quasi-Newton methods, e.g., replacing \(J_f(x_n)\) with \(J_f(x_0)\).
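In the one-dimensional case the Jacobian reduces to the derivative \(f'(x_n)\), giving the familiar scalar Newton iteration
\[x_{n+1} = x_{n} - \frac{f(x_n)}{f'(x_n)}\]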
###Searching for the optima
This is similar to searching for the root, except that here we are looking for a root of the gradient. The Jacobian matrix is replaced by the Hessian matrix, and the proposed methods usually exploit the symmetry of the Hessian matrix.
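Concretely, applying Newton's method to the gradient gives the update
\[x_{n+1} = x_{n} - [H_f(x_n)]^{-1}\nabla f(x_n)\]
where \(H_f(x_n)\) is the Hessian matrix of \(f\) at \(x_n\).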
The key idea is that the Hessian matrix does not need to be inverted explicitly. Unlike Newton's method, which inverts the Hessian by solving a system of linear equations, quasi-Newton methods usually estimate the inverse Hessian directly, as in the BFGS update sketched below.
###Stochastic gradient descent (SGD)
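In its basic form, when the objective decomposes as a sum over training examples, SGD updates the parameters using the gradient of a single randomly sampled term \(f_{i_t}\) with step size \(\gamma_t\)
\[x_{t+1} = x_{t} - \gamma_t\nabla f_{i_t}(x_t)\]
MLlib's implementation supports a mini-batch variant, where a random fraction of the data is sampled for each update.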
###BFGS
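As noted above, BFGS is a quasi-Newton method that maintains an estimate \(B_k^{-1}\) of the inverse Hessian and refines it from successive gradient differences; a standard form of the update is
\[B_{k+1}^{-1} = (I - \rho_k s_k y_k^{T})B_k^{-1}(I - \rho_k y_k s_k^{T}) + \rho_k s_k s_k^{T}\]
where \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\) and \(\rho_k = 1/(y_k^{T}s_k)\). MLlib exposes a limited-memory variant of this idea (L-BFGS), which avoids storing the full inverse Hessian.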
Hongyu Su 12 May 2015