Spark with Python: configuration and a simple Python script
Configuration
For now, let’s assume Spark is already running in the background. Configuring a Spark-Python script to run on a cluster seems to be straightforward. For example, I would like to run a logistic regression from the machine learning toolbox with the Python API. The first thing I should do is include some Python packages
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext
from pyspark import SparkConf
Then I have to configure my script with the following
APP_NAME = 'spark-python-lregression'
conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster('spark://ukko178:7077')
sc = SparkContext(conf=conf)
The code above is the key to running a Spark-Python script in parallel, which is a bit different from running a Spark-Scala script. After the configuration, the only thing I have to do is use the machine learning Python API to perform the logistic regression on some data. The rest of the script is
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("../spark-1.3.0-bin-hadoop2.4/data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Note: tuple-unpacking lambdas like lambda (v, p): ... only work in Python 2,
# so the pair is indexed explicitly here
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
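The `map`/`filter` logic of the error computation can be illustrated with plain Python lists standing in for the RDDs. The (label, prediction) pairs below are made up for illustration, not taken from the real dataset:

```python
# Toy (label, prediction) pairs standing in for the labelsAndPreds RDD.
labels_and_preds = [(1.0, 1), (0.0, 0), (1.0, 0), (0.0, 0)]

# Same logic as the RDD version: count mismatches, divide by the total count.
train_err = len([1 for v, p in labels_and_preds if v != p]) / float(len(labels_and_preds))
print("Training Error = " + str(train_err))  # 0.25, one mismatch out of four
```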
##Submit the script to Spark
As I already mentioned, running this script is very simple. In practice, this can be done with the following command
../spark-1.3.0-bin-hadoop2.4/bin/spark-submit spark-python-lregression.py
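As an aside, the master URL does not have to be hard-coded in `SparkConf`; `spark-submit` also accepts it as a command-line flag, which makes the same script reusable on different clusters (shown here with the same `ukko178` master as above):

```shell
# Alternative: pass the master URL on the command line instead of in SparkConf
../spark-1.3.0-bin-hadoop2.4/bin/spark-submit \
  --master spark://ukko178:7077 \
  spark-python-lregression.py
```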
To make sure that the submitted Spark-Python script runs on the cluster in parallel rather than on the local machine, I would check the information about the submitted job on Spark’s web UI.
In practice, this can be done with the command-line browser lynx
lynx http://localhost:8080
As expected, the submitted job has completed and appears as
Completed Applications

| Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
|---|---|---|---|---|---|---|---|
| app-20150512225125-0013 | spark-python-svm | 80 | 512.0 MB | 2015/05/12 22:51:25 | su | FINISHED | 29 s |
For more information, I would recommend the documentation of the Spark machine learning API.
##Gradient methods in Spark MLlib Python API
The optimization problems introduced in MLlib are mostly solved by gradient-based methods. I will briefly present several of them as follows
- Newton’s method was originally developed to find a root $$f(x)=0$$ of a differentiable function $$f$$.
- In an optimization problem, where the goal is to maximize/minimize a function, Newton’s method is applied to find a stationary point $$f'(x)=0$$ of a twice-differentiable function $$f$$. In other words, it finds a root of the derivative $$f'(x)$$.
- Newton’s method is an iterative method. However, it will find the maximum/minimum in a single iteration if the function $$f(x)$$ to be optimized is quadratic. This is because the method approximates $$f(x)$$ by a quadratic surface at any point $$x$$.
- When working with high-dimensional data, Newton’s method involves inverting a Hessian matrix, which can be computationally expensive. This can be worked around with many approaches:
- A nice short piece of text about gradient, Hessian, and Jacobian.
- One such method minimizes a sum of squared function values, e.g., the non-linear least-squares problem
  $$f(x) = \sum_{i=1}^{n}f_i(x)^2$$
  In particular, the Hessian is approximated by ignoring the second-order derivative term.
- Searching for the root
  - Newton’s method aims to find a root of a function $$f(x)$$ by the iterative update
    $$x_{n+1} = x_{n} - [J_f(x_n)]^{-1}f(x_n)$$
  - Methods that replace the exact Jacobian matrix $$J_f(x_n)$$ are quasi-Newton methods, e.g., replacing $$J_f(x_n)$$ with $$J_f(x_0)$$.
- Searching for the optima
  - This is similar to searching for the root, except that we are looking for a root of the gradient. The Jacobian matrix is replaced by the Hessian matrix. The proposed methods usually exploit the symmetry of the Hessian matrix.
  - The key idea is that the Hessian matrix does not need to be inverted explicitly. Unlike Newton’s method, which inverts the Hessian by solving a system of linear equations, quasi-Newton methods usually estimate the inverse Hessian directly.
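The Newton update above can be sketched in one dimension, where the Jacobian reduces to the scalar derivative $$f'(x)$$. This is a minimal illustration with made-up example functions: $$f(x) = x^2 - 2$$ for root finding, and the quadratic $$q(x) = (x-3)^2$$ to show the one-iteration property for optimization:

```python
def newton_root(f, fprime, x0, iters=20):
    """Newton's method for roots: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)
    return x

def quasi_newton_root(f, fprime, x0, iters=200):
    """Quasi-Newton variant: freeze the derivative at x0 (replace J_f(x_n) with J_f(x_0))."""
    d0 = fprime(x0)  # computed once, reused every iteration
    x = x0
    for _ in range(iters):
        x = x - f(x) / d0
    return x

f = lambda x: x * x - 2.0  # root at sqrt(2)
fp = lambda x: 2.0 * x

print(newton_root(f, fp, 1.0))        # ~1.41421356
print(quasi_newton_root(f, fp, 1.0))  # converges too, just more slowly

# For a quadratic, Newton's optimization step finds the minimum in one iteration:
q_prime = lambda x: 2.0 * (x - 3.0)   # derivative of q(x) = (x - 3)^2
q_second = 2.0                        # constant second derivative
x0 = 10.0
x1 = x0 - q_prime(x0) / q_second
print(x1)  # 3.0, the exact minimizer, regardless of the starting point
```

Freezing the derivative trades per-iteration cost for more iterations, which is the same trade-off quasi-Newton methods make in high dimensions by avoiding the exact Hessian.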
###Stochastic gradient descent (SGD)
- Stochastic gradient descent (SGD), where the gradient is computed as a summation over a subset of examples (a minibatch) located in an RDD.
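The minibatch idea can be sketched with plain NumPy standing in for an RDD. The data, step size, and batch size below are made up for illustration; each step computes the least-squares gradient on a random subset only:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: y = X w_true with no noise, purely illustrative.
w_true = np.array([2.0, -1.0])
X = rng.randn(500, 2)
y = X.dot(w_true)

w = np.zeros(2)
step = 0.1
batch = 32
for _ in range(300):
    idx = rng.choice(len(X), batch, replace=False)  # draw the minibatch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T.dot(Xb.dot(w) - yb) / batch         # gradient on the minibatch only
    w -= step * grad

print(w)  # close to w_true = [2, -1]
```

In MLlib the same idea applies, except the minibatch is sampled from a partitioned RDD, so the per-batch gradient sum is itself computed in parallel across the cluster.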
###BFGS