Spark classification models.
System and experiment settings
- Spark runs on a cluster of 1 master node and 14 slave nodes. Each node is a workstation with 16 E5540 @ 2.53 GHz CPU cores and 32 GB of memory.
- The dataset used in the experiments of this post is the well-known a6a data from the LibSVM website.
- The file is in libsvm format, a sparse feature representation that can be loaded directly by a Spark Python function.
- In order to train a classification model and test its performance, we draw samples uniformly at random from the original dataset, which yields a training set with 80% of the examples and a test set with the remaining 20%.
- The statistics of the dataset are shown in the following table.

|Category|Size|
|:--|--:|
|All|11220|
|Training|8977|
|Test|2243|
|Feature|123|

- The following Spark Python code can also be applied to other machine learning problems/datasets on Spark, given a data file in libsvm format. Otherwise, a loading function must be implemented.
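As an illustration of the libsvm format itself (not part of the Spark code), each line stores a label followed by sparse `index:value` pairs; a minimal pure-Python parser might look like the sketch below. The sample line is a made-up example, not a line from a6a.

```python
def parse_libsvm_line(line):
    """Parse one line of a libsvm-format file into (label, sparse features)."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)  # only non-zero entries are stored
    return label, features

# a made-up sample line in libsvm format
label, features = parse_libsvm_line("+1 5:1 7:1 14:1")
```

This is only to make the file format concrete; in the experiments below the loading is done by Spark's own loader.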
Summary of results
- In this section, I present an overview of the results achieved by the different classification models provided by the Spark Python framework.
- We use the same training/test split for the different learning models, which is an 80%/20% random split.
- The performance is measured by Hamming loss, computed on both the training set and the test set, as shown in the following table.

| |Training set|Test set|
|:--|--:|--:|
|SVM|0.1650|0.1575|
|LR|0.1660|0.1787|

- These results suggest that, on the a6a dataset, SVM achieves better performance than logistic regression. In particular, the classification accuracy of SVM on the test set is about 2% higher than that of logistic regression.
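For single-label binary classification, the Hamming loss used above reduces to the fraction of misclassified examples; a minimal sketch of the computation:

```python
def hamming_loss(labels, preds):
    """Fraction of positions where the predicted label differs from the truth."""
    assert len(labels) == len(preds)
    mistakes = sum(1 for y, p in zip(labels, preds) if y != p)
    return mistakes / float(len(labels))
```

This is the same quantity the Spark code below computes with `filter(...).count() / float(size)` on an RDD of (label, prediction) pairs.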
Linear classification models
Two classification learning methods will be discussed: support vector machines (SVM) and logistic regression (LR). The application context is single-label binary classification. Both methods can also be applied to single-label multiclass classification, which, however, will not be covered in this blog post.
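Both models are linear classifiers: at prediction time they threshold a linear score w·x + b, and they differ only in how the weights are trained. A minimal sketch of the shared decision rule (the weights used in any example call are illustrative, not trained):

```python
def linear_predict(w, b, x):
    """Predict 1 if the linear score w.x + b is positive, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0
```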
Load and save data files
- loadLibSVMFile is the function to load data from a file in libsvm format, which is a very popular file format for sparse feature representation.
- In particular, load data from a file in libsvm format with the following command. This command generates an RDD of LabeledPoint.

```python
parsedData = MLUtils.loadLibSVMFile(sc, "../Data/a6a")
```

- It is worth noting that if you load the training and test datasets separately, the feature dimensions of the training and test sets may differ, due to the sparse representation of the libsvm file format. Therefore, it is better to load the whole dataset and split it into training and test sets later on. If you have to load the training and test data separately, you can pass the feature dimension as an input argument of the function.
- saveAsLibSVMFile is the function to save data into a file in libsvm format.
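The 80%/20% split itself is independent of Spark (on an RDD one would typically use `randomSplit`); as a sketch of the idea, a seeded random split in pure Python:

```python
import random

def train_test_split(examples, train_fraction=0.8, seed=42):
    """Shuffle the examples with a fixed seed and split them into two sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes the split reproducible, which matters when comparing models on the same training/test partition, as done in this post.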
Support vector machine SVM (code)
- In general, the idea is to load a binary classification dataset in libsvm format from a file, split it into training and test sets, perform parameter selection on the training data, and make predictions on the test data.
- The complete Python code for running the following experiments with SVM can be found on my GitHub.
Run SVM with parameter selection
- The training function (SVMWithSGD.train in the code below) takes the following parameters:
  - data: The training data, an RDD of LabeledPoint.
  - iterations: The number of iterations (default: 100).
  - step: The step parameter used in SGD (default: 1.0).
  - regParam: The regularizer parameter (default: 0.01).
  - miniBatchFraction: Fraction of data to be used for each SGD iteration (default: 1.0).
  - initialWeights: The initial weights (default: None).
  - regType: The type of regularizer, 'l2' or 'l1'.
- It is better to check the documentation of the function, because Spark changes rapidly and different versions might not be compatible with each other. For example, my Spark is 1.4.1, so I check this version of the function in the Spark documentation.
- The following code performs parameter selection (grid search) for SVM on the training data.

```python
import itertools

from pyspark.mllib.classification import SVMWithSGD

# candidate parameter values for the grid search
numIterValList = [100, 200]
regParamValList = [0.01, 0.1, 1, 10, 100]
stepSizeValList = [0.1, 0.5, 1]
regTypeValList = ['l2', 'l1']

# variables holding the best parameters found so far
bestNumIterVal = 0
bestRegParamVal = 0
bestStepSizeVal = 0
bestRegTypeVal = 0
bestTrainErr = 100

# train an SVM model for every parameter combination and keep the best one
for numIterVal, regParamVal, stepSizeVal, regTypeVal in itertools.product(
        numIterValList, regParamValList, stepSizeValList, regTypeValList):
    model = SVMWithSGD.train(trainingData, iterations=numIterVal,
                             regParam=regParamVal, step=stepSizeVal,
                             regType=regTypeVal)
    labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainSize)
    if trainErr < bestTrainErr:
        bestNumIterVal = numIterVal
        bestRegParamVal = regParamVal
        bestStepSizeVal = stepSizeVal
        bestRegTypeVal = regTypeVal
        bestTrainErr = trainErr
print bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal, bestTrainErr
```
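The select-the-best pattern in the loop above can be factored into a small Spark-independent helper; `train_and_eval` and the parameter names in the example are hypothetical placeholders, not part of the Spark API:

```python
import itertools

def grid_search(train_and_eval, param_grid):
    """Try every combination in param_grid and return (best_params, best_error).

    train_and_eval is any callable that trains a model with the given
    keyword parameters and returns its error."""
    best_params, best_err = None, float('inf')
    for combo in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        err = train_and_eval(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

With Spark, `train_and_eval` would wrap the SVMWithSGD.train call plus the training-error computation shown above.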
Model test
- Test the performance of the model on both the training data and the test data with the following code block.

```python
# Evaluate the model on the training data
labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainSize)
print trainErr

# Evaluate the model on the test data
labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(testSize)
print testErr
```
Experimental results
- The result of parameter selection is shown in the following table.

|Iteration|C|Learning rate|Norm|Hamming loss|
|:--|:--|:--|:--|--:|
|100|0.01|0.1|l2|0.237813720022|
|100|0.01|0.1|l1|0.237813720022|
|100|0.01|0.5|l2|0.186279977691|
|100|0.01|0.5|l1|0.225432236475|
|100|0.01|1|l2|0.16419408812|
|100|0.01|1|l1|0.180479643056|
|100|0.1|0.1|l2|0.237813720022|
|100|0.1|0.1|l1|0.237813720022|
|100|0.1|0.5|l2|0.224316787507|
|100|0.1|0.5|l1|0.237813720022|
|100|0.1|1|l2|0.206804238706|
|100|0.1|1|l1|0.237813720022|
|100|1|0.1|l2|0.237813720022|
|100|1|0.1|l1|0.237813720022|
|100|1|0.5|l2|0.237813720022|
|100|1|0.5|l1|0.237813720022|
|100|1|1|l2|0.237813720022|
|100|1|1|l1|0.237813720022|
|100|10|0.1|l2|0.237813720022|
|100|10|0.1|l1|0.237813720022|
|100|10|0.5|l2|0.237813720022|
|100|10|0.5|l1|0.237813720022|
|100|10|1|l2|0.237813720022|
|100|10|1|l1|0.237813720022|
|100|100|0.1|l2|0.237813720022|
|100|100|0.1|l1|0.237813720022|
|100|100|0.5|l2|0.762186279978|
|100|100|0.5|l1|0.237813720022|
|100|100|1|l2|0.762186279978|
|100|100|1|l1|0.237813720022|
|200|0.01|0.1|l2|0.237813720022|
|200|0.01|0.1|l1|0.237813720022|
|200|0.01|0.5|l2|0.167540435025|
|200|0.01|0.5|l1|0.21226993865|
|200|0.01|1|l2|0.162744004462|
|200|0.01|1|l1|0.169659788065|
|200|0.1|0.1|l2|0.237813720022|
|200|0.1|0.1|l1|0.237813720022|
|200|0.1|0.5|l2|0.215839375349|
|200|0.1|0.5|l1|0.237813720022|
|200|0.1|1|l2|0.204350250976|
|200|0.1|1|l1|0.237813720022|
|200|1|0.1|l2|0.237813720022|
|200|1|0.1|l1|0.237813720022|
|200|1|0.5|l2|0.237813720022|
|200|1|0.5|l1|0.237813720022|
|200|1|1|l2|0.237813720022|
|200|1|1|l1|0.237813720022|
|200|10|0.1|l2|0.237813720022|
|200|10|0.1|l1|0.237813720022|
|200|10|0.5|l2|0.237813720022|
|200|10|0.5|l1|0.237813720022|
|200|10|1|l2|0.237813720022|
|200|10|1|l1|0.237813720022|
|200|100|0.1|l2|0.237813720022|
|200|100|0.1|l1|0.237813720022|
|200|100|0.5|l2|0.762186279978|
|200|100|0.5|l1|0.237813720022|
|200|100|1|l2|0.762186279978|
|200|100|1|l1|0.237813720022|
- The best parameters are shown in the following table.

|Iteration|C|Learning rate|Norm|Hamming loss|
|:--|:--|:--|:--|--:|
|200|0.01|1|l2|0.162744004462|
- Hamming loss on both the training and test data for SVM with the best parameters is shown in the following table.

| |Training set|Test set|
|:--|--:|--:|
|SVM|0.1650|0.1575|
Logistic regression (LR) (code)
- In general, the idea is to load a binary classification dataset in libsvm format from a file, split it into training and test sets, perform parameter selection on the training data, and make predictions on the test data.
- The complete Python code for running the following experiments with logistic regression can be found on my GitHub.
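Unlike the SVM, logistic regression models a probability via the sigmoid function, which is then thresholded at 0.5 to obtain the class label; a minimal sketch (any weights in the example are illustrative, not trained on a6a):

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(w, x, threshold=0.5):
    """Predict 1 if the modeled probability P(y=1|x) reaches the threshold."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if sigmoid(score) >= threshold else 0
```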
Run LR for parameter selection
- The Python code for running the parameter selection procedure of logistic regression is shown in the following code block.

```python
import itertools

from pyspark.mllib.classification import LogisticRegressionWithSGD

def lr(trainingData, testData, trainingSize, testSize):
    '''
    Linear logistic regression classifier with grid search over parameters.
    '''
    # candidate parameter values for the grid search
    numIterValList = [100, 200]
    regParamValList = [0.01, 0.1, 1, 10, 100]
    stepSizeValList = [0.1, 0.5, 1]
    regTypeValList = ['l2', 'l1']

    # variables holding the best parameters found so far
    bestNumIterVal = 200
    bestRegParamVal = 0.01
    bestStepSizeVal = 1
    bestRegTypeVal = 'l2'
    bestTrainErr = 100

    # train an LR model for every parameter combination and keep the best one
    for numIterVal, regParamVal, stepSizeVal, regTypeVal in itertools.product(
            numIterValList, regParamValList, stepSizeValList, regTypeValList):
        model = LogisticRegressionWithSGD.train(trainingData, iterations=numIterVal,
                                                regParam=regParamVal, step=stepSizeVal,
                                                regType=regTypeVal)
        labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
        trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainingSize)
        if trainErr < bestTrainErr:
            bestNumIterVal = numIterVal
            bestRegParamVal = regParamVal
            bestStepSizeVal = stepSizeVal
            bestRegTypeVal = regTypeVal
            bestTrainErr = trainErr
        print numIterVal, regParamVal, stepSizeVal, regTypeVal, trainErr
    print bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal, bestTrainErr
```

- The performance of the best model can be computed on the training and test datasets with the following code.

```python
model = LogisticRegressionWithSGD.train(trainingData, iterations=bestNumIterVal,
                                        regParam=bestRegParamVal, step=bestStepSizeVal,
                                        regType=bestRegTypeVal)

# Evaluate the model on the training data
labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainingSize)
print trainErr

# Evaluate the model on the test data
labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(testSize)
print testErr
```
Experimental results
- The result of parameter selection for logistic regression is shown in the following table.

|Iteration|C|Learning rate|Norm|Hamming loss|
|:--|:--|:--|:--|--:|
|100|0.01|0.1|l2|0.240232844509|
|100|0.01|0.1|l1|0.240344788985|
|100|0.01|0.5|l2|0.18246949513|
|100|0.01|0.5|l1|0.205306168141|
|100|0.01|1|l2|0.17194671443|
|100|0.01|1|l1|0.179782827717|
|100|0.1|0.1|l2|0.240232844509|
|100|0.1|0.1|l1|0.240344788985|
|100|0.1|0.5|l2|0.205306168141|
|100|0.1|0.5|l1|0.240344788985|
|100|0.1|1|l2|0.191536997649|
|100|0.1|1|l1|0.240344788985|
|100|1|0.1|l2|0.240344788985|
|100|1|0.1|l1|0.240344788985|
|100|1|0.5|l2|0.240344788985|
|100|1|0.5|l1|0.240344788985|
|100|1|1|l2|0.240344788985|
|100|1|1|l1|0.240344788985|
|100|10|0.1|l2|0.240344788985|
|100|10|0.1|l1|0.240344788985|
|100|10|0.5|l2|0.240344788985|
|100|10|0.5|l1|0.240344788985|
|100|10|1|l2|0.240344788985|
|100|10|1|l1|0.240344788985|
|100|100|0.1|l2|0.240344788985|
|100|100|0.1|l1|0.240344788985|
|100|100|0.5|l2|0.759655211015|
|100|100|0.5|l1|0.240344788985|
|100|100|1|l2|0.759655211015|
|100|100|1|l1|0.240344788985|
|200|0.01|0.1|l2|0.239785066607|
|200|0.01|0.1|l1|0.240232844509|
|200|0.01|0.5|l2|0.174857270794|
|200|0.01|0.5|l1|0.188850330236|
|200|0.01|1|l2|0.16791671331|
|200|0.01|1|l1|0.173625881563|
|200|0.1|0.1|l2|0.239785066607|
|200|0.1|0.1|l1|0.240344788985|
|200|0.1|0.5|l2|0.195678943244|
|200|0.1|0.5|l1|0.240344788985|
|200|0.1|1|l2|0.190417552894|
|200|0.1|1|l1|0.240344788985|
|200|1|0.1|l2|0.240344788985|
|200|1|0.1|l1|0.240344788985|
|200|1|0.5|l2|0.240344788985|
|200|1|0.5|l1|0.240344788985|
|200|1|1|l2|0.240344788985|
|200|1|1|l1|0.240344788985|
|200|10|0.1|l2|0.240344788985|
|200|10|0.1|l1|0.240344788985|
|200|10|0.5|l2|0.240344788985|
|200|10|0.5|l1|0.240344788985|
|200|10|1|l2|0.240344788985|
|200|10|1|l1|0.240344788985|
|200|100|0.1|l2|0.240344788985|
|200|100|0.1|l1|0.240344788985|
|200|100|0.5|l2|0.759655211015|
|200|100|0.5|l1|0.240344788985|
|200|100|1|l2|0.759655211015|
|200|100|1|l1|0.240344788985|
- The best parameters are shown in the following table.

|Iteration|C|Learning rate|Norm|Hamming loss|
|:--|:--|:--|:--|--:|
|200|0.01|1|l2|0.16791671331|
- Hamming loss on both the training and test data for logistic regression with the best parameters is shown in the following table.

| |Training set|Test set|
|:--|--:|--:|
|LR|0.1660|0.1787|
External reading materials
- Alex Smola has a very concise blog post about parallel optimization using stochastic gradient descent, titled 'Parallel stochastic gradient descent'. :thumbsup:
- The NIPS paper 'Slow learners are fast' by John Langford and coauthors is about SGD on multicore machines in an online learning context.
- The NIPS paper 'Parallelized stochastic gradient descent' by Martin Zinkevich and coauthors is about mini-batch multicore SGD; essentially, this is the approach used in Spark.