This blog post describes how to build Spark classification models. The dataset used (a6a) is stored in libsvm format, a sparse feature representation that can be loaded natively by a Spark Python function. The statistics of the dataset are shown in the following table.
Category | Size |
---|---|
All | 11220 |
Training | 8977 |
Test | 2243 |
Features | 123 |
Spark provides a function to load data directly in libsvm format; otherwise, a loading function has to be implemented. The performance is measured by Hamming loss, computed on both the training set and the test set, as shown in the following table.
Model | Training set | Test set |
---|---|---|
SVM | 0.1650 | 0.1575 |
LR | 0.1660 | 0.1787 |
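For single-label binary classification, the Hamming loss is simply the fraction of examples whose predicted label differs from the true label. The following plain-Python sketch (added here for illustration, not part of the Spark API) shows the quantity being reported:

# Hamming loss for single-label binary classification:
# the fraction of predictions that disagree with the true labels
def hamming_loss(labels, predictions):
    errors = sum(1 for y, p in zip(labels, predictions) if y != p)
    return errors / float(len(labels))

print hamming_loss([1, 0, 1, 1], [1, 1, 1, 0])  # 0.5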
Two classification learning methods will be discussed: support vector machines (SVM) and logistic regression (LR). The application context is single-label binary classification. Both methods can also be applied to single-label multiclass classification, which will not be covered in this blog post.
loadLibSVMFile is the function to load data from a file in libsvm format, which is a very popular file format for sparse feature representation. In particular, data in libsvm format can be loaded with the following command, which produces an RDD of Spark LabeledPoint records.
from pyspark.mllib.util import MLUtils
parsedData = MLUtils.loadLibSVMFile(sc, "../Data/a6a")
Note that loadLibSVMFile infers the feature dimension from the data in the libsvm file format. Therefore, it is better to load the whole dataset and split it into training and test sets later on. If you have to load training and test data separately, you can pass the feature dimension as one of the input arguments of the function.
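A minimal sketch of splitting the loaded RDD into training and test sets; the 80/20 ratio and the seed are my own choices rather than values stated in the post, and the variable names match the code blocks below:

# split the full dataset; randomSplit returns a list of RDDs
trainingData, testData = parsedData.randomSplit([0.8, 0.2], seed=0)
trainingData.cache()
testData.cache()
# the sizes are reused by the error computations below
trainSize = trainingData.count()
testSize = testData.count()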
saveAsLibSVMFile is the function to save an RDD of labeled points to a file in libsvm format.
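For completeness, a usage sketch of saveAsLibSVMFile (the output path is made up for illustration):

# write the RDD of LabeledPoint records back out as a directory of libsvm-format part files
MLUtils.saveAsLibSVMFile(parsedData, "../Data/a6a_copy")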
The workflow for SVM is as follows: load data in libsvm format from a file, separate it into training and test sets, perform parameter selection on the training data, and make predictions on the test data. It is better to check the documentation of each function because Spark changes rapidly and different versions might not be compatible with each other. For example, my Spark version is 1.4.1, so I check that version of the function in the Spark documentation.
The following code performs parameter selection (a grid search) for SVM on the training data.
import itertools
from pyspark.mllib.classification import SVMWithSGD

# train an SVM model: grid search over the SGD training parameters
numIterValList = [100, 200]
regParamValList = [0.01, 0.1, 1, 10, 100]
stepSizeValList = [0.1, 0.5, 1]
regTypeValList = ['l2', 'l1']

# variables holding the best parameters found so far
bestNumIterVal = 0
bestRegParamVal = 0
bestStepSizeVal = 0
bestRegTypeVal = 0
bestTrainErr = 100

for numIterVal, regParamVal, stepSizeVal, regTypeVal in itertools.product(numIterValList, regParamValList, stepSizeValList, regTypeValList):
    model = SVMWithSGD.train(trainingData, iterations=numIterVal, regParam=regParamVal, step=stepSizeVal, regType=regTypeVal)
    # Hamming loss on the training set for this parameter combination
    labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainSize)
    if trainErr < bestTrainErr:
        bestNumIterVal = numIterVal
        bestRegParamVal = regParamVal
        bestStepSizeVal = stepSizeVal
        bestRegTypeVal = regTypeVal
        bestTrainErr = trainErr
print bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal, bestTrainErr
Test the performance on both the training data and the test data with the following code block; the model is first retrained with the best parameters found by the grid search.
# retrain the model with the best parameters found by the grid search
model = SVMWithSGD.train(trainingData, iterations=bestNumIterVal, regParam=bestRegParamVal, step=bestStepSizeVal, regType=bestRegTypeVal)

# evaluate the model on the training data
labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainSize)
print trainErr

# evaluate the model on the test data
labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(testSize)
print testErr
The result of parameter selection is shown in the following table.
Iterations | regParam | Step size | regType | Hamming loss (train) |
---|---|---|---|---|
100 | 0.01 | 0.1 | l2 | 0.237813720022 |
100 | 0.01 | 0.1 | l1 | 0.237813720022 |
100 | 0.01 | 0.5 | l2 | 0.186279977691 |
100 | 0.01 | 0.5 | l1 | 0.225432236475 |
100 | 0.01 | 1 | l2 | 0.16419408812 |
100 | 0.01 | 1 | l1 | 0.180479643056 |
100 | 0.1 | 0.1 | l2 | 0.237813720022 |
100 | 0.1 | 0.1 | l1 | 0.237813720022 |
100 | 0.1 | 0.5 | l2 | 0.224316787507 |
100 | 0.1 | 0.5 | l1 | 0.237813720022 |
100 | 0.1 | 1 | l2 | 0.206804238706 |
100 | 0.1 | 1 | l1 | 0.237813720022 |
100 | 1 | 0.1 | l2 | 0.237813720022 |
100 | 1 | 0.1 | l1 | 0.237813720022 |
100 | 1 | 0.5 | l2 | 0.237813720022 |
100 | 1 | 0.5 | l1 | 0.237813720022 |
100 | 1 | 1 | l2 | 0.237813720022 |
100 | 1 | 1 | l1 | 0.237813720022 |
100 | 10 | 0.1 | l2 | 0.237813720022 |
100 | 10 | 0.1 | l1 | 0.237813720022 |
100 | 10 | 0.5 | l2 | 0.237813720022 |
100 | 10 | 0.5 | l1 | 0.237813720022 |
100 | 10 | 1 | l2 | 0.237813720022 |
100 | 10 | 1 | l1 | 0.237813720022 |
100 | 100 | 0.1 | l2 | 0.237813720022 |
100 | 100 | 0.1 | l1 | 0.237813720022 |
100 | 100 | 0.5 | l2 | 0.762186279978 |
100 | 100 | 0.5 | l1 | 0.237813720022 |
100 | 100 | 1 | l2 | 0.762186279978 |
100 | 100 | 1 | l1 | 0.237813720022 |
200 | 0.01 | 0.1 | l2 | 0.237813720022 |
200 | 0.01 | 0.1 | l1 | 0.237813720022 |
200 | 0.01 | 0.5 | l2 | 0.167540435025 |
200 | 0.01 | 0.5 | l1 | 0.21226993865 |
200 | 0.01 | 1 | l2 | 0.162744004462 |
200 | 0.01 | 1 | l1 | 0.169659788065 |
200 | 0.1 | 0.1 | l2 | 0.237813720022 |
200 | 0.1 | 0.1 | l1 | 0.237813720022 |
200 | 0.1 | 0.5 | l2 | 0.215839375349 |
200 | 0.1 | 0.5 | l1 | 0.237813720022 |
200 | 0.1 | 1 | l2 | 0.204350250976 |
200 | 0.1 | 1 | l1 | 0.237813720022 |
200 | 1 | 0.1 | l2 | 0.237813720022 |
200 | 1 | 0.1 | l1 | 0.237813720022 |
200 | 1 | 0.5 | l2 | 0.237813720022 |
200 | 1 | 0.5 | l1 | 0.237813720022 |
200 | 1 | 1 | l2 | 0.237813720022 |
200 | 1 | 1 | l1 | 0.237813720022 |
200 | 10 | 0.1 | l2 | 0.237813720022 |
200 | 10 | 0.1 | l1 | 0.237813720022 |
200 | 10 | 0.5 | l2 | 0.237813720022 |
200 | 10 | 0.5 | l1 | 0.237813720022 |
200 | 10 | 1 | l2 | 0.237813720022 |
200 | 10 | 1 | l1 | 0.237813720022 |
200 | 100 | 0.1 | l2 | 0.237813720022 |
200 | 100 | 0.1 | l1 | 0.237813720022 |
200 | 100 | 0.5 | l2 | 0.762186279978 |
200 | 100 | 0.5 | l1 | 0.237813720022 |
200 | 100 | 1 | l2 | 0.762186279978 |
200 | 100 | 1 | l1 | 0.237813720022 |
The best parameter is shown in the following table.
Iterations | regParam | Step size | regType | Hamming loss (train) |
---|---|---|---|---|
200 | 0.01 | 1 | l2 | 0.162744004462 |
Hamming loss on both training and test data for SVM with the best parameters is shown in the following table.
Model | Training set | Test set |
---|---|---|
SVM | 0.1650 | 0.1575 |
The workflow for logistic regression is the same: load data in libsvm format from a file, separate it into training and test sets, perform parameter selection on the training data, and make predictions on the test data. Python code for the parameter selection procedure of logistic regression is shown in the following code block.
import itertools
from pyspark.mllib.classification import LogisticRegressionWithSGD

def lr(trainingData, testData, trainingSize, testSize):
    '''
    logistic regression classifier: grid search over the SGD training parameters
    '''
    # candidate parameter values
    numIterValList = [100, 200]
    regParamValList = [0.01, 0.1, 1, 10, 100]
    stepSizeValList = [0.1, 0.5, 1]
    regTypeValList = ['l2', 'l1']
    # variables holding the best parameters found so far
    bestNumIterVal = 200
    bestRegParamVal = 0.01
    bestStepSizeVal = 1
    bestRegTypeVal = 'l2'
    bestTrainErr = 100
    for numIterVal, regParamVal, stepSizeVal, regTypeVal in itertools.product(numIterValList, regParamValList, stepSizeValList, regTypeValList):
        model = LogisticRegressionWithSGD.train(trainingData, iterations=numIterVal, regParam=regParamVal, step=stepSizeVal, regType=regTypeVal)
        # Hamming loss on the training set for this parameter combination
        labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
        trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainingSize)
        if trainErr < bestTrainErr:
            bestNumIterVal = numIterVal
            bestRegParamVal = regParamVal
            bestStepSizeVal = stepSizeVal
            bestRegTypeVal = regTypeVal
            bestTrainErr = trainErr
        print numIterVal, regParamVal, stepSizeVal, regTypeVal, trainErr
    print bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal, bestTrainErr
    # return the selected parameters so the evaluation code below can reuse them
    return bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal
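Since the evaluation code below reads the parameter values selected inside lr, the function above returns them. A driver-side call, reusing the RDDs and sizes from the split sketch earlier, might look like this (trainingSize is just an alias for trainSize):

trainingSize = trainSize  # the LR code uses this name for the training-set size
bestNumIterVal, bestRegParamVal, bestStepSizeVal, bestRegTypeVal = lr(trainingData, testData, trainingSize, testSize)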
The performance of the best model can be computed on the training and test datasets with the following code.
# retrain the model with the best parameters found by the grid search
model = LogisticRegressionWithSGD.train(trainingData, iterations=bestNumIterVal, regParam=bestRegParamVal, step=bestStepSizeVal, regType=bestRegTypeVal)

# evaluate the model on the training data
labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(trainingSize)
print trainErr

# evaluate the model on the test data
labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(testSize)
print testErr
The result of parameter selection for logistic regression is shown in the following table.
Iterations | regParam | Step size | regType | Hamming loss (train) |
---|---|---|---|---|
100 | 0.01 | 0.1 | l2 | 0.240232844509 |
100 | 0.01 | 0.1 | l1 | 0.240344788985 |
100 | 0.01 | 0.5 | l2 | 0.18246949513 |
100 | 0.01 | 0.5 | l1 | 0.205306168141 |
100 | 0.01 | 1 | l2 | 0.17194671443 |
100 | 0.01 | 1 | l1 | 0.179782827717 |
100 | 0.1 | 0.1 | l2 | 0.240232844509 |
100 | 0.1 | 0.1 | l1 | 0.240344788985 |
100 | 0.1 | 0.5 | l2 | 0.205306168141 |
100 | 0.1 | 0.5 | l1 | 0.240344788985 |
100 | 0.1 | 1 | l2 | 0.191536997649 |
100 | 0.1 | 1 | l1 | 0.240344788985 |
100 | 1 | 0.1 | l2 | 0.240344788985 |
100 | 1 | 0.1 | l1 | 0.240344788985 |
100 | 1 | 0.5 | l2 | 0.240344788985 |
100 | 1 | 0.5 | l1 | 0.240344788985 |
100 | 1 | 1 | l2 | 0.240344788985 |
100 | 1 | 1 | l1 | 0.240344788985 |
100 | 10 | 0.1 | l2 | 0.240344788985 |
100 | 10 | 0.1 | l1 | 0.240344788985 |
100 | 10 | 0.5 | l2 | 0.240344788985 |
100 | 10 | 0.5 | l1 | 0.240344788985 |
100 | 10 | 1 | l2 | 0.240344788985 |
100 | 10 | 1 | l1 | 0.240344788985 |
100 | 100 | 0.1 | l2 | 0.240344788985 |
100 | 100 | 0.1 | l1 | 0.240344788985 |
100 | 100 | 0.5 | l2 | 0.759655211015 |
100 | 100 | 0.5 | l1 | 0.240344788985 |
100 | 100 | 1 | l2 | 0.759655211015 |
100 | 100 | 1 | l1 | 0.240344788985 |
200 | 0.01 | 0.1 | l2 | 0.239785066607 |
200 | 0.01 | 0.1 | l1 | 0.240232844509 |
200 | 0.01 | 0.5 | l2 | 0.174857270794 |
200 | 0.01 | 0.5 | l1 | 0.188850330236 |
200 | 0.01 | 1 | l2 | 0.16791671331 |
200 | 0.01 | 1 | l1 | 0.173625881563 |
200 | 0.1 | 0.1 | l2 | 0.239785066607 |
200 | 0.1 | 0.1 | l1 | 0.240344788985 |
200 | 0.1 | 0.5 | l2 | 0.195678943244 |
200 | 0.1 | 0.5 | l1 | 0.240344788985 |
200 | 0.1 | 1 | l2 | 0.190417552894 |
200 | 0.1 | 1 | l1 | 0.240344788985 |
200 | 1 | 0.1 | l2 | 0.240344788985 |
200 | 1 | 0.1 | l1 | 0.240344788985 |
200 | 1 | 0.5 | l2 | 0.240344788985 |
200 | 1 | 0.5 | l1 | 0.240344788985 |
200 | 1 | 1 | l2 | 0.240344788985 |
200 | 1 | 1 | l1 | 0.240344788985 |
200 | 10 | 0.1 | l2 | 0.240344788985 |
200 | 10 | 0.1 | l1 | 0.240344788985 |
200 | 10 | 0.5 | l2 | 0.240344788985 |
200 | 10 | 0.5 | l1 | 0.240344788985 |
200 | 10 | 1 | l2 | 0.240344788985 |
200 | 10 | 1 | l1 | 0.240344788985 |
200 | 100 | 0.1 | l2 | 0.240344788985 |
200 | 100 | 0.1 | l1 | 0.240344788985 |
200 | 100 | 0.5 | l2 | 0.759655211015 |
200 | 100 | 0.5 | l1 | 0.240344788985 |
200 | 100 | 1 | l2 | 0.759655211015 |
200 | 100 | 1 | l1 | 0.240344788985 |
The best parameter is shown in the following table.
Iterations | regParam | Step size | regType | Hamming loss (train) |
---|---|---|---|---|
200 | 0.01 | 1 | l2 | 0.16791671331 |
Hamming loss on both training and test data for logistic regression with the best parameters is shown in the following table.
Model | Training set | Test set |
---|---|---|
LR | 0.1660 | 0.1787 |