Spark classification models.

System and experiment settings

  • Spark runs on a cluster of 1 master node and 14 slave nodes. Each node is a workstation with 16 x E5540@2.53GHz CPU and 32 GB of memory.

  • The dataset used in the experiments of this post is the well-known a6a dataset from the LIBSVM website.

  • The file is in libsvm format, a sparse feature representation that can be loaded directly by a Spark Python function.

  • To train a classification model and test its performance, we draw examples uniformly at random from the original dataset, forming a training set with 80% of the examples and a test set with the remaining 20%.

  • The statistics of the dataset are shown in the following table.

    |Category|Size|
    |:---|---:|
    |All|11220|
    |Training|8977|
    |Test|2243|
    |Features|123|

  • The Spark Python code in this post can also be applied to other machine learning problems and datasets, provided the data file is in libsvm format. Otherwise, a custom loading function has to be implemented.
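
To make the sparse libsvm representation concrete, here is a minimal pure-Python sketch of how one line of such a file can be read. The helper name `parse_libsvm_line` and the sample line are made up for illustration, and this is not Spark's loader. Each line holds a label followed by `index:value` pairs for the non-zero features only; indices in the file are 1-based, and the sketch shifts them to 0-based.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {feature_index: value}).

    Hypothetical helper for illustration only. Indices in the file are
    1-based and are shifted to 0-based here.
    """
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)
    return label, features


# A made-up line with three non-zero features out of 123.
label, features = parse_libsvm_line("-1 3:1 11:1 120:1")
print(label, sorted(features.items()))
```

Only the three listed features are stored; the remaining 120 are implicitly zero, which is what keeps the file compact.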

Summary of results

  • In this section, I present an overview of the results achieved by the different classification models provided by the Spark Python framework.

  • We use the same training and test split for the different learning models, which in general is an 80%/20% random split.

  • Performance is measured by Hamming loss, computed on both the training set and the test set, as shown in the following table.

    |Model|Training set|Test set|
    |:---|---:|---:|
    |SVM|0.1650|0.1575|
    |LR|0.1660|0.1787|

  • The results suggest that on the a6a dataset SVM performs better than logistic regression. In particular, the classification accuracy of SVM on the test set is about 2 percentage points higher than that of logistic regression.
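
For single-label binary classification, Hamming loss is simply the fraction of misclassified examples, so accuracy is one minus the loss. The sketch below is a plain-Python illustration with a hypothetical `hamming_loss` helper and made-up toy labels; it then turns the test-set losses from the table above into accuracies to check the size of the gap.

```python
def hamming_loss(labels, predictions):
    """Fraction of examples whose predicted label differs from the true one."""
    pairs = list(zip(labels, predictions))
    mistakes = sum(1 for truth, guess in pairs if truth != guess)
    return mistakes / float(len(pairs))


# Toy labels, made up for illustration: 2 of the 4 predictions are wrong.
toy_loss = hamming_loss([1, 0, 1, 1], [1, 1, 1, 0])

# Test-set losses copied from the summary table, converted to accuracy.
svm_test_accuracy = 1.0 - 0.1575
lr_test_accuracy = 1.0 - 0.1787
accuracy_gap = svm_test_accuracy - lr_test_accuracy
print(toy_loss, round(accuracy_gap, 4))
```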

Linear classification models

Two classification methods are discussed: support vector machines (SVM) and logistic regression (LR). The application context is single-label binary classification. Both can also be applied to single-label multiclass classification, which is not covered in this blog post.
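
As a rough illustration of how the two methods differ, the sketch below evaluates the loss each one places on a single example, assuming labels coded as -1/+1 and a linear score w·x. These are the textbook hinge and logistic losses; the function names are illustrative, and Spark's internal implementation is not reproduced here.

```python
import math


def hinge_loss(label, score):
    """SVM loss: zero once the example is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - label * score)


def logistic_loss(label, score):
    """Logistic regression loss: smooth and never exactly zero."""
    return math.log(1.0 + math.exp(-label * score))


# Scores w.x for a positive example (label = +1), chosen for illustration.
for score in (-1.0, 0.0, 2.0):
    print(score, hinge_loss(1, score), round(logistic_loss(1, score), 4))
```

The hinge loss ignores examples that are already classified with a comfortable margin, while the logistic loss keeps pushing every example a little further from the decision boundary.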

Load and save data files

  • loadLibSVMFile is the function that loads data from a file in libsvm format, a very popular file format for sparse feature representation.

  • In particular, load data from a file in libsvm format with the following command, which produces an RDD of Spark LabeledPoint objects.

      from pyspark.mllib.util import MLUtils
      parsedData = MLUtils.loadLibSVMFile(sc, "../Data/a6a")
  • Note that if you load the training and test datasets separately, the feature dimension of the two sets may differ because of the sparse representation used by the libsvm file format. It is therefore better to load the whole dataset and split it into training and test sets afterwards. If you have to load the training and test data separately, you can pass the feature dimension as one of the input arguments of the function.

  • saveAsLibSVMFile is the function to save data into a file in libsvm format.
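
The dimension mismatch mentioned above comes from the fact that a sparse file never lists zero-valued features, so a loader that is not told the dimension can only guess it from the largest index it sees. The following pure-Python sketch shows the effect on made-up lines; `infer_num_features` is a hypothetical helper, not a Spark function. In PySpark, passing the known dimension explicitly through the `numFeatures` argument of `loadLibSVMFile` (for example `numFeatures=123` for this dataset) should avoid the problem.

```python
def infer_num_features(lines):
    """Largest 1-based feature index seen in a list of libsvm-format lines."""
    largest = 0
    for line in lines:
        for token in line.split()[1:]:
            largest = max(largest, int(token.split(":")[0]))
    return largest


# Made-up lines: the last feature (index 123) only occurs in the test part.
train_lines = ["+1 5:1 40:1 122:1", "-1 3:1 17:1"]
test_lines = ["-1 8:1 123:1"]

train_dim = infer_num_features(train_lines)
test_dim = infer_num_features(test_lines)
full_dim = infer_num_features(train_lines + test_lines)
print(train_dim, test_dim, full_dim)
```

Loaded separately, the two parts would disagree on the dimension; loaded together, both inherit the dimension of the full file.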

Support vector machine SVM (code)

  • In general, the idea is to load a binary classification dataset in libsvm format from a file, separate it into training and test sets, perform parameter selection on the training data, and make predictions on the test data.
  • The complete Python code for running the following experiments with SVM can be found on my GitHub.
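
The "separate training and test" step of the outline above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code used in the experiments: `random_split` is a hypothetical helper that assigns each example independently to the training set with probability 0.8, which is my understanding of how `RDD.randomSplit([0.8, 0.2])` behaves in PySpark. With independent draws the split is only approximately 80%/20%.

```python
import random


def random_split(examples, train_fraction=0.8, seed=0):
    """Assign each example independently to train (prob. train_fraction) or test."""
    rng = random.Random(seed)
    train, test = [], []
    for example in examples:
        (train if rng.random() < train_fraction else test).append(example)
    return train, test


# Stand-in data: integers instead of labeled points, same size as the dataset.
train, test = random_split(range(11220))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when several models are meant to be compared on the same training and test sets.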

Run SVM with parameter selections

  • The training function, SVMWithSGD.train, takes the following parameters:

    • data: the training data, an RDD of LabeledPoint.
    • iterations: the number of iterations (default: 100).
    • step: the step parameter used in SGD (default: 1.0).
    • regParam: the regularizer parameter (default: 0.01).
    • miniBatchFraction: the fraction of data to be used for each SGD iteration (default: 1.0).
    • initialWeights: the initial weights (default: None).
    • regType: the type of regularizer, l2 or l1.
  • It is worth checking the documentation of the function, because Spark changes rapidly and different versions may not be compatible with each other. For example, my Spark version is 1.4.1, so I check the documentation of the function for that version.

  • The following code performs parameter selection (grid search) for SVM on the training data.

    from __future__ import print_function
    import itertools
    from pyspark.mllib.classification import SVMWithSGD

    # candidate values for the grid search
    numIterValList = [100,200]
    regParamValList = [0.01,0.1,1,10,100]
    stepSizeValList = [0.1,0.5,1]
    regTypeValList = ['l2','l1']
    
    # variables for the best parameters found so far
    bestNumIterVal = 0
    bestRegParamVal = 0
    bestStepSizeVal = 0
    bestRegTypeVal = 0
    bestTrainErr = 100
    
    # trainSize is assumed to hold the number of training examples, e.g. trainingData.count()
    for numIterVal,regParamVal,stepSizeVal,regTypeVal in itertools.product(numIterValList,regParamValList,stepSizeValList,regTypeValList):
      # train an SVM model with the current parameter setting
      model = SVMWithSGD.train(trainingData, iterations=numIterVal, regParam=regParamVal, step=stepSizeVal, regType=regTypeVal)
      # Hamming loss of the current model on the training data
      labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
      trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(trainSize)
      if trainErr<bestTrainErr:
        bestNumIterVal = numIterVal
        bestRegParamVal = regParamVal
        bestStepSizeVal = stepSizeVal
        bestRegTypeVal = regTypeVal
        bestTrainErr = trainErr
    print(bestNumIterVal,bestRegParamVal,bestStepSizeVal,bestRegTypeVal,bestTrainErr)
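
The structure of the grid search can be tried out without a cluster. The sketch below enumerates the same four value lists with `itertools.product` and picks the setting with the lowest error; `select_best` and `toy_error` are hypothetical names, and `toy_error` is an arbitrary stand-in for "train a model and measure its training error", not a real result.

```python
import itertools

num_iter_vals = [100, 200]
reg_param_vals = [0.01, 0.1, 1, 10, 100]
step_size_vals = [0.1, 0.5, 1]
reg_type_vals = ["l2", "l1"]

# Every combination of the four lists: 2 * 5 * 3 * 2 settings.
grid = list(itertools.product(num_iter_vals, reg_param_vals,
                              step_size_vals, reg_type_vals))


def select_best(settings, evaluate):
    """Return (error, setting) with the lowest error; ties keep the first one."""
    return min(((evaluate(*s), s) for s in settings), key=lambda pair: pair[0])


def toy_error(num_iter, reg_param, step_size, reg_type):
    """Arbitrary stand-in for training a model and measuring its error."""
    return abs(reg_param - 0.1) + abs(step_size - 0.5)


best_error, best_setting = select_best(grid, toy_error)
print(len(grid), best_setting, best_error)
```

The grid has 60 settings, which is why the result tables in this post have 60 rows. Like the strict `<` comparison in the loop above, `min` keeps the first setting when several share the lowest error.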

Model test

  • Test the performance of the model on both the training data and the test data with the following code block.

    # Retrain with the best parameters, since `model` otherwise holds the last model of the grid search
    model = SVMWithSGD.train(trainingData, iterations=bestNumIterVal, regParam=bestRegParamVal, step=bestStepSizeVal, regType=bestRegTypeVal)
    
    # Evaluating the model on training data
    labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(trainSize)
    print(trainErr)
    
    # Evaluating the model on test data
    labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
    testErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(testSize)
    print(testErr)

Experimental results

  • The result of parameter selection is shown in the following table.

    |Iterations|regParam|Learning rate|Norm|Hamming loss|
    |:---|:---|:---|:---|---:|
    |100|0.01|0.1|l2|0.237813720022|
    |100|0.01|0.1|l1|0.237813720022|
    |100|0.01|0.5|l2|0.186279977691|
    |100|0.01|0.5|l1|0.225432236475|
    |100|0.01|1|l2|0.16419408812|
    |100|0.01|1|l1|0.180479643056|
    |100|0.1|0.1|l2|0.237813720022|
    |100|0.1|0.1|l1|0.237813720022|
    |100|0.1|0.5|l2|0.224316787507|
    |100|0.1|0.5|l1|0.237813720022|
    |100|0.1|1|l2|0.206804238706|
    |100|0.1|1|l1|0.237813720022|
    |100|1|0.1|l2|0.237813720022|
    |100|1|0.1|l1|0.237813720022|
    |100|1|0.5|l2|0.237813720022|
    |100|1|0.5|l1|0.237813720022|
    |100|1|1|l2|0.237813720022|
    |100|1|1|l1|0.237813720022|
    |100|10|0.1|l2|0.237813720022|
    |100|10|0.1|l1|0.237813720022|
    |100|10|0.5|l2|0.237813720022|
    |100|10|0.5|l1|0.237813720022|
    |100|10|1|l2|0.237813720022|
    |100|10|1|l1|0.237813720022|
    |100|100|0.1|l2|0.237813720022|
    |100|100|0.1|l1|0.237813720022|
    |100|100|0.5|l2|0.762186279978|
    |100|100|0.5|l1|0.237813720022|
    |100|100|1|l2|0.762186279978|
    |100|100|1|l1|0.237813720022|
    |200|0.01|0.1|l2|0.237813720022|
    |200|0.01|0.1|l1|0.237813720022|
    |200|0.01|0.5|l2|0.167540435025|
    |200|0.01|0.5|l1|0.21226993865|
    |200|0.01|1|l2|0.162744004462|
    |200|0.01|1|l1|0.169659788065|
    |200|0.1|0.1|l2|0.237813720022|
    |200|0.1|0.1|l1|0.237813720022|
    |200|0.1|0.5|l2|0.215839375349|
    |200|0.1|0.5|l1|0.237813720022|
    |200|0.1|1|l2|0.204350250976|
    |200|0.1|1|l1|0.237813720022|
    |200|1|0.1|l2|0.237813720022|
    |200|1|0.1|l1|0.237813720022|
    |200|1|0.5|l2|0.237813720022|
    |200|1|0.5|l1|0.237813720022|
    |200|1|1|l2|0.237813720022|
    |200|1|1|l1|0.237813720022|
    |200|10|0.1|l2|0.237813720022|
    |200|10|0.1|l1|0.237813720022|
    |200|10|0.5|l2|0.237813720022|
    |200|10|0.5|l1|0.237813720022|
    |200|10|1|l2|0.237813720022|
    |200|10|1|l1|0.237813720022|
    |200|100|0.1|l2|0.237813720022|
    |200|100|0.1|l1|0.237813720022|
    |200|100|0.5|l2|0.762186279978|
    |200|100|0.5|l1|0.237813720022|
    |200|100|1|l2|0.762186279978|
    |200|100|1|l1|0.237813720022|

  • The best parameter setting is shown in the following table.

    |Iterations|regParam|Learning rate|Norm|Hamming loss|
    |:---|:---|:---|:---|---:|
    |200|0.01|1|l2|0.162744004462|

  • Hamming loss on both training and test data from SVM with the best parameter is shown in the following table.

    |Model|Training set|Test set|
    |:---|---:|---:|
    |SVM|0.1650|0.1575|
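
Many rows of the grid-search table share the value 0.2378, and a few show 0.7622. One plausible reading (an interpretation, not something the experiment checks) is that these models predict the same class for every example: a constant predictor's Hamming loss equals the fraction of examples in the other class, and the two plateau values add up to one. The sketch below illustrates this with made-up labels.

```python
# Made-up labels for illustration: 1 positive example out of 4.
labels = [1, 0, 0, 0]
positive_fraction = sum(labels) / float(len(labels))

# A constant predictor is wrong exactly on the class it never predicts.
loss_always_negative = positive_fraction        # every positive is missed
loss_always_positive = 1.0 - positive_fraction  # every negative is missed

# The two plateau values copied from the SVM grid-search table.
plateau_low = 0.237813720022
plateau_high = 0.762186279978
plateau_sum = plateau_low + plateau_high
print(loss_always_negative, loss_always_positive, plateau_sum)
```

If this reading is right, only the settings with a loss below the lower plateau have learned anything beyond the class prior, which narrows the useful part of the grid to small values of the regularization parameter.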

Logistic regression LR (code)

  • In general, the idea is to load a binary classification dataset in libsvm format from a file, separate it into training and test sets, perform parameter selection on the training data, and make predictions on the test data.
  • The complete Python code for running the following experiments with logistic regression can be found on my GitHub.

Run LR for parameter selection

  • The Python code that runs the parameter selection procedure for logistic regression is shown in the following code block.

    from __future__ import print_function
    import itertools
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    def lr(trainingData,testData,trainingSize,testSize):
      '''
      linear lr classifier
      '''
      # candidate values for the grid search
      numIterValList = [100,200]
      regParamValList = [0.01,0.1,1,10,100]
      stepSizeValList = [0.1,0.5,1]
      regTypeValList = ['l2','l1']
      
      # variables for the best parameters found so far
      bestNumIterVal = 200
      bestRegParamVal = 0.01
      bestStepSizeVal = 1
      bestRegTypeVal = 'l2'
      bestTrainErr = 100
      
      for numIterVal,regParamVal,stepSizeVal,regTypeVal in itertools.product(numIterValList,regParamValList,stepSizeValList,regTypeValList):
        # train an lr model with the current parameter setting
        model = LogisticRegressionWithSGD.train(trainingData, iterations=numIterVal, regParam=regParamVal, step=stepSizeVal, regType=regTypeVal)
        # Hamming loss of the current model on the training data
        labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
        trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(trainingSize)
        if trainErr<bestTrainErr:
          bestNumIterVal = numIterVal
          bestRegParamVal = regParamVal
          bestStepSizeVal = stepSizeVal
          bestRegTypeVal = regTypeVal
          bestTrainErr = trainErr
        print(numIterVal,regParamVal,stepSizeVal,regTypeVal,trainErr)
      print(bestNumIterVal,bestRegParamVal,bestStepSizeVal,bestRegTypeVal,bestTrainErr)

  • The performance of the best model can be computed on the training and test datasets with the following code.

    # This fragment continues the body of lr(), after the grid search
    # Retrain with the best parameters
    model = LogisticRegressionWithSGD.train(trainingData, iterations=bestNumIterVal, regParam=bestRegParamVal, step=bestStepSizeVal, regType=bestRegTypeVal)
    
    # Evaluating the model on training data
    labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(trainingSize)
    print(trainErr)
    
    # Evaluating the model on test data
    labelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
    testErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(testSize)
    print(testErr)
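
The call `model.predict(p.features)` in the code above returns a hard class label. Conceptually, a logistic regression model turns the linear score into a probability with the sigmoid function and then thresholds it. The sketch below assumes a threshold of 0.5 and uses made-up weights; `sigmoid` and `predict_label` are illustrative names and do not correspond to Spark API calls.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(weights, intercept, features, threshold=0.5):
    """Predict 1 or 0 for a sparse example given as {feature_index: value}."""
    score = intercept + sum(weights[i] * v for i, v in features.items())
    return 1 if sigmoid(score) > threshold else 0


# Made-up weights for a three-feature model.
weights = [0.8, -1.2, 0.3]
first = predict_label(weights, 0.0, {0: 1.0, 2: 1.0})  # score 1.1
second = predict_label(weights, 0.0, {1: 1.0})         # score -1.2
print(first, second)
```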

Experimental results

  • The result of parameter selection for logistic regression is shown in the following table.

    |Iterations|regParam|Learning rate|Norm|Hamming loss|
    |:---|:---|:---|:---|---:|
    |100|0.01|0.1|l2|0.240232844509|
    |100|0.01|0.1|l1|0.240344788985|
    |100|0.01|0.5|l2|0.18246949513|
    |100|0.01|0.5|l1|0.205306168141|
    |100|0.01|1|l2|0.17194671443|
    |100|0.01|1|l1|0.179782827717|
    |100|0.1|0.1|l2|0.240232844509|
    |100|0.1|0.1|l1|0.240344788985|
    |100|0.1|0.5|l2|0.205306168141|
    |100|0.1|0.5|l1|0.240344788985|
    |100|0.1|1|l2|0.191536997649|
    |100|0.1|1|l1|0.240344788985|
    |100|1|0.1|l2|0.240344788985|
    |100|1|0.1|l1|0.240344788985|
    |100|1|0.5|l2|0.240344788985|
    |100|1|0.5|l1|0.240344788985|
    |100|1|1|l2|0.240344788985|
    |100|1|1|l1|0.240344788985|
    |100|10|0.1|l2|0.240344788985|
    |100|10|0.1|l1|0.240344788985|
    |100|10|0.5|l2|0.240344788985|
    |100|10|0.5|l1|0.240344788985|
    |100|10|1|l2|0.240344788985|
    |100|10|1|l1|0.240344788985|
    |100|100|0.1|l2|0.240344788985|
    |100|100|0.1|l1|0.240344788985|
    |100|100|0.5|l2|0.759655211015|
    |100|100|0.5|l1|0.240344788985|
    |100|100|1|l2|0.759655211015|
    |100|100|1|l1|0.240344788985|
    |200|0.01|0.1|l2|0.239785066607|
    |200|0.01|0.1|l1|0.240232844509|
    |200|0.01|0.5|l2|0.174857270794|
    |200|0.01|0.5|l1|0.188850330236|
    |200|0.01|1|l2|0.16791671331|
    |200|0.01|1|l1|0.173625881563|
    |200|0.1|0.1|l2|0.239785066607|
    |200|0.1|0.1|l1|0.240344788985|
    |200|0.1|0.5|l2|0.195678943244|
    |200|0.1|0.5|l1|0.240344788985|
    |200|0.1|1|l2|0.190417552894|
    |200|0.1|1|l1|0.240344788985|
    |200|1|0.1|l2|0.240344788985|
    |200|1|0.1|l1|0.240344788985|
    |200|1|0.5|l2|0.240344788985|
    |200|1|0.5|l1|0.240344788985|
    |200|1|1|l2|0.240344788985|
    |200|1|1|l1|0.240344788985|
    |200|10|0.1|l2|0.240344788985|
    |200|10|0.1|l1|0.240344788985|
    |200|10|0.5|l2|0.240344788985|
    |200|10|0.5|l1|0.240344788985|
    |200|10|1|l2|0.240344788985|
    |200|10|1|l1|0.240344788985|
    |200|100|0.1|l2|0.240344788985|
    |200|100|0.1|l1|0.240344788985|
    |200|100|0.5|l2|0.759655211015|
    |200|100|0.5|l1|0.240344788985|
    |200|100|1|l2|0.759655211015|
    |200|100|1|l1|0.240344788985|

  • The best parameter setting is shown in the following table.

    |Iterations|regParam|Learning rate|Norm|Hamming loss|
    |:---|:---|:---|:---|---:|
    |200|0.01|1|l2|0.16791671331|

  • Hamming loss on both training and test data from logistic regression with the best parameter is shown in the following table.

    |Model|Training set|Test set|
    |:---|---:|---:|
    |LR|0.1660|0.1787|

External reading materials