Spark regression models

System and experiment settings

  • Spark runs on a cluster of 1 master node and 14 slave nodes. Each node is a workstation with 16 E5540 @ 2.53GHz CPU cores and 32GB of memory.

  • In this blog post, several regression models will be presented, including

    • least square regression
    • lasso regression
    • ridge regression
    • decision tree regression
    • random forest regression
  • If you are also interested in classification models provided in Spark, I have another post about Spark classification models.

  • Datasets used in the following regression experiments include

    • the medium-sized cadata dataset, available from the LibSVM website.
    • the larger YearPredictionMSD dataset, also available from the LibSVM website.
  • In particular, both data files are in libsvm format, a sparse feature representation which can be naturally handled/loaded by a Spark Python function.

  • In order to train a regression model and test its performance, we split the original dataset into a training set and a test set. More specifically, we sample 80% of the examples uniformly at random to form a training set for learning a regression model, and use the remaining 20% of the examples as a test set for evaluating the constructed model.
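    The post does not show the splitting code itself; in Spark one would typically call the RDD randomSplit method, which samples each example independently, so the actual split sizes vary slightly (as in the tables below). A hypothetical pure-Python sketch of an exact 80/20 uniform split:

    ```python
    import random

    def train_test_split(examples, train_fraction=0.8, seed=0):
        # shuffle a copy of the examples and cut at the 80% mark
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    training, test = train_test_split(range(20640))
    print(len(training), len(test))  # 16512 4128
    ```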

  • The statistics of the cadata dataset are shown in the following table.

    Category | Size
    All | 20640
    Training | 16505
    Test | 4135
    Feature | 8
  • The statistics of the YearPredictionMSD dataset are shown in the following table.

    Category | Size
    All | 463715
    Training | 371065
    Test | 92650
    Feature | 90
  • It is worth noting that the following Spark Python code can also be applied to other machine learning problems/datasets, as long as the data file is in libsvm format. Otherwise, a new data loading function is needed.

Summary of results

  • In this section, I present an overview of the results achieved by the different regression models provided in the Spark Python framework.
  • The same sampling strategy is used for every regression model to split the original dataset into training and test sets: we sample 80% of the examples to form a training set and use the remaining 20% as a test set.
  • The performance of the regression models is measured in terms of root mean square error (RMSE) on both training and test sets.

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i-w^Tx_i)^2}$$
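    As a pure-Python illustration of this formula (a hypothetical helper, separate from the Spark code used later):

    ```python
    import math

    def rmse(labels, predictions):
        # RMSE = square root of the mean of squared errors
        n = len(labels)
        return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)

    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
    ```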

  • The performance in RMSE on the cadata dataset is shown in the following table.

    Model | RMSE on training set | RMSE on test set
    Least square | 157863.57 | 154816.97
    Lasso | 157841.41 | 155106.52
    Ridge regression | 157846.79 | 155111.65
    Decision tree regression | 810.30 | 71149.69
    Random forest regression | 19943.03 | 49763.60
  • The results show that on the cadata dataset the decision tree and random forest regressors achieve better RMSE than least square regression, lasso, and ridge regression.

  • The decision tree regressor appears to overfit the training data, while the random forest regressor achieves the best performance on the test set.

  • The performance in RMSE on the YearPredictionMSD dataset is shown in the following table.

    Model | RMSE on training set | RMSE on test set
    Least square | n/a | n/a
    Lasso | n/a | n/a
    Ridge regression | n/a | n/a
    Decision tree regression | 0.70 | 13.38
    Random forest regression | 3.74 | 9.32
  • The results suggest that on the YearPredictionMSD dataset the decision tree and random forest regressors likewise achieve strong RMSE; results for the linear models are not reported here.

  • The decision tree regressor again appears to overfit the training data, while the random forest regressor achieves the best performance on the test set.

Linear regression models

Three linear regression models are covered in this blog post: least square, ridge regression, and lasso. The application context is the single-label regression problem. Since regression is closely related to classification, I would also recommend my blog post about running classification models on Spark.

Load and save data files

  • loadLibSVMFile is the function for loading data from a file in libsvm format, a very popular file format for sparse feature representation.

  • In particular, load data from a file in libsvm format with the following command. This command generates an RDD of Spark LabeledPoint data structures.

      from pyspark.mllib.util import MLUtils
      
      parsedData = MLUtils.loadLibSVMFile(sc, "../Data/cadata")
  • saveAsLibSVMFile is the corresponding function for saving data into a file in libsvm format; it will not be covered in this post.
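    For readers unfamiliar with the format: each libsvm line stores a label followed by sparse index:value pairs with one-based indices (Spark converts them to zero-based vector positions internally). A hypothetical pure-Python parser for a single line, mirroring the per-line work that loadLibSVMFile performs across the cluster:

    ```python
    def parse_libsvm_line(line):
        # "label index:value index:value ..." -> (label, {index: value}), 1-based indices
        parts = line.split()
        label = float(parts[0])
        features = {}
        for item in parts[1:]:
            index, value = item.split(":")
            features[int(index)] = float(value)
        return label, features

    print(parse_libsvm_line("4.5 1:2.0 3:-0.5"))  # (4.5, {1: 2.0, 3: -0.5})
    ```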

Least square (code)

  • Least square is a linear model which fits a linear function to the training data by minimizing the so-called mean squared error (MSE).

  • The optimization problem of least square is shown as follows

    $$\underset{w}{\min}\; \frac{1}{n}\sum_{i}(y_i-w^Tx_i)^2$$

  • The idea of the following Python script is to load a single-label regression dataset from a file in libsvm format, split the original dataset into training and test subsets, perform model training and parameter selection on the training set, and then test the performance by predicting the values of the test examples.

  • The complete Python code for running the following experiments with least square can be found on my GitHub.

  • It is worth noting that the learning rate parameter of the stochastic gradient descent optimization sometimes needs to be tuned carefully. Otherwise, the model might not be well constructed and may return NaN as predictions.
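  • This sensitivity to the step size can be reproduced with a tiny pure-Python gradient descent on one feature (an illustrative sketch, unrelated to Spark's implementation): a small step converges, while a large step makes the weight oscillate and blow up, which, once the float overflows, is where the nan entries in the tables below come from.

    ```python
    def sgd_weight(step, iterations=100):
        # fit y = w * x on the single example (x=100, y=200) with gradient descent on MSE
        x, y, w = 100.0, 200.0, 0.0
        for _ in range(iterations):
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= step * grad
        return w

    print(sgd_weight(1e-5))  # converges toward the true weight 2.0
    print(sgd_weight(1e-3))  # the weight oscillates and explodes in magnitude
    ```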

Run least square with parameter selections

  • The following code performs a parameter selection (grid search) of least square on training data.

    # candidate parameter values for the grid search
    numIterValList = [1000,3000,5000]
    stepSizeValList = [1e-11,1e-9,1e-7,1e-5]
    
    # variables holding the best parameters found so far
    bestNumIterVal = 200
    bestStepSizeVal = 1
    bestTrainingRMSE = 1e10
    
    # least square uses no regularization
    regParamVal = 0.0
    regTypeVal = None
    
    for numIterVal,stepSizeVal in itertools.product(numIterValList,stepSizeValList):
      model = LinearRegressionWithSGD.train(trainingData, iterations=numIterVal, step=stepSizeVal, regParam=regParamVal, regType=regTypeVal)
      valsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
      trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
      # a diverged model yields NaN, which must never be kept as the best
      if not math.isnan(trainingRMSE) and trainingRMSE < bestTrainingRMSE:
        bestNumIterVal = numIterVal
        bestStepSizeVal = stepSizeVal
        bestTrainingRMSE = trainingRMSE
      print(numIterVal, stepSizeVal, trainingRMSE)
    print(bestNumIterVal, bestStepSizeVal, bestTrainingRMSE)
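    The select-the-best-over-a-grid pattern above can be reduced to a small pure-Python sketch (the score function here is a made-up stand-in for the training RMSE), showing why a NaN score must be filtered out explicitly:

    ```python
    import itertools
    import math

    def score(num_iter, step):
        # made-up stand-in for a training RMSE; overly large steps "diverge" to NaN
        if step > 1e-6:
            return float("nan")
        return 1.0 / (num_iter * step)

    best = None
    for num_iter, step in itertools.product([1000, 3000, 5000], [1e-11, 1e-9, 1e-7, 1e-5]):
        rmse = score(num_iter, step)
        # NaN fails every '<' comparison, so it can never win, but it still must
        # not be reported as a valid result
        if not math.isnan(rmse) and (best is None or rmse < best[0]):
            best = (rmse, num_iter, step)
    print(best)
    ```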

Model test

  • Test the performance of the model on both the training and test data with the following code.

    model = LinearRegressionWithSGD.train(trainingData, iterations=bestNumIterVal, step=bestStepSizeVal, regParam=regParamVal, regType=regTypeVal)
    
    # evaluating the model on training data
    valsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
    print(trainingRMSE)
    
    # evaluating the model on test data
    valsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
    testRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / testSize)
    print(testRMSE)

Experimental results

  • The results of parameter selection are shown in the following table.

    Iteration | Learning rate | RMSE
    1000 | 1e-11 | 235954.448184
    1000 | 1e-09 | 178563.914495
    1000 | 1e-07 | 162352.994777
    1000 | 1e-05 | nan
    3000 | 1e-11 | 235106.111824
    3000 | 1e-09 | 169423.475736
    3000 | 1e-07 | 159639.878893
    3000 | 1e-05 | nan
    5000 | 1e-11 | 234527.296389
    5000 | 1e-09 | 167563.04618
    5000 | 1e-07 | 157863.568992
    5000 | 1e-05 | nan
  • The best parameter setting is shown in the following table.

    Iteration | Learning rate | RMSE
    5000 | 1e-07 | 157863.568992
  • Root mean square errors (RMSE) on both training and test sets from least square with the best parameters are shown in the following table.

    Model | Training set | Test set
    Least square | 157863.568992 | 154816.967311

Lasso and ridge regression (code)

  • Lasso is similar to least square regression but with L1 norm regularization.

  • In particular, the optimization problem of Lasso is shown as follows

    $$\underset{w}{\min}\; \frac{1}{n}\sum_{i}(y_i-w^Tx_i)^2 + \lambda||w||_1,$$

    where $$||w||_1$$ is the L1 norm of the feature weight vector $$w$$. L1 norm regularization encourages a sparse solution of the feature weight vector $$w$$.

  • Ridge regression is also similar to least square regression but with L2 norm regularization.

  • In particular, the optimization problem of ridge regression is shown as follows

    $$\underset{w}{\min}\; \frac{1}{n}\sum_{i}(y_i-w^Tx_i)^2 + \frac{\lambda}{2}||w||_2^2,$$

    where $$||w||_2$$ is the L2 norm of the feature weight vector $$w$$. L2 norm regularization leads to a smooth solution with small feature weights.

  • The following sections describe Python code for Lasso and ridge regression implemented with the function LinearRegressionWithSGD by switching the regType parameter. Alternatively, the dedicated function LassoWithSGD is also available in Spark.
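  • The practical difference between the two penalties shows up in their per-weight update, viewed through the proximal operator (an illustrative sketch, not necessarily Spark's exact updater): L1 soft-thresholding drives small weights exactly to zero, while L2 only rescales them toward zero.

    ```python
    def l1_prox(w, lam):
        # soft-thresholding: the proximal step for the L1 penalty lam * |w|
        if w > lam:
            return w - lam
        if w < -lam:
            return w + lam
        return 0.0

    def l2_shrink(w, lam):
        # the proximal step for the L2 penalty (lam/2) * w^2 just rescales w
        return w / (1.0 + lam)

    weights = [0.05, -0.3, 2.0]
    print([l1_prox(w, 0.1) for w in weights])   # the small weight is zeroed exactly
    print([l2_shrink(w, 0.1) for w in weights]) # every weight is merely shrunk
    ```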

Run Lasso/Ridge with parameter selections

  • The following code performs a parameter selection (grid search) of Lasso on the training data. For ridge regression, simply set regTypeVal = 'l2'.

    # candidate parameter values for the grid search
    numIterValList = [1000,3000,5000]
    stepSizeValList = [1e-11,1e-9,1e-7,1e-5]
    regParamValList = [0.01,0.1,1,10,100]
    
    # variables holding the best parameters found so far
    bestNumIterVal = 200
    bestStepSizeVal = 1
    bestTrainingRMSE = 1e10
    bestRegParamVal = 0.0
    
    regTypeVal = 'l1'
    
    for numIterVal,stepSizeVal,regParamVal in itertools.product(numIterValList,stepSizeValList,regParamValList):
      model = LinearRegressionWithSGD.train(trainingData, iterations=numIterVal, step=stepSizeVal, regParam=regParamVal, regType=regTypeVal)
      valsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
      trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
      # a diverged model yields NaN, which must never be kept as the best
      if not math.isnan(trainingRMSE) and trainingRMSE < bestTrainingRMSE:
        bestNumIterVal = numIterVal
        bestStepSizeVal = stepSizeVal
        bestRegParamVal = regParamVal
        bestTrainingRMSE = trainingRMSE
      print(numIterVal, stepSizeVal, regParamVal, trainingRMSE)
    print(bestNumIterVal, bestStepSizeVal, bestRegParamVal, bestTrainingRMSE)

Model test

  • I use the following code to test the performance of the constructed model on both the training and test sets. The performance is measured by root mean square error (RMSE).

    model = LinearRegressionWithSGD.train(trainingData, iterations=bestNumIterVal, step=bestStepSizeVal, regParam=bestRegParamVal, regType=regTypeVal)
    # evaluating the model on training data
    valsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
    trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
    print(trainingRMSE)
    # evaluating the model on test data
    valsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
    testRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / testSize)
    print(testRMSE)

Experimental results for Lasso

  • Experimental results of parameter selection for Lasso are shown in the following table.

    Iteration | Learning rate | Regularization | RMSE
    1000 | 1e-11 | 0.01 | 236203.490014
    1000 | 1e-11 | 0.1 | 236203.490014
    1000 | 1e-11 | 1 | 236203.490017
    1000 | 1e-11 | 10 | 236203.490043
    1000 | 1e-11 | 100 | 236203.490306
    1000 | 1e-09 | 0.01 | 178516.068253
    1000 | 1e-09 | 0.1 | 178516.068262
    1000 | 1e-09 | 1 | 178516.068352
    1000 | 1e-09 | 10 | 178516.069244
    1000 | 1e-09 | 100 | 178516.07817
    1000 | 1e-07 | 0.01 | 162300.327801
    1000 | 1e-07 | 0.1 | 162300.327855
    1000 | 1e-07 | 1 | 162300.328397
    1000 | 1e-07 | 10 | 162300.333816
    1000 | 1e-07 | 100 | 162300.388008
    1000 | 1e-05 | 0.01 | nan
    1000 | 1e-05 | 0.1 | nan
    1000 | 1e-05 | 1 | nan
    1000 | 1e-05 | 10 | nan
    1000 | 1e-05 | 100 | nan
    3000 | 1e-11 | 0.01 | 235349.196217
    3000 | 1e-11 | 0.1 | 235349.196218
    3000 | 1e-11 | 1 | 235349.196222
    3000 | 1e-11 | 10 | 235349.196268
    3000 | 1e-11 | 100 | 235349.196724
    3000 | 1e-09 | 0.01 | 169380.476224
    3000 | 1e-09 | 0.1 | 169380.476231
    3000 | 1e-09 | 1 | 169380.476296
    3000 | 1e-09 | 10 | 169380.476946
    3000 | 1e-09 | 100 | 169380.483446
    3000 | 1e-07 | 0.01 | 159605.030202
    3000 | 1e-07 | 0.1 | 159605.030294
    3000 | 1e-07 | 1 | 159605.031219
    3000 | 1e-07 | 10 | 159605.040461
    3000 | 1e-07 | 100 | 159605.132881
    3000 | 1e-05 | 0.01 | nan
    3000 | 1e-05 | 0.1 | nan
    3000 | 1e-05 | 1 | nan
    3000 | 1e-05 | 10 | nan
    3000 | 1e-05 | 100 | nan
    5000 | 1e-11 | 0.01 | 234766.331363
    5000 | 1e-11 | 0.1 | 234766.331364
    5000 | 1e-11 | 1 | 234766.331369
    5000 | 1e-11 | 10 | 234766.331428
    5000 | 1e-11 | 100 | 234766.332016
    5000 | 1e-09 | 0.01 | 167529.15979
    5000 | 1e-09 | 0.1 | 167529.159795
    5000 | 1e-09 | 1 | 167529.159844
    5000 | 1e-09 | 10 | 167529.160328
    5000 | 1e-09 | 100 | 167529.165176
    5000 | 1e-07 | 0.01 | 157841.276486
    5000 | 1e-07 | 0.1 | 157841.276602
    5000 | 1e-07 | 1 | 157841.277759
    5000 | 1e-07 | 10 | 157841.289332
    5000 | 1e-07 | 100 | 157841.405059
    5000 | 1e-05 | 0.01 | nan
    5000 | 1e-05 | 0.1 | nan
    5000 | 1e-05 | 1 | nan
    5000 | 1e-05 | 10 | nan
    5000 | 1e-05 | 100 | nan

    A nan entry indicates that the model diverged because the SGD step size was too large.

  • It seems that the learning rate plays an important role in the performance of the model. With the learning rate fixed, the regularization parameter $$\lambda$$ only slightly affects the performance of the model.

  • The best parameter setting is shown in the following table.

    Iteration | Learning rate | Regularization | RMSE
    5000 | 1e-07 | 0.01 | 157841.276486
  • Root mean square errors (RMSE) on both training and test sets from Lasso with the best parameters are shown in the following table.

    Model | Training set | Test set
    Lasso | 157841.405059 | 155106.51828

Experimental results for ridge regression

  • Experimental results of parameter selection for ridge regression are shown in the following table.

    Iteration | Learning rate | Regularization | RMSE
    1000 | 1e-11 | 0.01 | 236203.490014
    1000 | 1e-11 | 0.1 | 236203.490014
    1000 | 1e-11 | 1 | 236203.490014
    1000 | 1e-11 | 10 | 236203.490017
    1000 | 1e-11 | 100 | 236203.490049
    1000 | 1e-09 | 0.01 | 178516.068262
    1000 | 1e-09 | 0.1 | 178516.068351
    1000 | 1e-09 | 1 | 178516.069235
    1000 | 1e-09 | 10 | 178516.078079
    1000 | 1e-09 | 100 | 178516.166519
    1000 | 1e-07 | 0.01 | 162300.327913
    1000 | 1e-07 | 0.1 | 162300.328979
    1000 | 1e-07 | 1 | 162300.339636
    1000 | 1e-07 | 10 | 162300.44621
    1000 | 1e-07 | 100 | 162301.51175
    1000 | 1e-05 | 0.01 | nan
    1000 | 1e-05 | 0.1 | nan
    1000 | 1e-05 | 1 | nan
    1000 | 1e-05 | 10 | nan
    1000 | 1e-05 | 100 | nan
    3000 | 1e-11 | 0.01 | 235349.196217
    3000 | 1e-11 | 0.1 | 235349.196217
    3000 | 1e-11 | 1 | 235349.196218
    3000 | 1e-11 | 10 | 235349.196228
    3000 | 1e-11 | 100 | 235349.196325
    3000 | 1e-09 | 0.01 | 169380.476234
    3000 | 1e-09 | 0.1 | 169380.476324
    3000 | 1e-09 | 1 | 169380.477232
    3000 | 1e-09 | 10 | 169380.486313
    3000 | 1e-09 | 100 | 169380.577123
    3000 | 1e-07 | 0.01 | 159605.030531
    3000 | 1e-07 | 0.1 | 159605.033588
    3000 | 1e-07 | 1 | 159605.064158
    3000 | 1e-07 | 10 | 159605.369845
    3000 | 1e-07 | 100 | 159608.425707
    3000 | 1e-05 | 0.01 | nan
    3000 | 1e-05 | 0.1 | nan
    3000 | 1e-05 | 1 | nan
    3000 | 1e-05 | 10 | nan
    3000 | 1e-05 | 100 | nan
    5000 | 1e-11 | 0.01 | 234766.331363
    5000 | 1e-11 | 0.1 | 234766.331363
    5000 | 1e-11 | 1 | 234766.331365
    5000 | 1e-11 | 10 | 234766.331381
    5000 | 1e-11 | 100 | 234766.331543
    5000 | 1e-09 | 0.01 | 167529.159798
    5000 | 1e-09 | 0.1 | 167529.159869
    5000 | 1e-09 | 1 | 167529.160586
    5000 | 1e-09 | 10 | 167529.167747
    5000 | 1e-09 | 100 | 167529.239367
    5000 | 1e-07 | 0.01 | 157841.277025
    5000 | 1e-07 | 0.1 | 157841.281988
    5000 | 1e-07 | 1 | 157841.331623
    5000 | 1e-07 | 10 | 157841.827952
    5000 | 1e-07 | 100 | 157846.789126
    5000 | 1e-05 | 0.01 | nan
    5000 | 1e-05 | 0.1 | nan
    5000 | 1e-05 | 1 | nan
    5000 | 1e-05 | 10 | nan
    5000 | 1e-05 | 100 | nan
  • The best parameter setting is shown in the following table.

    Iteration | Learning rate | Regularization | RMSE
    5000 | 1e-07 | 0.01 | 157841.277025
  • Root mean square errors (RMSE) on both training and test sets from ridge regression with the best parameters are shown in the following table.

    Model | Training set | Test set
    Ridge regression | 157846.789126 | 155111.648864

Decision tree regressor (code)

Experimental results

YearPredictionMSD dataset

  • Results of parameter selection for the decision tree regressor are shown in the following table.

    maxDepth | maxBins | RMSE
    10 | 16 | 9.43431370515
    10 | 24 | 9.41958875566
    10 | 32 | 9.41548620703
    20 | 16 | 4.35768086186
    20 | 24 | 4.4491520211
    20 | 32 | 4.36669679487
    30 | 16 | 0.698248656885
    30 | 24 | 0.79267138039
    30 | 32 | 0.867840147731
  • The performance of the decision tree regressor with the best parameters on the training and test sets is shown below.

    maxDepth | maxBins | Training RMSE | Test RMSE
    30 | 16 | 0.698248656885 | 13.3839477649

cadata dataset

  • Results of parameter selection for the decision tree regressor are shown in the following table.

    maxDepth | maxBins | RMSE
    10 | 16 | 55111.5032939
    10 | 24 | 51187.6835561
    10 | 32 | 49280.3022651
    20 | 16 | 10338.3244273
    20 | 24 | 8516.19589408
    20 | 32 | 7565.77286084
    30 | 16 | 6756.45379602
    30 | 24 | 3077.23346467
    30 | 32 | 810.303938008
  • The performance of the decision tree regressor with the best parameters on the training and test sets is shown below.

    maxDepth | maxBins | Training RMSE | Test RMSE
    30 | 32 | 810.303938008 | 71149.6926611

Coding details

  • The Python function for training a decision tree regressor is shown in the following code block. The complete Python script can be found on my GitHub page.

    def decisionTreeRegression(trainingData,testData,trainingSize,testSize):
      '''
      decision tree for regression
      '''
      # parameter range
      maxDepthValList = [10,20,30]
      maxBinsValList = [16,24,32]
      
      # best parameters
      bestMaxDepthVal = 10
      bestMaxBinsVal = 16
      bestTrainingRMSE = 1e10
      
      for maxDepthVal,maxBinsVal in itertools.product(maxDepthValList,maxBinsValList):
        model = DecisionTree.trainRegressor(trainingData,categoricalFeaturesInfo={},impurity='variance',maxDepth=maxDepthVal,maxBins=maxBinsVal)
        predictions = model.predict(trainingData.map(lambda x: x.features))
        valsAndPreds = trainingData.map(lambda x: x.label).zip(predictions)
        trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
        # skip NaN scores so they can never be kept as the best
        if not math.isnan(trainingRMSE) and trainingRMSE < bestTrainingRMSE:
          bestMaxDepthVal = maxDepthVal
          bestMaxBinsVal = maxBinsVal
          bestTrainingRMSE = trainingRMSE
        print(maxDepthVal, maxBinsVal, trainingRMSE)
      print(bestMaxDepthVal, bestMaxBinsVal, bestTrainingRMSE)
      
      # retrain with the best parameters
      model = DecisionTree.trainRegressor(trainingData,categoricalFeaturesInfo={},impurity='variance',maxDepth=bestMaxDepthVal,maxBins=bestMaxBinsVal)
      
      # evaluating the model on training data
      predictions = model.predict(trainingData.map(lambda x: x.features))
      valsAndPreds = trainingData.map(lambda x: x.label).zip(predictions)
      trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
      print(trainingRMSE)
      
      # evaluating the model on test data
      predictions = model.predict(testData.map(lambda x: x.features))
      valsAndPreds = testData.map(lambda x: x.label).zip(predictions)
      testRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / testSize)
      print(testRMSE)
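    To make the impurity='variance' setting concrete, here is a hypothetical pure-Python regression stump (a depth-1 tree) that picks the split threshold minimizing the summed squared deviation of the two sides; this is the criterion a variance-based regressor optimizes at each node (a sketch of the idea, not Spark's implementation):

    ```python
    def variance_sum(ys):
        # total squared deviation from the mean (n times the variance)
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    def best_stump_split(xs, ys):
        # try every observed threshold; keep the one minimizing left + right impurity
        best = (float("inf"), None)
        for threshold in sorted(set(xs)):
            left = [y for x, y in zip(xs, ys) if x <= threshold]
            right = [y for x, y in zip(xs, ys) if x > threshold]
            score = variance_sum(left) + variance_sum(right)
            if score < best[0]:
                best = (score, threshold)
        return best[1]

    xs = [1, 2, 3, 10, 11, 12]
    ys = [5.0, 5.1, 4.9, 20.0, 20.2, 19.8]
    print(best_stump_split(xs, ys))  # -> 3, separating the low cluster from the high one
    ```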

Random forest regressor (code)

Experimental results

YearPredictionMSD dataset

  • Results of parameter selection for the random forest regressor are shown in the following table.

    maxDepth | maxBins | numTrees | RMSE
    10 | 16 | 10 | 9.25951426448
    10 | 16 | 20 | 9.22474837657
    10 | 16 | 30 | 9.23054126678
    10 | 24 | 10 | 9.23919432668
    10 | 24 | 20 | 9.20489367291
    10 | 24 | 30 | 9.19897910587
    10 | 32 | 10 | 9.25450519266
    10 | 32 | 20 | 9.20567410721
    10 | 32 | 30 | 9.18749240617
    20 | 16 | 10 | 5.1040872132
    20 | 16 | 20 | 4.82955431151
    20 | 16 | 30 | 4.7025300781
    20 | 24 | 10 | 5.11964372367
    20 | 24 | 20 | 4.84030537361
    20 | 24 | 30 | 4.78313760797
    20 | 32 | 10 | 5.22852708594
    20 | 32 | 20 | 4.91953677671
    20 | 32 | 30 | 4.86195922299
    30 | 16 | 10 | 4.12055264414
    30 | 16 | 20 | 3.74304424697
    30 | 24 | 10 | 4.1223783085
    30 | 24 | 20 | 3.75372882494
    30 | 32 | 10 | 4.10218005322
    30 | 32 | 20 | 3.75214909232
  • The performance of the random forest regressor with the best parameters on the training and test sets is shown below.

    maxDepth | maxBins | numTrees | Training RMSE | Test RMSE
    30 | 16 | 20 | 3.74304424697 | 9.32017817336

cadata dataset

  • Results of parameter selection for the random forest regressor are shown in the following table.

    maxDepth | maxBins | numTrees | RMSE
    10 | 16 | 10 | 55038.0827027
    10 | 16 | 30 | 53430.1370036
    10 | 16 | 50 | 52858.4370596
    10 | 24 | 10 | 50673.4009976
    10 | 24 | 30 | 49798.1615418
    10 | 24 | 50 | 50149.3180680
    10 | 32 | 10 | 49471.2622304
    10 | 32 | 30 | 49746.8571448
    10 | 32 | 50 | 48637.6722695
    20 | 16 | 10 | 27888.0801328
    20 | 16 | 30 | 24986.1227465
    20 | 16 | 50 | 24715.9205242
    20 | 24 | 10 | 25038.4034279
    20 | 24 | 30 | 22242.7560252
    20 | 24 | 50 | 21939.0580146
    20 | 32 | 10 | 23934.9090671
    20 | 32 | 30 | 21621.0973069
    20 | 32 | 50 | 21045.6223585
    30 | 16 | 10 | 27439.8585243
    30 | 16 | 30 | 24156.0625537
    30 | 16 | 50 | 24046.7530621
    30 | 24 | 10 | 24697.7285380
    30 | 24 | 30 | 21434.6262417
    30 | 24 | 50 | 20866.6998838
    30 | 32 | 10 | 23527.2245341
    30 | 32 | 30 | 20808.0404106
  • The performance of the random forest regressor with the best parameters on the training and test sets is shown below.

    maxDepth | maxBins | numTrees | Training RMSE | Test RMSE
    30 | 32 | 50 | 19943.0300873 | 49763.5977607

Coding details

  • The Python function for training a random forest regressor is shown in the following code block. The complete Python script can be found on my GitHub page.

    def randomForestRegression(trainingData,testData,trainingSize,testSize):
      '''
      random forest for regression
      '''
      # parameter range
      maxDepthValList = [10,20,30]
      maxBinsValList = [16,24,32]
      numTreesValList = [10,30,50]
      
      # best parameters
      bestMaxDepthVal = 10
      bestMaxBinsVal = 16
      bestNumTreesVal = 10
      bestTrainingRMSE = 1e10
      
      for maxDepthVal,maxBinsVal,numTreesVal in itertools.product(maxDepthValList,maxBinsValList,numTreesValList):
        model = RandomForest.trainRegressor(trainingData,categoricalFeaturesInfo={},numTrees=numTreesVal,featureSubsetStrategy="auto",impurity='variance',maxDepth=maxDepthVal,maxBins=maxBinsVal)
        predictions = model.predict(trainingData.map(lambda x: x.features))
        valsAndPreds = trainingData.map(lambda x: x.label).zip(predictions)
        trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
        # skip NaN scores so they can never be kept as the best
        if not math.isnan(trainingRMSE) and trainingRMSE < bestTrainingRMSE:
          bestMaxDepthVal = maxDepthVal
          bestMaxBinsVal = maxBinsVal
          bestNumTreesVal = numTreesVal
          bestTrainingRMSE = trainingRMSE
        print(maxDepthVal, maxBinsVal, numTreesVal, trainingRMSE)
      print(bestMaxDepthVal, bestMaxBinsVal, bestNumTreesVal, bestTrainingRMSE)
      
      # retrain with the best parameters
      model = RandomForest.trainRegressor(trainingData,categoricalFeaturesInfo={},numTrees=bestNumTreesVal,featureSubsetStrategy="auto",impurity='variance',maxDepth=bestMaxDepthVal,maxBins=bestMaxBinsVal)
      
      # evaluating the model on training data
      predictions = model.predict(trainingData.map(lambda x: x.features))
      valsAndPreds = trainingData.map(lambda x: x.label).zip(predictions)
      trainingRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / trainingSize)
      print(trainingRMSE)
      
      # evaluating the model on test data
      predictions = model.predict(testData.map(lambda x: x.features))
      valsAndPreds = testData.map(lambda x: x.label).zip(predictions)
      testRMSE = math.sqrt(valsAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / testSize)
      print(testRMSE)
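    The variance-reduction idea behind the random forest can be sketched in pure Python (a hypothetical illustration, not Spark's implementation): averaging estimators fit on bootstrap resamples gives a more stable prediction than any single estimator. Here each "model" is simply the mean of a resample.

    ```python
    import random

    def bootstrap_means(values, num_models=50, seed=0):
        # each "model" is the mean of a bootstrap resample (sampling with replacement)
        rng = random.Random(seed)
        n = len(values)
        return [sum(rng.choice(values) for _ in range(n)) / n
                for _ in range(num_models)]

    values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one outlier
    models = bootstrap_means(values)
    ensemble = sum(models) / len(models)
    print(min(models), max(models), ensemble)
    ```

    Individual resample means swing widely depending on how often the outlier is drawn, while their average stays close to the full-sample mean; this mirrors how a forest stabilizes the high-variance predictions of its individual deep trees.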

External reading materials