Scikit: A machine learning package for Python


Table of contents

  1. Instructions for installation
  2. Coding examples

Instructions for installation

Installing the Scikit-learn package appears to require administrator rights, for example when you install it system-wide with pip install. However, admin rights are not strictly necessary: you can install the package under your own directory instead, which is handy, e.g., on a computer cluster with a shared filesystem. Just follow these simple steps.

  1. Get the Scikit-learn package from its GitHub repository with the following command

    git clone git@github.com:scikit-learn/scikit-learn.git

  2. Compile the package with make from the root of the cloned repository

  3. Install the package under the current user directory with the following command

    python setup.py install --user

  4. Now you are ready to use Scikit-learn. Open a Python interpreter and import the package (a quick sanity check is sketched after this list)

    import sklearn

  5. For more details, please refer to the Scikit-learn homepage.
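
A quick way to confirm that the user-local installation is the one Python actually picks up is to print the package version and its install location. A minimal sketch:

    import sklearn

    # version of the installed package
    print sklearn.__version__
    # this path should point into your own directory (e.g. under ~/.local),
    # not into the system-wide site-packages
    print sklearn.__file__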

Coding examples

Support vector machines for the MNIST dataset

  1. Scikit-learn uses the SVM implementation from libSVM.
  2. The sklearn.datasets module can be used to load the MNIST data; there are 70,000 handwritten digits in the MNIST dataset.
  3. For now, we use the preprocessed handwritten digit data from here.
  4. The sklearn.cross_validation module is used to randomly split the original MNIST dataset into training and test sets. In particular, we use 60,000 digits for training and 10,000 digits for testing.
  5. In this demo, we train svm.SVC() with its default settings and no parameter selection. One can always tune the kernel parameters (e.g., the width of the Gaussian kernel and the penalty C) to achieve better results; a sketch is given after the script.
  6. With the above setting, I get 9435 of the 10,000 test digits correct.
  7. The script is shown below.
from sklearn.datasets import fetch_mldata
from sklearn import svm
from sklearn.cross_validation import train_test_split
import cPickle
import gzip

def svm_baseline():
    # Alternative: download MNIST with sklearn and split it 60000/10000 ourselves
    #mnist = fetch_mldata('MNIST original', data_home='../ScikitData')
    #Xtr,Xts,Ytr,Yts = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)

    # load the preprocessed MNIST data: (images, labels) tuples for the
    # training, validation and test splits
    f = gzip.open('../ScikitData/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    # train an SVM classifier with scikit-learn's default settings
    model = svm.SVC()
    model.fit(training_data[0], training_data[1])

    # predict the test digits and count how many are classified correctly
    predictions = [int(a) for a in model.predict(test_data[0])]
    num_corr = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
    print "Baseline classifier using an SVM."
    print "%s of %s values correct." % (num_corr, len(test_data[1]))

if __name__ == '__main__':
    svm_baseline()
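
As noted in item 5, the default settings can usually be improved by tuning the kernel parameters. The following is only a sketch of how such a search might look, using GridSearchCV (found in sklearn.grid_search in scikit-learn releases of that time, and in sklearn.model_selection in later ones); the training subset size and the parameter grid are illustrative assumptions, not the settings behind the 9435/10000 result above.

from sklearn import svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
import cPickle
import gzip

def svm_grid_search():
    # load the same preprocessed MNIST data as in the baseline script
    f = gzip.open('../ScikitData/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    # a grid search over all training digits is slow, so tune on a subset
    Xtr, Ytr = training_data[0][:10000], training_data[1][:10000]

    # candidate values for the penalty C and the Gaussian kernel width gamma
    param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.03, 0.1]}
    search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
    search.fit(Xtr, Ytr)

    print "Best parameters:", search.best_params_
    print "Cross-validation accuracy: %.4f" % search.best_score_

if __name__ == '__main__':
    svm_grid_search()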

Random forests for the MNIST dataset

  1. The random forest classifier is implemented in the sklearn.ensemble package.
  2. For a random forest classifier, the following parameters should be specified:
    1. n_estimators, the number of trees in the forest.
    2. max_depth, the maximum depth of a tree.
  3. In this demo, we train random forest classifiers with 10, 100, 500, and 1000 trees; a sketch of this sweep is given after the script.
  4. The results are shown in the following table.

    n_estimators   Performance
    10             9466/10000
    100            9679/10000
    500            9706/10000
    1000           9711/10000
  5. The training time of a random forest classifier is much shorter than that of the SVM.
  6. The Python script is shown below.
from sklearn.datasets import fetch_mldata
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import cPickle
import gzip

def rf_baseline():
    # Alternative: download MNIST with sklearn and split it 60000/10000 ourselves
    #mnist = fetch_mldata('MNIST original', data_home='../ScikitData')
    #Xtr,Xts,Ytr,Yts = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)

    # load the preprocessed MNIST data: (images, labels) tuples for the
    # training, validation and test splits
    f = gzip.open('../ScikitData/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    # train a random forest with 1000 fully grown trees
    model = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=1, random_state=0)
    model.fit(training_data[0], training_data[1])

    # predict the test digits and count how many are classified correctly
    predictions = [int(a) for a in model.predict(test_data[0])]
    num_corr = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
    print "Baseline classifier using a random forest."
    print "%s of %s values correct." % (num_corr, len(test_data[1]))

if __name__ == '__main__':
    rf_baseline()
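
The results in the table above come from repeating the experiment with different values of n_estimators. The following is a sketch of how such a sweep might be scripted; the exact scores and training times will vary with the machine and the scikit-learn version.

from sklearn.ensemble import RandomForestClassifier
import cPickle
import gzip
import time

def rf_sweep():
    # load the same preprocessed MNIST data as in the baseline script
    f = gzip.open('../ScikitData/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    for n_trees in [10, 100, 500, 1000]:
        model = RandomForestClassifier(n_estimators=n_trees, random_state=0)

        # time the training step to compare against the SVM baseline
        start = time.time()
        model.fit(training_data[0], training_data[1])
        elapsed = time.time() - start

        # count the correctly classified test digits
        predictions = model.predict(test_data[0])
        num_corr = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
        print "n_estimators=%d: %d/%d correct, trained in %.1f seconds" % (
            n_trees, num_corr, len(test_data[1]), elapsed)

if __name__ == '__main__':
    rf_sweep()
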
Hongyu Su 13 August 2015