This blog post contains the strategies, models and algorithms I used in the Facebook challenge of detecting bots in an online bidding environment. My model achieves an AUC score of 93.269%, while the best AUC score reported on the leaderboard is about 94.254%, ranking 56 out of 1004. Everything documented here was done within about one week, mostly at night, so the model can surely be improved given more time.

Table of contents

#Facebook challenge of detecting bots in the online bidding environment

  1. Event website: https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot
  2. Leaderboard: https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/leaderboard?submissionId=1680911
  3. My participation name: Aalto
  4. Ranking: 56 / 1004
  5. Performance in AUC: 93.269% / 94.254%
  6. GitHub project website: http://hongyusu.github.io/FacebookChallenge/
  7. GitHub project page: https://github.com/hongyusu/FacebookChallenge

Lessons learned

First things first, the lessons learned from this competition can be potentially useful to others. They are summarised in the following list.

  1. Support Vector Machine (SVM) can be a useful baseline classifier for the following reasons:
    1. SVM is a stable classifier: it is robust against perturbations of the training data.
    2. SVM offers a rich set of kernel functions that perform a nonlinear feature map from the original feature space into a high-dimensional feature space.
    3. In particular, kernel SVM allows user-defined kernel functions for measuring similarities.
    4. The optimization of the model is standard and fast.
    5. SVM training and prediction are implemented in off-the-shelf packages, e.g., LibSVM.
  2. However, it is better not to rely on SVM completely, for the following reasons:
    1. It is not very efficient on large-scale data.
    2. Although kernel SVM is very useful, it is impractical in most cases once the number of training examples exceeds a few thousand.
    3. Because it is a stable classifier, it is not a good base learner for ensemble learning.
  3. Random forest is another good option in addition to SVM, for the following reasons:
    1. Random forest does not need feature normalization, centering, etc.
    2. It still benefits from some feature selection, but not an extensive amount.
    3. Random forest has few parameters to tune; the main one is the number of decision trees.
    4. Learning a random forest classifier is relatively fast compared with learning an SVM model.
    5. There is an off-the-shelf implementation in MATLAB (TreeBagger).
  4. Adding more features is not always helpful. In many cases, extra features hurt the performance.
  5. It is better to start with one classification model (e.g., SVM), perform the experiments, and tune the model until hitting its bottleneck. Then switch, deliberately, to another classification method (e.g., random forest).
  6. Measure the performance in terms of AUC rather than accuracy, as accuracy is easily misled by class imbalance in the data.
  7. Training data and test data are sometimes very heterogeneous. Parameter choices and model performance based on cross validation on the training data might not carry over to the test data.
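To make lessons 1.2 and 1.3 concrete, here is a small Python sketch (an illustration, not the original MATLAB code): a kernel implicitly maps points into a high-dimensional space, and a kernel SVM only ever consumes the Gram matrix of pairwise similarities, so any user-defined similarity function can be plugged in. The toy points and the width `gamma` are made up.

```python
import math

def rbf_kernel(x, z, gamma):
    # Gaussian RBF kernel: exp(-gamma * ||x - z||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# toy data points, purely for illustration
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# the Gram matrix of pairwise similarities is all a kernel SVM needs;
# the diagonal is 1.0 and entries shrink as points move apart
gram = [[rbf_kernel(a, b, gamma=0.5) for b in X] for a in X]
```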

Final strategy

The following describes the final strategy I decided to use in the competition. It could be improved given more time.

Feature representation

  1. Features are extracted from the auction log, and include global features, auction-specific (local) features, per-hour features, per-minute features, and bag-of-words features.

  2. Some of the computed features are not yet used by the learning algorithm; incorporating them could potentially improve the prediction performance of the model.

    1. Global features

      1. Global features are computed using all bids of a bidder.
      2. Global features are listed in the following table.
      3. There is only one position for the bid feature, and four positions for each of the others.
      4. Features number 8-14 are computed but are not used at this stage.

      | Feature Id | Feature Name | Feature Position |
      |---|---|---|
      | 1 | bid | 1 |
      | 2 | auction | 3 |
      | 3 | merchandise | 4 |
      | 4 | device | 5 |
      | 5 | time | 6 |
      | 6 | country | 7 |
      | 7 | ip | 8 |
      | 8 | url | 9 |
      | 9 | minute | 10 |
      | 10 | hour | 11 |
      | 11 | day | 12 |
      | 12 | hourday | 13 |
      | 13 | interval | 14 |
      | 14 | average interval | 15 |
    2. Auction specific features

      1. Auction-specific features are extracted from bidder-auction files.
      2. We use auctions with more than 2 bids.
      3. Features are listed in the following table.
      4. There is one position for the bid sum, plus one position for the per-auction average of each individual feature.

      | Feature Id | Feature Name | Feature Position |
      |---|---|---|
      | 1 | bid | 1 |
      | 2 | device | 5 |
      | 3 | country | 7 |
      | 4 | ip | 8 |
      | 5 | url | 9 |
      | 6 | interval | 14 |
    3. Per-hour features

      1. Per-hour features include the features listed in the following table.
      2. Per-hour features are computed on a per-hour basis.
      3. There is one position for the average of each per-hour feature.

      | Feature Id | Feature Name | Feature Position |
      |---|---|---|
      | 1 | bid | 1 |
      | 2 | auction | 3 |
      | 3 | merchandise | 4 |
      | 4 | device | 5 |
      | 5 | country | 7 |
      | 6 | ip | 8 |
      | 7 | url | 9 |
    4. Per-minute features

      1. Per-minute features include the features listed in the following table.
      2. Per-minute features are computed on a per-minute basis.
      3. There is one position for the average of each per-minute feature.

      | Feature Id | Feature Name | Feature Position |
      |---|---|---|
      | 1 | bid | 1 |
      | 2 | auction | 3 |
      | 3 | merchandise | 4 |
      | 4 | device | 5 |
      | 5 | country | 7 |
      | 6 | ip | 8 |
      | 7 | url | 9 |
    5. Bag-of-words features

      1. Each bidder is represented by the following bag-of-words features.
      2. Most of them are not used at the current stage of the experiments due to the time limit.
      3. In the current experiments we use only the auction/merchandise/device/country information as bag-of-words features.

      | Feature Id | Feature Name | Feature Position |
      |---|---|---|
      | 1 | auction | 3 |
      | 2 | merchandise | 4 |
      | 3 | device | 5 |
      | 4 | time | 6 |
      | 5 | country | 7 |
      | 6 | ip | 8 |
      | 7 | url | 9 |
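To make the feature construction above concrete, here is a minimal Python sketch (an illustration, not the author's code) of the counting idea behind the global features: one position for the total number of bids, plus the number of unique values per field. The toy bid log and its column layout are assumptions.

```python
from collections import defaultdict

# toy bid log rows: (bidder_id, auction, device, country) -- assumed columns
bids = [
    ("b1", "a1", "d1", "fi"),
    ("b1", "a1", "d2", "fi"),
    ("b1", "a2", "d2", "se"),
    ("b2", "a1", "d1", "fi"),
]

# group the log by bidder
per_bidder = defaultdict(list)
for bidder, *fields in bids:
    per_bidder[bidder].append(fields)

# per bidder: [total bids, #unique auctions, #unique devices, #unique countries]
features = {}
for bidder, rows in per_bidder.items():
    columns = list(zip(*rows))  # transpose rows into per-field columns
    features[bidder] = [len(rows)] + [len(set(col)) for col in columns]
```

Min/max/average bids per unique value (the remaining global positions) follow the same grouping pattern.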

Learning algorithm

  1. Support Vector Machine

    1. Support Vector Machine (SVM) with probability output serves as the baseline learner.
    2. The feature representation of the data was introduced in the last section.
    3. We standardize each feature to have zero mean and unit variance.
    4. Standardization is computed over both the training and test data.
    5. We use a Gaussian RBF kernel over the feature representation of the data points.
    6. We use grid search for the best parameter combination, e.g., the margin slack parameter and the Gaussian width parameter.
    7. The best model is then used for prediction.
    8. We use the SVM implementation from the LibSVM toolbox.
    9. The basic MATLAB usage example is
      % note: the kernel width g needs a numeric format (%.4f), not %s
      model = svmtrain(Ytr, Xtr, sprintf('-q -s 0 -c %.2f -t 2 -g %.4f -b 1', c, g));
      [~,~,Yprobsvm] = svmpredict(Yts, Xts, model, '-b 1');
      [~,~,~,AUC] = perfcurve(Yts, Yprobsvm, 1);
  2. Random Forest

    1. We use the built-in TreeBagger function in MATLAB.
    2. We tune the number-of-trees parameter with 10-fold cross validation.
    3. The best model is then used for prediction.
    4. The basic MATLAB usage example is
      % c is the number of trees
      model = TreeBagger(c, Xtr, Ytr, 'Method', 'classification');
      [~,Yprobtree] = predict(model, Xts);
    5. Be careful with the order of the probability outputs: the columns of Yprobtree follow the order of model.ClassNames.
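For readers without MATLAB, the two-model pipeline above can be sketched in scikit-learn (a hedged translation, not the original code): standardize, grid-search an RBF-SVM over C and gamma scored by AUC, then train a random forest and select the positive-class probability column explicitly. The synthetic dataset and the grid values are illustrative, not the tuned ones from the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic imbalanced data standing in for the bidder features
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)
Xtr, Xts, ytr, yts = train_test_split(X, y, stratify=y, random_state=0)

# SVM baseline: standardize, RBF kernel, grid-search C and gamma by AUC
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
param_grid = {"svc__C": [0.5, 5, 50], "svc__gamma": [0.01, 0.5]}
search = GridSearchCV(svm, param_grid, scoring="roc_auc", cv=5).fit(Xtr, ytr)
svm_auc = roc_auc_score(yts, search.predict_proba(Xts)[:, 1])

# random forest: the probability columns follow rf.classes_, so pick the
# positive-class column explicitly rather than assuming it is last
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
pos = list(rf.classes_).index(1)
rf_auc = roc_auc_score(yts, rf.predict_proba(Xts)[:, pos])
```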

Measure the performance

We use AUC score to measure the prediction performance of the models. AUC is the area under the ROC curve.
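As a small illustration of why AUC rather than accuracy (my own toy example, not from the original experiments): AUC equals the Mann-Whitney statistic, the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. On imbalanced labels, a useless constant classifier gets high accuracy but only 0.5 AUC.

```python
def auc(scores, labels):
    # Mann-Whitney formulation of AUC: the fraction of (positive, negative)
    # pairs where the positive example gets the higher score (ties = 0.5)
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# imbalanced toy labels: 5% bots, like a heavily skewed bidder population
labels = [0] * 950 + [1] * 50
predictions = [0] * 1000            # a classifier that always predicts "human"

accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
auc_constant = auc([0.0] * 1000, labels)
# accuracy is 0.95, yet AUC is 0.5: the classifier has no discriminative power
```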

Experimental results:

  1. SVM

    1. global + auction + hour, duplicate 7 global feature, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8916 | 0.9950 | 0.5 | 50 |
    2. auction + hour + minute, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8410 | 0.9325 | 5 | 0.5 |
    3. global + auction + hour + minute, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8788 | 0.9898 | 1 | 50 |
    4. global + auction + hour + minute, duplicate 7 global, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8895 | 0.9936 | 0.01 | 50 |
    5. global + auction + hour + minute + bag-of-words country, duplicate 7 global, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8819 | 0.9957 | 5 | 50 |
    6. bag-of-words country, 10 fold cv, tfidf

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.6934 | 0.8241 | 1 | 0.001 |
    7. bag-of-words country, 10 fold cv, centering

      | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
      |---|---|---|---|---|---|
      | 0608 | | 0.8484 | 0.9507 | 0.5 | 0.5 |
  2. Random Forest

    1. Random forest, global + auction + hour + minute + bag-of-words country, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | N |
      |---|---|---|---|---|
      | 0608 | | 0.9220 | 0.9974 | 125 |
    2. Random forest, global + auction + hour + minute + bag-of-words country/device, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | N |
      |---|---|---|---|---|
      | 0608 | 0.90757 | 0.9238 | 0.9978 | 135 |
    3. Random forest, global + auction + hour + minute + bag-of-words auction/merchandise/country/device, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | N |
      |---|---|---|---|---|
      | 0608 | | 0.9107 | 0.9980 | 195 |
    4. Random forest, global + auction + hour + minute + bag-of-words merchandise/country/device, 10 fold cv

      | Time | Test AUC | CV best AUC | Training all AUC | N |
      |---|---|---|---|---|
      | 0608 | | 0.9191 | 0.9981 | 195 |

Best model

The best performance is achieved with model 2.2. In particular, it is a random forest learner with various hand-crafted features.


History of coding and experiments

The following text documents the history of my participation in the competition, including

  1. Different feature embedding strategies
  2. Coding
  3. Parameter selection methods
  4. Experimental results

Basic setting as baseline

  • Support Vector Machine (SVM) with probability output as the baseline learner.
  • Compose about 10 global values as the feature representation of each bidder.
  • Gaussian RBF kernel over the feature representation.
  • Grid search for the best parameters, e.g., the margin slack parameter and the Gaussian width parameter.
  • The best model is then used for prediction.
  • Prediction performance is shown in the following table.
| Methods | Time | Test AUC | Training best AUC | Training all AUC |
|---|---|---|---|---|
| Global | 20150528 | 0.72485 | 0.75166 | 0.90494 |
| Global+Local | 20150530 | 0.87348 | 0.8230 | 0.8861 |
| Global+Local(correct) | 20150530 | 0.89848 | 0.8884 | 0.8983 |
| Global+Local(correct)+10 | 20150601 | 0.88199 | 0.8858 | 0.9200 |
| Global+Local(correct)+20 | 20150601 | 0.88186 | 0.8986 | 0.9198 |
| add new time feature | 20150601 | 0.88900 | 0.9020 | 0.9253 |

###Time

  • 1 second = 52631578 raw time units
  • min: 9631916842105263
  • max: 9772885210526316
  • The data is for a period of 31 days.
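A quick consistency check (my own arithmetic, not from the original post) shows that the three numbers above agree with the stated 31-day span:

```python
UNITS_PER_SECOND = 52631578                      # 1 second in raw time units
T_MIN, T_MAX = 9631916842105263, 9772885210526316

span_seconds = (T_MAX - T_MIN) / UNITS_PER_SECOND
span_days = span_seconds / 86400                 # 86400 seconds per day
# span_days comes out at essentially 31.0
```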

###Features: unique counts

  1. unique auction
  2. unique merchandise
  3. unique device
  4. unique time in second
  5. unique country
  6. unique ip
  7. unique url
  8. unique minute
  9. unique hour
  10. unique day
  11. unique hour-day
  12. unique interval
  13. unique intervalcount
  • Global features (1+13x4)

    • How many unique XXX
    • MIN, MAX, AVERAGE bids per unique XXX
  • Local features

    • auctions with more than 100 bidders
  • 1:12 (1-12)

  • 13:900

  • 1 (sum)

  • 12x4 (1-13/ time)

  • 3552

###Log

###History

  1. 1+7+7x4 features, SVM, 10 fold cv
  2. 1+7+7x4 features, SVM, 5 fold cv
  3. 1+7+7x4 features, SVM, 5 fold cv, 5 replicates
  4. 1+7+7x4 features, SVM, 10 fold cv, 10 replicates
  5. 1+11+11x4 features, SVM, 10 fold cv, 10 replicates
  6. 1+11+11x4 features, SVM, 5 fold cv, 5 replicates
  7. 1+11x4 features, SVM, 5 fold cv, 5 replicates
  8. 1+11x4 features, SVM, 10 fold cv, 10 replicates
  9. 1+13x4 features, SVM, 10 fold cv, 10 replicates
  10. 1+13x4 features, SVM, 5 fold cv, 5 replicates
  11. 1+13x4 features, SVM, 10 fold cv, 10 replicates, 0 feature -> 0 prediction
  12. 1+13x4 features, SVM, 10 fold cv, 10 replicates, voting over 10 replicates
  13. 1+13x4 features, SVM, 15 fold cv, 15 replicates, voting over 15 replicates
  14. 1+12x4 features (no hour-day), SVM, 15 fold cv, 15 replicates, voting over 15 replicates
  15. 1+11x4 features (no hour,hour-day), SVM, 15 fold cv, 15 replicates, voting over 15 replicates
| ind | Time | Test AUC | Training best AUC | Training all AUC | SVM C | RBF G |
|---|---|---|---|---|---|---|
| 1 | 20150601 | | 0.8776 | 0.9600 | 10 | 0.1 |
| 2 | 20150601 | | 0.8729 | 0.9874 | 0.1 | 50 |
| 3 | 20150601 | | 0.8750 | 0.9181 | 0.05 | 0.01 |
| 4 | 20150601 | | 0.8808 | 0.9173 | 0.1 | 0.01 |
| 5 | 20150601 | | 0.8840 | 0.9212 | 0.05 | 0.01 |
| 6 | 20150601 | | 0.8873 | 0.9201 | 0.5 | 0.01 |
| 7 | 20150601 | | 0.8745 | 0.9257 | 0.05 | 100 |
| 8 | 20150601 | | 0.8743 | 0.9275 | 0.01 | 1 |
| 9 | 20150602 | 0.84797 | 0.8827 | 0.9902 | 0.5 | 50 |
| 10 | 20150602 | | 0.8790 | 0.9845 | 0.1 | 50 |
| 11 | 20150602 | 0.84797 | 0.8837 | 0.9902 | 0.5 | 50 |
| 12 | 20150602 | 0.85515 | 0.9148 | | 0.5 | 50 |
| 13 | 20150602 | | 0.9159 | | 0.5 | 50 |
| 14 | 20150603 | | 0.9184 | | 5 | 50 |
| 15 | 20150603 | | 0.9030 | | 5 | 50 |

Experimental settings with local features

  1. 1+12x4 global features + 10x2 averaged local features, auctions with > 100 bidders, 10 fold cv+replications, voting in single best fold, centering

  2. 1+12x4 global features + 10x2 averaged local features, auctions with > 100 bidders, 10 fold cv+replications, use all training data, centering

  3. 1+12x4 global features + 10x2 averaged local features, auctions with > 100 bidders, 10 fold cv, use all training data, centering

  4. 1+12x4 global features, auctions with > 100 bidders, 10 fold cv, use all training data, centering

  5. Nx10 local auction features, auctions with > 100 bidders, 10 fold cv+replications, centering

| ind | Time | Test AUC | Training best AUC | Training all AUC | K | k | SVM C | RBF G |
|---|---|---|---|---|---|---|---|---|
| 1 | 0604 | 0.83405 | 0.9160 | | 10 | 6 | 50 | 0.5 |
| 2 | 0604 | 0.83991 | 0.8919 | 0.9868 | | | 10 | 0.5 |
| 3 | 0604 | | 0.9002 | 0.9868 | | | 10 | 0.5 |
| 4 | 0604 | 0.83685 | 0.9028 | 0.9912 | | | 0.01 | 50 |
  1. 1+13x4 global features (original x7, time x4, interval x2), 10 fold cv, use all training data, centering
  2. 1+13x4 global features (original x7, time x4, interval x2) + duplicate x1 global feature x13, 10 fold cv, use all training data, centering
  3. 1+13x4 global features (original x7, time x4, interval x2) + duplicate x2 global feature x13, 10 fold cv, use all training data, centering
  4. 1+13x4 global features (original x7, time x4, interval x2) + duplicate x1 global feature x13 + local, 10 fold cv, use all training data, centering
    1. local features
      1. local features are extracted from the bidder-auction file
      2. we use auctions with more than 5 bids
      3. features include
        1. bid
        2. device
        3. country
        4. ip
        5. url
        6. interval
      4. there is one position for the bid sum, and one position for the average of each individual feature (this constraint is added and experimented with as experiment 5)
    2. global features
      1. global features are computed using all bids
      2. global feature includes
        1. bid
        2. auction
        3. device
        4. country
        5. time
        6. ip
        7. url
        8. minute
        9. hour
        10. day
        11. hour-day
        12. interval
        13. average interval
      3. there is only one position for the bid feature, four positions for the others
  5. As 4, only 4.4 changes
  6. As 5, 5 fold cv
  7. As 5, global feature without time, 5 fold cv
  8. As 5, global feature without hour-day and interval, 10 fold cv
  9. As 5, + local time features, 10 fold cv
  10. As 9, random seed is 0
| Experiment | Time | Test AUC | CV best AUC | Training all AUC | SVM C | RBF G |
|---|---|---|---|---|---|---|
| 1 | 0605 | | 0.8987 | 0.9824 | 0.01 | 50 |
| 2 | 0605 | | 0.9038 | 0.9406 | 0.01 | 1 |
| 3 | 0605 | | 0.8996 | 0.9867 | 0.05 | 50 |
| 4 | 0605 | 0.85881 | 0.8959 | 0.9811 | 0.1 | 50 |
| 5 | 0605 | | 0.8962 | 0.9816 | 0.1 | 50 |
| 6 | 0605 | | 0.8922 | 0.9803 | 0.5 | 50 |
| 7 | 0605 | | 0.8727 | 0.9803 | 0.01 | 0.01 |
| 8 | 0605 | | 0.8900 | 0.9220 | 0.05 | 0.01 |
| 9 | 0605 | 0.86097 | 0.8964 | 0.9851 | 1 | 50 |
| 10 | 0605 | 0.88444 | 0.8956 | 0.9445 | 0.5 | 1 |