A Basic Comparison of Basic Machine Learning Methods

2 minute read

This article summarizes the following post

Logistic Regression

  • For linearly separable data
  • Pretty robust; overfitting can be avoided with L1 or L2 regularization
  • Easily distributable
  • Outputs probabilities, so it can be used for ranking directly
  • With L2 regularization, LR serves as a baseline for fancier solutions
  • Not so good for categorical variables; for those, go for SVM or tree-ensemble models instead
  • Can be trained online: just update with new data using online gradient descent
  • Discriminative model

Naive Bayes

  • Simple. Just do a bunch of counts.
  • Can be used as a baseline as well (just don’t dismiss it… yet)
  • High bias and low variance, but still useful for small data sizes (Domingos)
  • Low variance means less likely to overfit
  • Generative
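
Since training really is "a bunch of counts," a toy text example makes the point (scikit-learn and the tiny corpus are my assumptions):

```python
# Sketch: Naive Bayes on word counts -- fitting is little more than
# counting per-class token frequencies, which is why it is so cheap.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(docs)  # bag-of-words counts
nb = MultinomialNB().fit(X, labels)        # class priors + smoothed counts
print(nb.predict(X))  # recovers the training labels on this toy set
```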


SVM

  • Hinge loss
  • Maximum margin
  • Quite a few nonlinear kernels (probably better than LR with a nonlinear transformation)
  • Good for high dimensional space.
  • Reported better for text classification problems
  • Inefficient training. Not for industry-scale applications
  • Memory intensive
  • Discriminative
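
A quick illustration of the kernel bullet, on a dataset of my choosing (scikit-learn, concentric circles): a linear SVM fails where the data is not linearly separable, while an RBF kernel handles it.

```python
# Sketch: max-margin SVMs on two concentric circles -- not linearly
# separable, so the RBF kernel clearly beats the linear one.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit nonlinear feature map
# rbf.score(X, y) lands near 1.0; linear.score(X, y) stays near chance
```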

Tree Ensembles

  • Good for high dimensional space
  • Good for nonlinear variables
  • Good for categorical variables
  • Non-parametric: no need to worry about outliers or whether the data is linearly separable
  • Discriminative

Random Forest

  • Usually works out of the box
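
"Out of the box" can be shown directly; this sketch assumes scikit-learn and a synthetic task, with all-default hyper-parameters and zero tuning:

```python
# Sketch: a random forest with default hyper-parameters still
# cross-validates well on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
# scores.mean() comes out well above chance with no tuning at all
```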

Gradient Boost Tree

  • Generally performs better, if tuned correctly
  • More hyper-parameters to tune
  • Prone to overfitting
  • Not so good for multi-class problems: the total # of trees is n_classes * n_estimators
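
The tree-count point can be checked on a 3-class problem; scikit-learn's GradientBoostingClassifier is my choice of implementation here:

```python
# Sketch: multi-class gradient boosting fits one tree per class per
# round, so the fitted ensemble holds n_estimators x n_classes trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
gbt = GradientBoostingClassifier(n_estimators=10).fit(X, y)
print(gbt.estimators_.shape)  # (10, 3): 30 trees for 10 boosting rounds
```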

Deep Learning

  • Not a general-purpose method
  • Apply it when you believe you can still squeeze out more after trying the methods above

Interesting reading

Generative vs Discriminative Modeling


Generative

  • Models the joint distribution: P(X,Y), or equivalently P(X|Y) and P(Y)
  • More straightforward to detect shifts in distribution
  • Easier to detect outliers
  • The assumed distributions tend to prevent overfitting

Discriminative

  • Models the classification boundary directly
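
The contrast can be made concrete (scikit-learn and synthetic data are my assumptions): GaussianNB is generative, modeling P(X|Y) and P(Y) and applying Bayes' rule, while logistic regression models P(Y|X) directly.

```python
# Sketch: generative vs discriminative classifiers on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

gen = GaussianNB().fit(X, y)           # learns P(X|Y) (per-class Gaussians) and P(Y)
disc = LogisticRegression().fit(X, y)  # learns P(Y|X) directly

priors = gen.class_prior_  # the learned class distribution P(Y)
```

Only the generative model exposes a learned P(Y); the discriminative one has no distribution to inspect, which is exactly the trade-off the bullets describe.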

Some side notes

  • tends to overfit == high variance (of the parameter estimates)

Off topic below

Deep learning development

Supervised learning

  • Better initialization: introduced in 2010
  • ReLU: introduced in 2011
  • Dropout: a better regularizer, in 2014
  • RNNs could not remember things for more than a couple of steps back then
  • Momentum
  • DropConnect
  • Maxout

Unsupervised learning

  • Pure unsupervised learning
  • Transfer learning
  • Semi-supervised learning
  • Domain adaptation
  • Self-taught learning

ML in late 90’s

  • SVM and convex optimization
  • L1 regularization and sparsity
  • Automated feature detection like NMF
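
As a small illustration of the NMF bullet (scikit-learn and random data, my assumptions):

```python
# Sketch: NMF factors a non-negative matrix X (20x10) into additive
# parts W (20x4) and H (4x10) -- an early automated-feature technique.
import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(0).rand(20, 10)  # non-negative by construction
model = NMF(n_components=4, random_state=0, max_iter=500)
W = model.fit_transform(X)  # per-sample component weights
H = model.components_       # per-component feature patterns
```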

Potentially interesting readings

Some quotes

Better data often beats better algorithms.

All learning is non-convex.

