# A Basic Comparison of Basic Machine Learning Methods

This article summarizes the following post.

## Logistic Regression

- For linearly separable data
- Pretty robust; overfitting can be controlled with L1 or L2 regularization
- Easily distributable
- Outputs probabilities, so it can be used for ranking directly
- With L2 regularization, LR can serve as a baseline for any fancier solution (see the sketch after this list)
- Not so good for categorical variables (they need encoding, e.g. one-hot)
- If that is not enough, go for SVM or tree-ensemble models
- Can be trained online: just update with new data using online gradient descent
- Discriminative model
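
A minimal sketch of the points above, assuming scikit-learn: an L2-regularized LR baseline plus the online variant via `SGDClassifier.partial_fit`. The data is synthetic, and in scikit-learn >= 1.1 the logistic loss is spelled `log_loss`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty="l2" is the default; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0)
clf.fit(X_train, y_train)

# Probabilities come for free, so ranking test points is direct.
scores = clf.predict_proba(X_test)[:, 1]
print("baseline accuracy:", clf.score(X_test, y_test))

# Online variant: logistic loss trained with SGD, updated batch by batch.
online = SGDClassifier(loss="log_loss")
online.partial_fit(X_train, y_train, classes=[0, 1])
```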

## Naive Bayes

- Simple. Just do a bunch of counts (see the sketch after this list)
- Can be used as a baseline as well (just don’t dismiss it… yet)
- High bias and low variance, but still useful for small data sizes (per Domingos)
- Low variance means less likely to overfit
- Generative
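
A minimal sketch of the "bunch of counts" idea, assuming scikit-learn's `MultinomialNB`; the word-count matrix is made up for illustration (it is the kind of output a `CountVectorizer` would produce).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts over a tiny made-up vocabulary.
X = np.array([[2, 1, 0],
              [1, 3, 0],
              [0, 0, 4],
              [0, 1, 3]])
y = np.array([0, 0, 1, 1])

# Fitting is essentially counting; alpha adds Laplace smoothing to the counts.
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)

print(nb.predict([[1, 2, 0]]))  # the new doc's counts look like class 0
```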

## SVM

- Hinge loss
- Maximum margin
- Quite a few nonlinear kernels (probably better than LR with a manual nonlinear transformation; see the sketch after this list)
- Good for high-dimensional spaces
- Reported to work better for text classification problems
- Inefficient training; not suited for industry-scale applications
- Memory intensive
- Discriminative
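
A minimal sketch of the kernel point: on concentric circles (not linearly separable), plain LR sits near chance while an RBF-kernel SVM fits the boundary. Dataset and parameters are illustrative only.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric circles: no linear boundary can separate the classes.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

print("LR:     ", LogisticRegression().fit(X, y).score(X, y))  # around chance
print("RBF SVM:", SVC(kernel="rbf").fit(X, y).score(X, y))     # near 1.0
```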

## Tree Ensembles

- Good for high-dimensional spaces
- Good for nonlinear variables
- Good for categorical variables
- Non-parametric. No need to worry about outliers or whether the data is linearly separable
- Discriminative

### Random Forest

- Usually works out of the box

### Gradient Boosted Trees

- Generally performs better, if tuned right
- More hyperparameters to tune
- Prone to overfitting
- Not so good for multi-class problems: the total number of trees is `n_classes * n_estimators` (demonstrated in the sketch below)
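
A minimal sketch of the multi-class blow-up, assuming scikit-learn's `GradientBoostingClassifier`, which fits one regression tree per class per boosting round (synthetic 3-class data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=50)
gbt.fit(X, y)

# One tree per class per round: estimators_ has shape (n_estimators, n_classes).
print(gbt.estimators_.shape)  # (50, 3) -> 150 trees in total
```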

## Deep Learning

- Not general-purpose
- Apply it when you believe you can still squeeze out more after trying the methods above

## Generative vs Discriminative Modeling

### Generative

- Models the joint distribution: P(X, Y), or equivalently P(X|Y) and P(Y)
- More straightforward to detect shifts in the distribution
- Easier to detect outliers
- The assumption of distributions tends to prevent overfitting

### Discriminative

- Models the decision boundary, i.e. P(Y|X), directly (see the identity below)
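
The standard identity connecting the two: a generative model can recover the discriminative quantity P(Y|X) from the pieces it models, via Bayes' rule.

```latex
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
P(X) = \sum_{y} P(X \mid Y = y)\, P(Y = y)
```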

## Some side notes

- tends to overfit == high variance (of the parameter estimates)

# Off topic below

## Deep learning development

### Supervised learning

- Better initialization: introduced in 2010
- ReLU: introduced in 2011
- Dropout: a better regularizer in 2014
- Back then, RNNs could not remember things for more than a couple of steps
- Momentum
- DropConnect
- Maxout

### Unsupervised learning

- Pure unsupervised learning
- Transfer learning
- Semi-supervised learning
- Domain adaptation
- Self-taught learning

## ML in the late '90s

- SVM and convex optimization
- L1 regularization and sparsity
- Automated feature detection, like NMF

## Some quotes

Better data often beats better algorithms.

All learning is non-convex.
