## What’s the idea of SVM?

SVM (also called *Maximum Margin Classifier*) is an algorithm that takes the data as an input and outputs a line/hyperplane that separates those classes if possible.

Suppose that we need to separate two classes of a dataset. The task is to find a line to separate them. However, there are many lines which can do that (countless number of lines). How can we choose the best one?

*An idea of support vectors (samples on the margin) and SVM (find the optimal hyperplane).*

## Using SVM with kernel trick

Most of the time, we cannot separate classes in the current dataset easily (not linearly separable data). We need to use **kernel trick** first (transform from the current dimension to a higher dimension) and then we use SVM. These classes are not linearly separable.

*An idea of kernel and SVM. Transform from 1D to 2D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.*

*An idea of kernel and SVM. Transform from 2D to 3D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.*

A **kernel** is a dot product in some feature space:

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i, \mathbf{x}_j). $$

It also measures **the similarity** between two points $\mathbf{x}_i$ and $\mathbf{x}_j$.

We have some popular kernels,

**Linear kernel**: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$. We use`kernel = 'linear'`

in`sklearn.svm.SVM`

. Linear kernels are rarely used in practice.**Gaussian kernel**(or): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2)$. It’s used the most. We use*Radial Basic Function*– RBF`kernel = 'rbf'`

(default) with keyword`gamma`

for $\gamma$ (must be greater than $0$) in`sklearn.svm.SVM`

.**Exponential kernel**: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert)$.**Polynomial kernel**: $K(\mathbf{x}_i, \mathbf{x}_j) = (r+\gamma\mathbf{x}_i \cdot \mathbf{x}_j)^d$. We use`kernel = 'poly'`

with keyword`degree`

for $d$ and`coef0`

for $r$ in`sklearn.svm.SVM`

. It’s more popular than RBF in NLP. The most common degree is $d = 2$ (quadratic), since larger degrees tend to overfit on NLP problems.^{[ref]}**Hybrid kernel**: $K(\mathbf{x}_i, \mathbf{x}_j) = (p+\mathbf{x}_i \cdot \mathbf{x}_j)^q\exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2)$.**Sigmoidal**: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma\mathbf{x}_i \cdot \mathbf{x}_j+r)$. We use`kernel = 'sigmoid'`

with keyword`coef0`

for $r$ in`sklearn.svm.SVM`

.

We can also define a custom kernel thanks to this help.

Choose whatever kernel performs best on cross-validation data. Andrew NG said in his ML course.

## Good or Bad?

Advantages:

- Compared to both logistic regression and NN, a SVM sometimes gives a cleaner way of learning non-linear functions.
- SVM is better than NN with 1 layer (Perceptron Learning Algorithm) thanks to the largest margin between 2 classes.
- Accurate in high-dimensional spaces + memory effecient.
- Good accuracy and perform faster prediction compared to Naïve Bayes algorithm.
^{[ref]}

Disadvantages:

- Prone to overfitting: if number of features are larger than number of samples.
- Don’t provide probability estimation.
- Not efficient if your data is very big!
- It works poorly with overlapping classes
- Sensitive to the type of kernel used.

## SVM used for?^{[ref]}

- Classification, regression and outliers detection.
- Face detection.
- Text and hypertext categorization.
- Detecting spam.
- Classification of images.
- Bioinformatics.

## Using SVM with Scikit-learn

```
from sklearn.svm import SVC
svc = SVC(kernel='linear') # default = 'rbf' (Gaussian kernel)
# other kernels: poly, sigmoid, precomputed or a callable
svc = svc.fit(X, y)
svc.predict(X)
# gives the support vectors
svc.support_vectors_
```

There are other parameters of `sklearn.svm.SVM`

.

In the case of **linear SVM**, we can also use `sklearn.svm.LinearSVC`

. It’s similar to `sklearn.svm.SVG`

with `kernel='linear'`

but implemented in terms of `liblinear`

rather than `libsvm`

, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.^{[ref]}

### Meaning of some parameters

**The Regularization parameter** (`C`

, default `C=1.0`

): if `C`

is larger, hyperplane has smaller margin but do a better job of classification and otherwise. This is how you can control the trade-off between decision boundary and misclassification term.

**Higher values**of`C`

$\Rightarrow$ a higher possibility of overfitting, the softmargin SVM is equivalent to the hard-margin SVM.**Lower values**of`C`

$\Rightarrow$ a higher possibility of underfitting. We admit misclassifications in the training data

We use this in the case of *not linearly separable data*; It’s also called **soft-margin linear SVM**.

*An illustration of using C.*

*An illustration of using C. Bigger C, smaller margin.*

^{[ref]}

**Gamma** (`gamma`

, default `gamma='auto'`

which uses `1/n_features`

): determine the number of points to construct the hyperplane.

*An illustration of using gamma. In low-gamma case, we only consider points nearby the hyperplane, it may cause an overfitting.*

*Bigger gamma, more change to get overfitting (in a XOR problem).*

## SVM in action

## References

**Scikit-learn**– SVM official doc.**Simplilearn**– How Support Vector Machine Works | SVM In Machine Learning.**Tiep Vu**– Bài 19: Support Vector Machine.**Jeremy Kun**– Formulating the Support Vector Machine Optimization Problem.**Tiep Vu**– Bài 20: Soft Margin Support Vector Machine.**Tiep Vu**- Bài 21: Kernel Support Vector Machine.**Alexander Statnikov**,**Douglas Hardin**,**Isabelle Guyon**,**Constantin F. Aliferis**– A Gentle Introduction to Support Vector Machines in Biomedicine.**Jake VanderPlas**– In-Depth: Support Vector Machines. – Example: How to code and illustrate hyperplane and support vectors in Python?**Chris Albon**– Notes about Support Vector Machines.**Andrew NG**– My raw note when I learned the course Machine Learning on Coursera.