What's the idea of SVM?
SVM (also called Maximum Margin Classifier) is an algorithm that takes the data as an input and outputs a line/hyperplane that separates those classes if possible.
Suppose that we need to separate two classes of a dataset. The task is to find a line to separate them. However, there are many lines which can do that (countless number of lines). How can we choose the best one?
An idea of support vectors (samples on the margin) and SVM (find the optimal hyperplane).
We need to find a hyperplane : where the weights and a point . For example, in 2D (), we need to find a hyperplane . Note also that, the distance between a point and is given by,
In order to understand the idea, we consider a 2D case with the classification of 2 classes (blue faces are numbered as "" and orange faces are numbered as "").
We need to find an optimal hyperplane between 2 classes (find and ).
Recall that a margin (geometric margin) is the minimum distance between a hyperplane and the closest point(s) to it. Thanks to , we can find the margin to a hyperplane by determining the closet distance from an arbitrary points to that hyperplane via,
Note that, because takes values or and it always has the same sign as , is always positive.
The SVM problem is to find () so that the hard-margin has the maximum value, i.e.,
( means you need to find so that the function reaches the .)
This is an optimal problem. Another remark is that we can multiply both sides of by any real number , we obtain the same . With that reason, we can suppose that,
for all points on the hyperplane, and the problem becomes,
The second line due to the fact that the closet points have distance to the hyperplane, the other points have distance greater than . We can write above problem as,
This is called "primal formulation of linear SVMs". In mathematical optimization, one can prove that the problem has an unique solution (We can get an unique hyperplane which satisfies the classification problem). Such problems are generally called quadratic programming problems.
where are vectors with elements and is a vector with elements. We need to minimize this Lagrangian w.r.t. and simultaneously require that the derivative w.r.t. vanishes, all subject to the constraints that . If we set the derivatives w.r.t. , we obtain,
We substitute the above into the equation for and obtain "dual formulation of linear SVMs",
in that, is defined in terms of : and the solution becomes
Then given a new instance , the classifier is,
The benefits of using dual formulation are:[ref, slide 52]
- No need to access original data, need to access only dot products .
- Number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems).
Using SVM with kernel trick
Most of the time, we cannot separate classes in the current dataset easily (not linearly separable data). We need to use kernel trick first (transform from the current dimension to a higher dimension) and then we use SVM. These classes are not linearly separable.
An idea of kernel and SVM. Transform from 1D to 2D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.
An idea of kernel and SVM. Transform from 2D to 3D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.
Original data (in input space),[ref, slide 59]
Data in a higher dimensional feature space ,
We can rewrite as,
Therefore, we do not need to know explicitly, we just need to define a kernel function . However, not every function can be a valid kernel. It has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
A kernel is a dot product in some feature space:
It also measures the similarity between two points and .
We have some popular kernels,
- Linear kernel: . We use
kernel = 'linear'in
sklearn.svm.SVM. Linear kernels are rarely used in practice.
- Gaussian kernel (or Radial Basic Function -- RBF): . It's used the most. We use
kernel = 'rbf'(default) with keyword
gammafor (must be greater than ) in
- Exponential kernel: .
- Polynomial kernel: . We use
kernel = 'poly'with keyword
sklearn.svm.SVM. It's more popular than RBF in NLP. The most common degree is (quadratic), since larger degrees tend to overfit on NLP problems.[ref]
- Hybrid kernel: .
- Sigmoidal: . We use
kernel = 'sigmoid'with keyword
We can also define a custom kernel thanks to this help.
Choose whatever kernel performs best on cross-validation data. Andrew NG said in his ML course.
Good or Bad?
- Compared to both logistic regression and NN, a SVM sometimes gives a cleaner way of learning non-linear functions.
- SVM is better than NN with 1 layer (Perceptron Learning Algorithm) thanks to the largest margin between 2 classes.
- Accurate in high-dimensional spaces + memory effecient.
- Good accuracy and perform faster prediction compared to Naïve Bayes algorithm.[ref]
- Prone to overfitting: if number of features are larger than number of samples.
- Don't provide probability estimation.
- Not efficient if your data is very big!
- It works poorly with overlapping classes
- Sensitive to the type of kernel used.
SVM used for?
- Classification, regression and outliers detection.
- Face detection.
- Text and hypertext categorization.
- Detecting spam.
- Classification of images.
Using SVM with Scikit-learn
from sklearn.svm import SVC
svc = SVC(kernel='linear') # default = 'rbf' (Gaussian kernel)
# other kernels: poly, sigmoid, precomputed or a callable
svc = svc.fit(X, y)
# gives the support vectors
There are other parameters of
In the case of linear SVM, we can also use
sklearn.svm.LinearSVC. It's similar to
kernel='linear' but implemented in terms of
liblinear rather than
libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.[ref]
Meaning of some parameters
The Regularization parameter (
C is larger, hyperplane has smaller margin but do a better job of classification and otherwise. This is how you can control the trade-off between decision boundary and misclassification term.
- Higher values of
Ca higher possibility of overfitting, the softmargin SVM is equivalent to the hard-margin SVM.
- Lower values of
Ca higher possibility of underfitting. We admit misclassifications in the training data
We use this in the case of not linearly separable data; It's also called soft-margin linear SVM.
An illustration of using
An illustration of using
C, smaller margin.[ref]
Recall that the hard-margin problem is,
and its duality is,
Instead of considering a hard-margin , we consider following soft-margin problem (with the addition of slack variables)
An introduction to slack variables.[ref]
With these new slack variables, we have to decide the trade-off between maximizing the margin (term ) and minimizing the mistakes (term ).
When is big, the term is almost considered as in the minimized problem and the problem focuses on minimizing the term (avoiding misclassification). That's why the margin looks more narrow in the case of bigger .
A dual formulation of above soft-margin problem is,
Note that, looks like (duality of hard-margin problem) but condition .
gamma='auto' which uses
1/n_features): determine the number of points to construct the hyperplane.
An illustration of using
gamma. In high-gamma case, we only consider points nearby the hyperplane, it may cause an overfitting.
Bigger gamma, more change to get overfitting (in a XOR problem).
gammain the Gaussian kernel
We have another form of Gaussian kernel which is,
where is the standard deviation which shows the spread of the data.
Compare to the one used in the scikit-learn, , we see that is an inverse of . It implies,
- When is bigger (or is smaller), the similarity between two points and are considered in a wide range (spreading widely).
- Conversely, when is smaller (or is bigger), only two points and which are really near to each other will be considered to be similar. That's why we see there are many "groups" in the figure corresponding to . It leads to an overfitting problem.
SVM in action
- What's interesting?
- We will use a principal component analysis to extract 150 fundamental components to feed into our support vector machine classifier.
- Grid search cross-validation to explore combinations of parameters.
- For a real-world facial recognition task, in which the photos do not come precropped into nice grids, the only difference in the facial classification scheme is the feature selection: you would need to use a more sophisticated algorithm to find the faces, and extract features that are independent of the pixellation. For this kind of application, one good option is to make use of OpenCV, which, among other things, includes pre-trained implementations of state-of-the-art feature extraction tools for images in general and faces in particular.
- I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs. Nevertheless, if you have the CPU cycles to commit to training and cross-validating an SVM on your data, the method can lead to excellent results.
- Scikit-learn -- SVM official doc.
- Simplilearn -- How Support Vector Machine Works | SVM In Machine Learning.
- Tiep Vu -- Bài 19: Support Vector Machine.
- Jeremy Kun -- Formulating the Support Vector Machine Optimization Problem.
- Tiep Vu -- Bài 20: Soft Margin Support Vector Machine.
- Tiep Vu - Bài 21: Kernel Support Vector Machine.
- Alexander Statnikov, Douglas Hardin, Isabelle Guyon, Constantin F. Aliferis -- A Gentle Introduction to Support Vector Machines in Biomedicine.
- Jake VanderPlas -- In-Depth: Support Vector Machines. -- Example: How to code and illustrate hyperplane and support vectors in Python?
- Chris Albon -- Notes about Support Vector Machines.
- Andrew NG -- My raw note when I learned the course Machine Learning on Coursera.