Support Vector Machine (SVM)

10-01-2021 / Edit on Github

What's the idea of SVM? #

SVM (also called Maximum Margin Classifier) is an algorithm that takes the data as an input and outputs a line/hyperplane that separates those classes if possible.

Suppose that we need to separate two classes of a dataset. The task is to find a line to separate them. However, there are many lines which can do that (countless number of lines). How can we choose the best one?

An idea of support vectors and SVM.
An idea of support vectors (samples on the margin) and SVM (find the optimal hyperplane).

More mathematical details on finding a hyperplane

We need to find a hyperplane (H)(H): wTx+b=0\mathbf{w}^T\mathbf{x} + b = 0 where the weights w=(w1,…,wd)\mathbf{w}=(w_1,\ldots,w_d) and a point x=(x1,…,xd)\mathbf{x}=(x_1,\ldots,x_d). For example, in 2D (d=2d=2), we need to find a hyperplane w1x1+w2x2+b=0w_1x_1 + w_2x_2 + b=0. Note also that, the distance between a point x0\mathbf{x}_0 and (H)(H) is given by,

d(x0,H)=∣wTx0+b∣βˆ₯wβˆ₯2,(1)\text{d}(\mathbf{x}_0, H) = \frac{\vert\mathbf{w}^T\mathbf{x}_0 + b\vert}{\Vert\mathbf{w}\Vert_2}, \quad (1)

where βˆ₯wβˆ₯2=βˆ‘i=1dwi2\Vert\mathbf{w}\Vert_2 = \sqrt{\sum_{i=1}^d w_i^2}.

In order to understand the idea, we consider a 2D case with the classification of 2 classes (blue faces are numbered as "+1+1" and orange faces are numbered as "βˆ’1-1").

Finding a hyperplane.
We need to find an optimal hyperplane between 2 classes (find w1,w2w_1, w_2 and bb).

Recall that a margin (geometric margin) is the minimum distance between a hyperplane and the closest point(s) to it. Thanks to (1)(1), we can find the margin to a hyperplane by determining the closet distance from an arbitrary points (xi,yi)(\mathbf{x}_i, y_i) to that hyperplane via,

margin=min⁑iyi(wTxi+b)∣∣w∣∣2.(2)\text{margin} = \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{||\mathbf{w}||_2}. \quad (2)

Note that, because yiy_i takes values βˆ’1-1 or +1+1 and it always has the same sign as wTxi+b\mathbf{w}^T\mathbf{x}_i + b, yi(wTxi+b)y_i(\mathbf{w}^T\mathbf{x}_i + b) is always positive.

The SVM problem is to find (w,b\mathbf{w}, b) so that the hard-margin (2)(2) has the maximum value, i.e.,

(w,b)=arg⁑max⁑w,b(min⁑iyi(wTxi+b)βˆ₯wβˆ₯2)=arg⁑max⁑w,b(1βˆ₯wβˆ₯2min⁑iyi(wTxi+b))(3)\begin{aligned} (\mathbf{w}, b) &= \arg \max_{\mathbf{w}, b} \left( \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\Vert\mathbf{w}\Vert_2} \right)\\ &= \arg \max_{\mathbf{w}, b}\left( \frac{1}{\Vert\mathbf{w}\Vert_2} \min_i y_i(\mathbf{w}^T\mathbf{x}_i + b) \right) \end{aligned} \quad (3)

(arg⁑\arg means you need to find w,b\mathbf{w},b so that the function reaches the max⁑\max.)

This is an optimal problem. Another remark is that we can multiply both sides of (H)(H) by any real number k≠0k\ne 0, we obtain the same (H)(H). With that reason, we can suppose that,

yi(wTxi+b)=1,y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1,

for all points on the hyperplane, and the problem (3)(3) becomes,

(w,b)=arg⁑max⁑w,b1βˆ₯wβˆ₯2subject to  yi(wTxi+b)β‰₯1,βˆ€i=1,2,…,N.\begin{aligned} &(\mathbf{w}, b) = \arg\max_{\mathbf{w}, b} \frac{1}{\Vert\mathbf{w}\Vert_2} \\ \text{subject to} &~~ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \forall i = 1, 2, \dots, N. \end{aligned}

The second line due to the fact that the closet points have distance 11 to the hyperplane, the other points have distance greater than 11. We can write above problem as,

(w,b)=arg⁑min⁑w,b12βˆ₯wβˆ₯22subject to  yi(wTxi+b)βˆ’1β‰₯0,βˆ€i=1,2,…,N.(4)\begin{aligned} &(\mathbf{w}, b) = \arg\min_{\mathbf{w}, b} \frac{1}{2}\Vert\mathbf{w}\Vert_2^2 \\ \text{subject to} &~~ y_i(\mathbf{w}^T\mathbf{x}_i + b)-1 \geq 0, \forall i = 1, 2, \dots, N. \end{aligned} \quad (4)

This is called "primal formulation of linear SVMs". In mathematical optimization, one can prove that the problem (4)(4) has an unique solution (We can get an unique hyperplane (H)(H) which satisfies the classification problem). Such problems are generally called quadratic programming problems.

Problem (4)(4) can be solved "more easily" by considering its dual formulation. Apply the method of Lagrange multipliers, we define a Lagrangian,

Ξ›(w,b,Ξ»)=12βˆ₯wβˆ₯22βˆ’βˆ‘i=1NΞ»i(yi(wTxi+b)βˆ’1),\Lambda(\mathbf{w},b,\lambda) = \dfrac{1}{2}\Vert \mathbf{w}\Vert_2^2 - \sum_{i=1}^N \lambda_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1),

where w,xi\mathbf{w}, \mathbf{x}_i are vectors with dd elements and Ξ»\lambda is a vector with NN elements. We need to minimize this Lagrangian w.r.t. w,b\mathbf{w}, b and simultaneously require that the derivative w.r.t. Ξ»\lambda vanishes, all subject to the constraints that Ξ»iβ‰₯0\lambda_i \ge 0. If we set the derivatives w.r.t. w,b\mathbf{w}, b, we obtain,

βˆ‚Ξ›(w,b,Ξ»)βˆ‚b=0β‡’βˆ‘i=1NΞ»iyi=0,βˆ‚Ξ›(w,b,Ξ»)βˆ‚w=0β‡’w=βˆ‘i=1NΞ»iyixi.\begin{aligned} \dfrac{\partial \Lambda(\mathbf{w},b,\lambda)}{\partial b} &= 0 \Rightarrow \sum_{i=1}^N\lambda_iy_i=0, \\ \dfrac{\partial \Lambda(\mathbf{w},b,\lambda)}{\partial \mathbf{w}} &= 0 \Rightarrow \mathbf{w}=\sum_{i=1}^N\lambda_iy_i\mathbf{x}_i. \end{aligned}

We substitute the above into the equation for Ξ“(w,b,Ξ»)\Gamma(\mathbf{w},b,\lambda) and obtain "dual formulation of linear SVMs",

Ξ»=arg⁑max⁑λ(βˆ‘i=1NΞ»iβˆ’12βˆ‘i=1Nβˆ‘j=1NΞ»iΞ»jyiyjxiTxj)subject to  Ξ»iβ‰₯0,βˆ‘i=1NΞ»iyi=0, βˆ€i=1,…,N.(5)\begin{aligned} &\lambda = \arg\max_{\lambda} \left( \sum_{i=1}^N \lambda_i -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \right) \\ \text{subject to} &~~ \lambda_i \ge 0, \sum_{i=1}^N\lambda_iy_i = 0, ~\forall i=1,\ldots,N. \end{aligned} \quad (5)

in that, w\mathbf{w} is defined in terms of Ξ»i\lambda_i: w=βˆ‘1NΞ»iyixi\mathbf{w} = \sum_1^N\lambda_iy_i\mathbf{x}_i and the solution becomes

f(x)=sign(Ξ£1NΞ»iyixiTxj+b).f(\mathbf{x})=\text{sign}(\Sigma_1^N\lambda_iy_i\mathbf{x}_i^T\mathbf{x}_j + b).

Then given a new instance x\mathbf{x}, the classifier is,

f(x)=sign(wTx+b).f(\mathbf{x})=\text{sign}(\mathbf{w}^T\mathbf{x} + b).

The benefits of using dual formulation are:[ref, slide 52]

  • No need to access original data, need to access only dot products xiTxj\mathbf{x}_i^T\mathbf{x}_j.
  • Number of free parameters is bounded by the number of support vectors and not by the number of variables
    (beneficial for high-dimensional problems).

Read more in this post (Vietnamese), this slide or this post.

Using SVM with kernel trick #

Most of the time, we cannot separate classes in the current dataset easily (not linearly separable data). We need to use kernel trick first (transform from the current dimension to a higher dimension) and then we use SVM. These classes are not linearly separable.

An idea of kernel and SVM (1D to 2D).
An idea of kernel and SVM. Transform from 1D to 2D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.

An idea of kernel and SVM (2D to 3D).
An idea of kernel and SVM. Transform from 2D to 3D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.

More mathematical details

Original data x\mathbf{x} (in input space),[ref, slide 59]

f(x)=sign(wTx+b),w=βˆ‘i=1NΞ»iyixi\begin{aligned} f(\mathbf{x}) =\text{sign}(\mathbf{w}^T\mathbf{x} + b), \quad \mathbf{w} =\sum_{i=1}^N\lambda_iy_i\mathbf{x}_i \end{aligned}

Data in a higher dimensional feature space Ξ¦(x)\Phi(\mathbf{x}),

f(x)=sign(wTΞ¦(x)+b),w=βˆ‘i=1NΞ»iyiΞ¦(xi)\begin{aligned} f(\mathbf{x}) =\text{sign}(\mathbf{w}^T\Phi(\mathbf{x}) + b), \quad \mathbf{w} =\sum_{i=1}^N\lambda_iy_i\Phi(\mathbf{x}_i) \end{aligned}

We can rewrite f(x)f(\mathbf{x}) as,

f(x)=sign(βˆ‘i=1NΞ»iyiΞ¦(xi)TΞ¦(x)+b),f(\mathbf{x}) =\text{sign} \left( \sum_{i=1}^N\lambda_iy_i\Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}) + b \right),


f(x)=sign(βˆ‘i=1NΞ»iyiK(xi,x)+b),K(xi,x)=Ξ¦(xi)TΞ¦(x).\begin{aligned} f(\mathbf{x}) &=\text{sign}(\sum_{i=1}^N\lambda_iy_iK(\mathbf{x}_i,\mathbf{x}) + b), \\ K(\mathbf{x}_i,\mathbf{x}) &= \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}). \end{aligned}

Therefore, we do not need to know Φ\Phi explicitly, we just need to define a kernel function K(⋅,⋅):Rd×Rd→RK(\cdot,\cdot): \mathbb{R}^d\times \mathbb{R}^d \to \mathbb{R}. However, not every function Rd×Rd→R\mathbb{R}^d\times \mathbb{R}^d \to \mathbb{R} can be a valid kernel. It has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.

A kernel is a dot product in some feature space:

K(xi,xj)=Ξ¦(xi,xj).K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i, \mathbf{x}_j).

It also measures the similarity between two points xi\mathbf{x}_i and xj\mathbf{x}_j.

We have some popular kernels,

  • Linear kernel: K(xi,xj)=xiβ‹…xjK(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j. We use kernel = 'linear' in sklearn.svm.SVM. Linear kernels are rarely used in practice.
  • Gaussian kernel (or Radial Basic Function -- RBF): K(xi,xj)=exp⁑(βˆ’Ξ³βˆ₯xiβˆ’xjβˆ₯2)K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2). It's used the most. We use kernel = 'rbf' (default) with keyword gamma for Ξ³\gamma (must be greater than 00) in sklearn.svm.SVM.
  • Exponential kernel: K(xi,xj)=exp⁑(βˆ’Ξ³βˆ₯xiβˆ’xjβˆ₯)K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert).
  • Polynomial kernel: K(xi,xj)=(r+Ξ³xiβ‹…xj)dK(\mathbf{x}_i, \mathbf{x}_j) = (r+\gamma\mathbf{x}_i \cdot \mathbf{x}_j)^d. We use kernel = 'poly' with keyword degree for dd and coef0 for rr in sklearn.svm.SVM. It's more popular than RBF in NLP. The most common degree is d=2d = 2 (quadratic), since larger degrees tend to overfit on NLP problems.[ref]
  • Hybrid kernel: K(xi,xj)=(p+xiβ‹…xj)qexp⁑(βˆ’Ξ³βˆ₯xiβˆ’xjβˆ₯2)K(\mathbf{x}_i, \mathbf{x}_j) = (p+\mathbf{x}_i \cdot \mathbf{x}_j)^q\exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2).
  • Sigmoidal: K(xi,xj)=tanh⁑(Ξ³xiβ‹…xj+r)K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma\mathbf{x}_i \cdot \mathbf{x}_j+r). We use kernel = 'sigmoid' with keyword coef0 for rr in sklearn.svm.SVM.

We can also define a custom kernel thanks to this help.

Choose whatever kernel performs best on cross-validation data. Andrew NG said in his ML course.

Examples of choosing a kernel

SVM in XOR problem.
Using SVM with 3 different kernels in a XOR problem. In this case, Gaussian kernel is the choice.[ref]

In the case of almost linearly separable data.
Using SVM with 3 different kernels in the case of almost linearly separable data. In this case, Polynomial kernel is the choice.

Good or Bad? #


  • Compared to both logistic regression and NN, a SVM sometimes gives a cleaner way of learning non-linear functions.
  • SVM is better than NN with 1 layer (Perceptron Learning Algorithm) thanks to the largest margin between 2 classes.
  • Accurate in high-dimensional spaces + memory effecient.
  • Good accuracy and perform faster prediction compared to NaΓ―ve Bayes algorithm.[ref]


  • Prone to overfitting: if number of features are larger than number of samples.
  • Don't provide probability estimation.
  • Not efficient if your data is very big!
  • It works poorly with overlapping classes
  • Sensitive to the type of kernel used.

SVM used for? #

Some points:[re]

  • Classification, regression and outliers detection.
  • Face detection.
  • Text and hypertext categorization.
  • Detecting spam.
  • Classification of images.
  • Bioinformatics.

Using SVM with Scikit-learn #

from sklearn.svm import SVC

svc = SVC(kernel='linear') # default = 'rbf' (Gaussian kernel)
# other kernels: poly, sigmoid, precomputed or a callable

svc =, y)

# gives the support vectors

There are other parameters of sklearn.svm.SVM.

In the case of linear SVM, we can also use sklearn.svm.LinearSVC. It's similar to sklearn.svm.SVG with kernel='linear' but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.[ref]

Meaning of some parameters #

The Regularization parameter (C, default C=1.0): if C is larger, hyperplane has smaller margin but do a better job of classification and otherwise. This is how you can control the trade-off between decision boundary and misclassification term.

  • Higher values of C β‡’\Rightarrow a higher possibility of overfitting, the softmargin SVM is equivalent to the hard-margin SVM.
  • Lower values of C β‡’\Rightarrow a higher possibility of underfitting. We admit misclassifications in the training data

We use this in the case of not linearly separable data; It's also called soft-margin linear SVM.

The Regularization parameter.
An illustration of using C.

The Regularization parameter.
An illustration of using C. Bigger C, smaller margin.[ref]

More mathematical details on CC and soft-margin problems

Recall that the hard-margin problem is,

(w,b)=arg⁑min⁑w,b12βˆ₯wβˆ₯22subject to  yi(wTxi+b)βˆ’1β‰₯0,βˆ€i=1,2,…,N.(4 revisited)\begin{aligned} &(\mathbf{w}, b) = \arg \min_{\mathbf{w}, b} \frac{1}{2}\Vert\mathbf{w}\Vert_2^2 \\ \text{subject to} &~~ y_i(\mathbf{w}^T\mathbf{x}_i + b) -1 \geq 0, \forall i = 1, 2, \dots, N. \end{aligned} \quad (4~\text{revisited})

and its duality is,

Ξ»=arg⁑max⁑λ(βˆ‘i=1NΞ»iβˆ’12βˆ‘i=1Nβˆ‘j=1NΞ»iΞ»jyiyjxiTxj)subject to  Ξ»iβ‰₯0,βˆ‘i=1NΞ»iyi=0, βˆ€i=1,…,N.(5 revisited)\begin{aligned} &\lambda = \arg\max_{\lambda} \left( \sum_{i=1}^N \lambda_i -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \right) \\ \text{subject to} &~~ \lambda_i \ge 0, \sum_{i=1}^N\lambda_iy_i = 0, ~\forall i=1,\ldots,N. \end{aligned} \quad (5~\text{revisited})

Instead of considering a hard-margin (4)(4), we consider following soft-margin problem (with the addition of slack variables)

The slack variables.
An introduction to slack variables.[ref]

(w,b,ΞΎ)=arg⁑min⁑w,b,ΞΎ(12βˆ₯wβˆ₯22+Cβˆ‘i=1NΞΎi)subject to  yi(wTxi+b)β‰₯1βˆ’ΞΎi,βˆ€i=1,…,N  ΞΎi≀0, βˆ€i=1,2,…,N\begin{aligned} &(\mathbf{w}, b, \xi) = \arg\min_{\mathbf{w}, b, \xi} \left( \frac{1}{2}{\Vert\mathbf{w}\Vert_2^2} + C \sum_{i=1}^N \xi_i \right) \\ \text{subject to} &~~ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \forall i = 1, \dots, N \\ &~~ \xi_i \leq 0, ~\forall i = 1, 2, \dots, N \end{aligned}

With these new slack variables, we have to decide the trade-off between maximizing the margin (term 12βˆ₯wβˆ₯22\frac{1}{2}\Vert \mathbf{w}\Vert_2^2) and minimizing the mistakes (term CΞ£1nΞΎiC\Sigma_1^n\xi_i).

When CC is big, the term 12βˆ₯wβˆ₯22\frac{1}{2}\Vert \mathbf{w}\Vert_2^2 is almost considered as 00 in the minimized problem and the problem focuses on minimizing the term CΞ£1nΞΎiC\Sigma_1^n\xi_i (avoiding misclassification). That's why the margin looks more narrow in the case of bigger CC.

A dual formulation of above soft-margin problem is,

Ξ»=arg⁑max⁑λ(βˆ‘i=1NΞ»iβˆ’12βˆ‘i=1Nβˆ‘j=1NΞ»iΞ»jyiyjxiTxj)subject to  0≀λi≀C,βˆ‘i=1NΞ»iyi=0, βˆ€i=1,…,N.(6)\begin{aligned} &\lambda = \arg\max_{\lambda} \left( \sum_{i=1}^N \lambda_i -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \right) \\ \text{subject to} &~~ 0 \le \lambda_i \le C, \sum_{i=1}^N\lambda_iy_i = 0,~\forall i=1,\ldots,N. \end{aligned} \quad (6)

Note that, (6)(6) looks like (5)(5) (duality of hard-margin problem) but condition 0≀λi≀C0 \le \lambda_i \le C.

Gamma (gamma, default gamma='auto' which uses 1/n_features): determine the number of points to construct the hyperplane.

The parameter gamma.
An illustration of using gamma. In high-gamma case, we only consider points nearby the hyperplane, it may cause an overfitting.

gamma parameter.
Bigger gamma, more change to get overfitting (in a XOR problem).

Understanding the idea of gamma in the Gaussian kernel

We have another form of Gaussian kernel which is,

K(xi,xj)=exp(βˆ’βˆ₯xiβˆ’xjβˆ₯22Οƒ2),K(\mathbf{x}_i, \mathbf{x}_j) = \text{exp}\left( - \dfrac{\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2}{2\sigma^2} \right),

where Οƒ\sigma is the standard deviation which shows the spread of the data.

Compare to the one used in the scikit-learn, K(xi,xj)=exp⁑(βˆ’Ξ³βˆ₯xiβˆ’xjβˆ₯2)K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\Vert \mathbf{x}_i - \mathbf{x}_j \Vert^2), we see that Ξ³\gamma is an inverse of Οƒ\sigma. It implies,

  • When Οƒ\sigma is bigger (or Ξ³\gamma is smaller), the similarity between two points xi\mathbf{x}_i and xj\mathbf{x}_j are considered in a wide range (spreading widely).
  • Conversely, when Οƒ\sigma is smaller (or Ξ³\gamma is bigger), only two points xi\mathbf{x}_i and xj\mathbf{x}_j which are really near to each other will be considered to be similar. That's why we see there are many "groups" in the figure corresponding to Ξ³=100\gamma=100. It leads to an overfitting problem.

SVM in action #

  • XOR problem to see the effect of gamma and C in the case of using RBF kernel: html file -- open in colab.

  • Face Recognition[ref]: html file -- open in colab.

    What's interesting?

    Some points:[ref]

    • We will use a principal component analysis to extract 150 fundamental components to feed into our support vector machine classifier.
    • Grid search cross-validation to explore combinations of parameters.
    • For a real-world facial recognition task, in which the photos do not come precropped into nice grids, the only difference in the facial classification scheme is the feature selection: you would need to use a more sophisticated algorithm to find the faces, and extract features that are independent of the pixellation. For this kind of application, one good option is to make use of OpenCV, which, among other things, includes pre-trained implementations of state-of-the-art feature extraction tools for images in general and faces in particular.
    • I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs. Nevertheless, if you have the CPU cycles to commit to training and cross-validating an SVM on your data, the method can lead to excellent results.

References #