Support Vector Machine Basics

Published November 08, 2023

Assembled from SVM notes and CQF Exam 3 Question 2


Hyperplane interpretation

Support Vector Machines are best understood through analysis of a hyperplane in standard form:

\boldsymbol{\theta}^T \mathbf{x} + \theta_0 = 0
  • Varying \theta_j for j > 0 has the effect of rotating the hyperplane about its x_j = 0 intercept (points with x_j = 0 remain on the hyperplane).

  • Varying \theta_0 has the effect of translating the hyperplane in relation to the origin.

  • Points to one side of the hyperplane (in 2D, "above", w.l.o.g.) are those which satisfy \boldsymbol{\theta}^T \mathbf{x} + \theta_0 > 0

  • Equidistant parallel hyperplanes which form a margin can be represented as

    \boldsymbol{\theta}^T \mathbf{x} + \theta_0 = \pm 1

    Critical to understanding: scaling \boldsymbol{\theta} and \theta_0 together (i.e. the left side of the equation) expands or contracts this margin without moving the separating hyperplane. This means that the constraint y_i (\boldsymbol{\theta}^T \mathbf{x}_i + \theta_0) \geq 1 \; \forall i can always be satisfied for linearly separable data, as the short numeric sketch below illustrates!
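Here is a minimal numeric sketch of that scaling argument. The toy points and the hand-picked separating hyperplane are illustrative assumptions, not part of the original notes; the point is only that any separating hyperplane can be rescaled so the closest point has functional margin exactly 1.

```python
# A minimal sketch of the scaling argument; the toy data and the hand-picked
# separating hyperplane are illustrative assumptions.
import numpy as np

# Two linearly separable point clouds with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# Any separating hyperplane theta^T x + theta_0 = 0, chosen by eye here.
theta, theta_0 = np.array([1.0, 1.0]), 0.0

margins = y * (X @ theta + theta_0)   # functional margins y_i(theta^T x_i + theta_0)
print(margins.min())                  # positive (separated), but not necessarily >= 1

# Rescaling theta and theta_0 by the same constant leaves the hyperplane
# unchanged but stretches the functional margins until the smallest equals 1.
scale = 1.0 / margins.min()
theta, theta_0 = scale * theta, scale * theta_0
print((y * (X @ theta + theta_0)).min())   # exactly 1 -> constraint satisfied
```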

The width of the margin of the SVM is given by \frac{2}{\| \boldsymbol{\theta} \|_2}, so maximizing the margin is equivalent to solving

\begin{aligned} &\min_{\boldsymbol{\theta}, \theta_0} \frac{1}{2} \|\boldsymbol{\theta}\|_2^2 \\ \text{subject to} \quad &y_i (\boldsymbol{\theta}^T \mathbf{x}_i + \theta_0) \geq 1 \quad \forall i \end{aligned}
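As a hedged, concrete illustration of this optimization (the toy data are invented for the sketch): sklearn's SVC with a linear kernel and a very large C approximates the hard-margin problem, the fitted coef_ and intercept_ play the roles of \boldsymbol{\theta} and \theta_0, and the margin width can be read off as 2 / \|\boldsymbol{\theta}\|_2.

```python
# A sketch: approximating the hard-margin SVM with sklearn's SVC by using a
# very large C (so slack is effectively forbidden). Data are invented.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin

theta, theta_0 = clf.coef_[0], clf.intercept_[0]
width = 2.0 / np.linalg.norm(theta)           # margin width 2 / ||theta||_2
print(theta, theta_0, width)

# Support vectors sit on the margin hyperplanes theta^T x + theta_0 = +/- 1.
print(clf.support_vectors_)
```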

For soft-margin SVMs, the cost function is

J = \frac{1}{N} \sum_{n=1}^N \max \left( 1 - y^{(n)} \left( \boldsymbol{\theta}^T \mathbf{x}^{(n)} + \theta_0 \right), 0 \right) + \lambda \|\boldsymbol{\theta}\|_2^2

where the regularization parameter \lambda determines the relative weight given to margin maximization (the regularization term) versus misclassification (the hinge-loss term).
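To make the cost concrete, here is a small numpy evaluation of J; the choices of \boldsymbol{\theta}, \theta_0, \lambda and the toy data are arbitrary assumptions for the sketch, not taken from the notes.

```python
# A minimal sketch of evaluating the soft-margin cost J above with numpy.
# theta, theta_0, lam, and the toy data are all illustrative choices.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
theta, theta_0, lam = np.array([0.5, 0.5]), 0.0, 0.01

hinge = np.maximum(1.0 - y * (X @ theta + theta_0), 0.0)  # per-sample hinge loss
J = hinge.mean() + lam * np.dot(theta, theta)             # mean hinge + lambda * ||theta||_2^2
print(J)
```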

Regularization parameter

The regularization parameter in a Support Vector Machine is a hyperparameter that determines the relative weight of the squared L_2 regularization term in the loss function.

In the popular library sklearn, C is the weight applied to misclassification, so C is strictly positive and inversely proportional to the strength of regularization.

The objective function of a standard SVM with a linear kernel, for instance, can be written as:

\begin{aligned} &\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \\ \text{subject to} \quad &y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i \end{aligned}
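A hedged sketch of this C-parameterized problem using sklearn's SVC on invented, slightly overlapping data: comparing two values of C shows how the margin width and the number of support vectors respond.

```python
# A sketch of the C-parameterized soft-margin problem with sklearn's SVC.
# The overlapping toy data and the two C values are illustrative choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=-1.0, scale=1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)   # wider margin for small C
    n_sv = len(clf.support_)           # typically more support vectors for small C
    print(f"C={C:>6}: margin width={margin:.2f}, support vectors={n_sv}")
```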

Regularization induces a penalty for model complexity. In the context of SVMs, the regularization term \frac{1}{2} \|\mathbf{w}\|^2 promotes model simplicity, i.e. large margins. In the context of soft-margin SVMs, larger margins reduce variance and increase bias. Therefore, varying C has the following effect on the model's bias/variance trade-off:

  • Larger C: Large values of C reduce regularization and penalize misclassification more heavily. This leads to smaller margins and a more complex model with increased variance and reduced bias. Values of C that are too large can lead to overfitting.

  • Smaller C: Small values of C increase regularization by relaxing the penalty for misclassification. This leads to larger margins and a simpler model with reduced variance but increased bias. Models with very small C are prone to underfitting.

Optimal values for C can be found using cross-validation.
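A minimal sketch of such a search, assuming sklearn's GridSearchCV, a linear kernel, and an arbitrary grid of C values on synthetic data:

```python
# A minimal sketch of selecting C by cross-validation with sklearn's
# GridSearchCV; the grid and the synthetic data are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, 2)),
               rng.normal(-1.0, 1.0, size=(100, 2))])
y = np.array([1] * 100 + [-1] * 100)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # cross-validated choice of C
print(search.best_score_)    # mean accuracy at that C
```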
