Support Vector Machine Basics

Published November 08, 2023

Assembled from SVM notes and CQF Exam 3 Question 2


Hyperplane interpretation

Support Vector Machines are best understood through analysis of a hyperplane in standard form:

\boldsymbol{\theta}^T \mathbf{x} + \theta_0 = 0
  • Varying \theta_j for j > 0 has the effect of rotating the hyperplane about its x_j = 0 intercept (points with x_j = 0 remain on the hyperplane).

  • Varying \theta_0 has the effect of translating the hyperplane in relation to the origin.

  • Points to one side of the hyperplane (in 2D, "above", w.l.o.g.) are those which satisfy \boldsymbol{\theta}^T \mathbf{x} + \theta_0 > 0

  • Equidistant parallel hyperplanes which form a margin can be represented as

    \boldsymbol{\theta}^T \mathbf{x} + \theta_0 = \pm 1

    Critical to understanding: scaling \boldsymbol{\theta} and \theta_0 together (i.e. the left side of the equation) expands or contracts this margin without moving the separating hyperplane. This means that the constraint y_i (\boldsymbol{\theta}^T \mathbf{x}_i + \theta_0) \geq 1 \; \forall i can always be satisfied for linearly separable data, as the short numeric sketch below illustrates!
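Here is a minimal numeric sketch of that scaling argument. The toy points and the hand-picked separating hyperplane are illustrative assumptions, not part of the original notes; the point is only that any separating hyperplane can be rescaled so the closest point has functional margin exactly 1.

```python
# A minimal sketch of the scaling argument; the toy data and the hand-picked
# separating hyperplane are illustrative assumptions.
import numpy as np

# Two linearly separable point clouds with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# Any separating hyperplane theta^T x + theta_0 = 0, chosen by eye here.
theta, theta_0 = np.array([1.0, 1.0]), 0.0

margins = y * (X @ theta + theta_0)   # functional margins y_i(theta^T x_i + theta_0)
print(margins.min())                  # positive (separated), but not necessarily >= 1

# Rescaling theta and theta_0 by the same constant leaves the hyperplane
# unchanged but stretches the functional margins until the smallest equals 1.
scale = 1.0 / margins.min()
theta, theta_0 = scale * theta, scale * theta_0
print((y * (X @ theta + theta_0)).min())   # exactly 1 -> constraint satisfied
```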

The width of the margin of the SVM is given by \frac{2}{\| \boldsymbol{\theta} \|_2}, so maximizing the margin is equivalent to solving

\begin{aligned} &\min_{\boldsymbol{\theta}, \theta_0} \frac{1}{2} \|\boldsymbol{\theta}\|_2^2 \\ \text{subject to} \quad &y_i (\boldsymbol{\theta}^T \mathbf{x}_i + \theta_0) \geq 1 \quad \forall i \end{aligned}
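As a hedged, concrete illustration of this optimization (the toy data are invented for the sketch): sklearn's SVC with a linear kernel and a very large C approximates the hard-margin problem, the fitted coef_ and intercept_ play the roles of \boldsymbol{\theta} and \theta_0, and the margin width can be read off as 2 / \|\boldsymbol{\theta}\|_2.

```python
# A sketch: approximating the hard-margin SVM with sklearn's SVC by using a
# very large C (so slack is effectively forbidden). Data are invented.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin

theta, theta_0 = clf.coef_[0], clf.intercept_[0]
width = 2.0 / np.linalg.norm(theta)           # margin width 2 / ||theta||_2
print(theta, theta_0, width)

# Support vectors sit on the margin hyperplanes theta^T x + theta_0 = +/- 1.
print(clf.support_vectors_)
```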

For soft-margin SVMs, the cost function is

J = \frac{1}{N} \sum_{n=1}^N \max \left( 1 - y^{(n)} \left( \boldsymbol{\theta}^T \mathbf{x}^{(n)} + \theta_0 \right), 0 \right) + \lambda \|\boldsymbol{\theta}\|_2^2

where the regularization parameter \lambda determines the relative weight given to margin maximization (the regularization term) versus misclassification (the hinge-loss term).
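To make the cost concrete, here is a small numpy evaluation of J; the choices of \boldsymbol{\theta}, \theta_0, \lambda and the toy data are arbitrary assumptions for the sketch, not taken from the notes.

```python
# A minimal sketch of evaluating the soft-margin cost J above with numpy.
# theta, theta_0, lam, and the toy data are all illustrative choices.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
theta, theta_0, lam = np.array([0.5, 0.5]), 0.0, 0.01

hinge = np.maximum(1.0 - y * (X @ theta + theta_0), 0.0)  # per-sample hinge loss
J = hinge.mean() + lam * np.dot(theta, theta)             # mean hinge + lambda * ||theta||_2^2
print(J)
```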

Regularization parameter

The regularization parameter in a Support Vector Machine is a hyperparameter that determines the relative weight of the squared L_2 regularization term in the loss function.

In the popular library sklearn, C is the weight applied to misclassification, so C is strictly positive and inversely proportional to the strength of regularization.

The objective function of a standard SVM with a linear kernel, for instance, can be written as:

\begin{aligned} &\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \\ \text{subject to} \quad &y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i \end{aligned}
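A hedged sketch of this C-parameterized problem using sklearn's SVC on invented, slightly overlapping data: comparing two values of C shows how the margin width and the number of support vectors respond.

```python
# A sketch of the C-parameterized soft-margin problem with sklearn's SVC.
# The overlapping toy data and the two C values are illustrative choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=-1.0, scale=1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)   # wider margin for small C
    n_sv = len(clf.support_)           # typically more support vectors for small C
    print(f"C={C:>6}: margin width={margin:.2f}, support vectors={n_sv}")
```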

Regularization induces a penalty for model complexity. In the context of SVMs, the regularization term \frac{1}{2} \|\mathbf{w}\|^2 promotes model simplicity, i.e. large margins. In the context of soft-margin SVMs, larger margins reduce variance and increase bias. Therefore, varying C has the following effect on the model's bias/variance trade-off:

  • Larger C: Large values of C reduce regularization and penalize misclassification more heavily. This leads to smaller margins and a more complex model with increased variance and reduced bias. Values of C that are too large can lead to overfitting.

  • Smaller C: Small values of C increase regularization by relaxing the penalty for misclassification. This leads to larger margins and a simpler model with reduced variance but increased bias. Models with very small C are prone to underfitting.

Optimal values for C can be found using cross-validation.
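A minimal sketch of such a search, assuming sklearn's GridSearchCV, a linear kernel, and an arbitrary grid of C values on synthetic data:

```python
# A minimal sketch of selecting C by cross-validation with sklearn's
# GridSearchCV; the grid and the synthetic data are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, 2)),
               rng.normal(-1.0, 1.0, size=(100, 2))])
y = np.array([1] * 100 + [-1] * 100)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # cross-validated choice of C
print(search.best_score_)    # mean accuracy at that C
```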
