Support vector machine decision boundaries with linear and polynomial kernels

## Abstract

Two SVMs from scratch on the same MNIST-like binary classification data. The linear SVM uses hinge loss with L2 regularization and hits 96% accuracy. The kernel SVM uses a degree-3 polynomial kernel in a dual formulation and gets 91%. The linear SVM won, which wasn’t what I expected. Grid search over learning rates and lambda values for both.

Same dataset as the logistic regression post: 784-dimensional binary MNIST data, about 6100 training samples, separate validation and test sets. This time two classifiers, both SVMs, both implemented with NumPy and no external ML libraries.

## Soft-margin linear SVM

### The hinge loss

The SVM uses hinge loss instead of logistic loss:

$$R(w) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \, w^T x_i)$$

Plus an L2 regularization term $\frac{\lambda}{2} \|w\|^2$ added to the gradient (not to the risk computation, which only tracks the hinge loss itself).

The hinge loss is zero when $y_i \, w^T x_i \geq 1$, meaning the sample is on the correct side of the margin. The "soft margin" part is that we allow some samples to violate the margin, penalizing them linearly instead of requiring perfect separation.
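As a sanity check, the unregularized risk above can be computed in a couple of vectorized lines (the names here are hypothetical, not from the post's code):

```python
import numpy as np

def empirical_risk(w, X, y):
    # Average hinge loss: mean of max(0, 1 - y_i * w^T x_i).
    # Matches the risk tracked in the post (regularizer excluded).
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Two toy points that sit exactly on the margin contribute zero loss
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])   # y_i * w^T x_i == 1 for both points
```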

### Subgradient descent

The hinge loss isn't differentiable at $y_i \, w^T x_i = 1$, so technically this is subgradient descent. For each sample where the hinge condition is violated ($y_i \, w^T x_i < 1$), we accumulate $-y_i \, x_i$ into the gradient:

```python
def calculate_gradient(w, lamda):
    # Subgradient of the regularized hinge loss, averaged over all samples
    gradient = np.zeros((1, 784), dtype=np.float128)
    for i in range(N):
        xi = trainingData[i]
        yi = trainingLabels[i]
        if yi * np.dot(w, xi.T) < 1:  # margin violated
            gradient += -yi * xi
    gradient = gradient / N
    gradient += lamda * w  # L2 regularization term
    return gradient
```

Classification is just the sign of $w^T x$: positive means class $+1$, non-positive means class $-1$.
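In code, that decision rule is one line (a sketch; `predict` is my name, not the post's):

```python
import numpy as np

def predict(w, X):
    # Sign of w^T x, with a score of exactly zero mapped to class -1
    # ("non-positive means class -1")
    return np.where(X @ w > 0, 1, -1)

# Example: a weight vector that separates two toy points
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 2.0]])
```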

I searched over 54 combinations: 9 learning rates $\times$ 6 lambda values, each run for $T = 500$ iterations. Most of the larger learning rates produced garbage (50-90% error), but a few combinations in the low learning rate range worked well.
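The search itself is just two nested loops over the grid; a minimal sketch, with hypothetical `train_fn`/`eval_fn` callables standing in for the SVM training loop and the validation-error computation:

```python
import itertools

def grid_search(train_fn, eval_fn, learning_rates, lambdas, T=500):
    # Train one model per (learning rate, lambda) pair and keep the
    # combination with the lowest validation error.
    best_params, best_err = None, float("inf")
    for lr, lam in itertools.product(learning_rates, lambdas):
        model = train_fn(lr, lam, T)
        err = eval_fn(model)
        if err < best_err:
            best_params, best_err = (lr, lam), err
    return best_params, best_err
```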

Best hyperparameters: learning rate = 5e-6, lambda = 0.001.

| Metric   | Training | Validation | Test   |
|----------|----------|------------|--------|
| Error    | 3.26%    | 3.96%      | 3.88%  |
| Accuracy | 96.74%   | 96.04%     | 96.12% |

*Figure: Linear SVM empirical risk vs. iterations. Hinge loss decreasing over 500 iterations for the best hyperparameters.*

Validation and test accuracy are within 0.1% of each other. The model generalizes well, and the regularization is doing its job preventing overfitting on 784 features.

Compared to logistic regression (which achieved a test risk of 0.119 but didn't report classification accuracy directly), the linear SVM at 96.1% accuracy is a strong baseline for a linear model. Both are finding linear boundaries in 784-dimensional space; the difference is in the loss function. The hinge loss focuses on samples near the decision boundary (the support vectors), while logistic loss assigns a nonzero penalty to every sample, even correctly classified ones, through a smooth curve.
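The difference between the two losses is easy to see numerically: past the margin the hinge loss is exactly zero, while the logistic loss $\log(1 + e^{-m})$ is small but never zero (a sketch; the margin values are chosen arbitrarily):

```python
import numpy as np

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])   # m = y * w^T x
hinge = np.maximum(0.0, 1.0 - margins)
logistic = np.log1p(np.exp(-margins))            # log(1 + e^{-m})

# hinge is 0 for m >= 1; logistic stays positive for every finite margin
```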

## Soft-margin kernel SVM

### The polynomial kernel

The kernel SVM works in a higher-dimensional feature space without computing the transformation explicitly. I used a degree-3 polynomial kernel:

$$K(x, y) = \left(\frac{x \cdot y}{784} + 1\right)^3$$

The division by 784 normalizes for the feature dimension. Without it, the dot products of 784-dimensional vectors are large, and raising them to the third power makes everything explode.
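A sketch of the kernel and the precomputed kernel matrix (the function name is mine):

```python
import numpy as np

D = 784  # feature dimension used for scaling

def poly_kernel(X, Y, degree=3):
    # (x . y / 784 + 1)^3 for every pair of rows of X and Y
    return (X @ Y.T / D + 1.0) ** degree

# Kernel matrix over a (toy) training set: K[i, j] = K(x_i, x_j)
X = np.ones((2, D))
K = poly_kernel(X, X)
```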

The data also needed normalization before computing kernels:

```python
trainingData_norm = (trainingData / 128.0) - 1.0
```

This maps pixel values from $[0, 255]$ to $[-1, 1]$. The logistic regression and linear SVM worked on raw pixel values, but the kernel SVM is more sensitive to scale because the polynomial kernel amplifies differences.

### Dual formulation

Instead of optimizing a weight vector $w$ directly, the kernel SVM optimizes dual coefficients $\alpha_i$ (one per training sample) and a bias $b$. The prediction for a new point $x$ is:

$$f(x) = \sum_{i=1}^{N} \alpha_i \, K(x, x_i) + b$$
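Given trained dual coefficients, that prediction is a kernel-weighted sum (a sketch with hypothetical names; `kernel_row(x, train_X)` is assumed to return the vector of $K(x, x_i)$ values):

```python
import numpy as np

def predict_dual(x, alpha, b, train_X, kernel_row):
    # f(x) = sum_i alpha_i * K(x, x_i) + b; the sign gives the class
    k_vals = kernel_row(x, train_X)
    return np.sign(k_vals @ alpha + b)

# Toy check with a plain linear kernel K(x, y) = x . y
linear_row = lambda x, X: X @ x
train_X = np.array([[1.0, 0.0], [0.0, 1.0]])
alpha = np.array([1.0, -1.0])
```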

The gradient with respect to $\alpha$ involves the precomputed kernel matrix (an $N \times N$ matrix of all pairwise kernel values between training samples). For each training point where the hinge condition is violated, we accumulate kernel-weighted gradients plus an L2 regularization term:

```python
def calculate_gradient_a(a, k_a_b, lamda):
    # k is the precomputed N x N kernel matrix; k_a_b holds K @ a + b,
    # i.e. f(x_p) for every training point p
    gradient = np.zeros((N, 1), dtype=np.float128)
    for p in range(N):
        yp = trainingLabels[p]
        if 1 - yp * k_a_b[p] > 0:  # margin violated
            gradient += (-yp * k[p].T).reshape(N, 1)
    gradient = gradient / N
    gradient += (lamda / 2) * np.dot((k + k.T), a)  # L2 term in the dual
    return gradient
```

The bias gradient is simpler: sum $-y_i$ over all hinge-violated points and divide by $N$.
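That bias update fits in a couple of lines (a vectorized sketch, taking the labels as an argument rather than reading a global like the post's code):

```python
import numpy as np

def calculate_gradient_b(labels, k_a_b):
    # Average of -y_i over all margin-violating points:
    # those with 1 - y_i * f(x_i) > 0, where k_a_b holds f(x_i)
    violated = 1.0 - labels * k_a_b > 0
    return np.sum(-labels[violated]) / len(labels)
```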

### Grid search

120 combinations: 12 learning rates $\times$ 10 lambda values, $T = 500$ iterations each. This took a while. The kernel matrix alone is $6107 \times 6107$, and every gradient computation touches the full matrix.

Most combinations with learning rates below 1e-4 didn’t learn anything (stayed at ~46% error, the prior class distribution). The model only started learning with learning rates of 1e-3 and above.

Best hyperparameters: learning rate = 1, lambda = 1e-7.

| Metric   | Training | Validation | Test  |
|----------|----------|------------|-------|
| Error    | 7.67%    | 8.0%       | 9.3%  |
| Accuracy | 92.33%   | 92.0%      | 90.7% |

*Figure: Kernel SVM empirical risk vs. iterations. Classification error decreasing over 500 iterations for the best kernel SVM hyperparameters.*

## Comparing the three classifiers

| Classifier          | Test Error    | Test Accuracy |
|---------------------|---------------|---------------|
| Logistic Regression | (risk: 0.119) | n/a           |
| Linear SVM          | 3.88%         | 96.12%        |
| Kernel SVM          | 9.30%         | 90.70%        |

The linear SVM outperformed the kernel SVM, which wasn't what I expected going in. The kernel SVM maps the data into a higher-dimensional space (degree-3 polynomial features), which should give it more expressive power. But more expressive power also means more room to overfit, and the kernel SVM has far more parameters ($N$ dual coefficients vs. 784 primal weights).

A few things probably contributed to the kernel SVM underperforming:

  1. The grid search wasn’t as thorough. Most of the 120 combinations failed to learn anything. The good region of the hyperparameter space was narrow (learning rate near 1, lambda near 1e-7), and I might have missed better combinations.

  2. 500 iterations might not be enough. The kernel SVM’s loss was still decreasing at iteration 500. The linear SVM had mostly converged by then.

  3. The data might be close to linearly separable. If the two digit classes can be separated by a hyperplane in 784 dimensions with only ~4% error, there isn’t much room for a nonlinear kernel to improve. The kernel adds complexity without enough payoff.

The linear SVM at 96% is hard to beat for the amount of effort. A simple hyperplane in pixel space, no feature engineering, no normalization needed. Sometimes the simple model wins.

Soft-Margin SVMs: Linear and Kernel
https://theanig.dev/blog/ml-hw4-svm-linear-and-kernel
Author: Anirudh Ganesh
Published: February 20, 2020