Linear SVM
Linear SVM (support vector machine) is a supervised machine learning algorithm used for classification. The algorithm learns a maximal-margin hyperplane separating the classes of data. In simpler terms, we first assume that the two classes of data are separable in the feature space. The algorithm then learns a straight dividing boundary between the two classes such that the distance from the boundary to the nearest points of either class is as large as possible.
The separating boundary is called a hyperplane. The hyperplane in 2D space is a line. The hyperplane in 3D space is a plane.

The data points of each class nearest to the separating (decision) boundary are called the support vectors. The distance from the separating boundary to the support vectors is the margin. In short, an SVM is a maximal-margin classifier. Maximizing the margin makes for better generalization on unseen data.

Linear SVMs are fast to train and evaluate, and are thus used in high-dimensional classification problems. SVMs are used in text classification after the text is pre-processed into a bag of words or embeddings. SVMs are also used in optical character recognition (OCR) for handwritten text.
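For instance, here is a minimal text-classification sketch with scikit-learn's TfidfVectorizer and LinearSVC; the tiny corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: 1 = spam, -1 = not spam
docs = ["free prize, claim now", "meeting at noon tomorrow",
        "win cash instantly", "project status update attached"]
labels = [1, -1, 1, -1]

# Vectorize text into sparse TF-IDF features, then fit a linear SVM
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["claim your free cash prize"]))  # expected: [1]
```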
By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:
- What is the margin in an SVM and why is it important?
- What are the trade-offs in the C parameter in a soft-margin SVM?
- What are the primal and dual formulations of linear SVMs?
- How would you handle class imbalance in SVMs?
- What are the challenges of applying online learning to an SVM?
We'll provide the answers at the end of this lesson.
Overview
Equations
Setup:
We are given $n$ training data vectors $x_i \in \mathbb{R}^d$ and their labels $y_i \in \{-1, +1\}$. During training, the goal is to find $w$ and $b$ such that $w^\top x_i + b$ has the same sign as $y_i$ for as many samples as possible.
Note that $d$ is the size of the feature vector for each data point (the dimension of the feature space).
Objective function:
The objective is to minimize $\tfrac{1}{2}\lVert w \rVert^2$ subject to the constraint $y_i(w^\top x_i + b) \ge 1$ for all $i = 1, \dots, n$.
Geometrically speaking, this optimization maximizes the margin (by minimizing $\lVert w \rVert$) while incurring a penalty when a sample is misclassified or falls within the margin boundary.
Combining the objective and the constraint yields the optimization function in the primal form:

$$\min_{w,\, b} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\left(0,\; 1 - y_i\left(w^\top x_i + b\right)\right)$$

Where
- $x_i$: Feature vector
- $y_i$: The label ($+1$ or $-1$)
- $w$: Weight vector
- $b$: Bias term
- $C$: Regularization factor
- $n$: Number of observations
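To make the objective concrete, here is a small NumPy sketch that evaluates the primal objective for a candidate $w$ and $b$ (all values illustrative):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)               # functional margin of each sample
    hinge = np.maximum(0.0, 1.0 - margins)  # penalty for violations
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Illustrative values: two small clusters and a candidate separator
X = np.array([[1.0, 2.0], [3.0, 3.0], [6.0, 6.0], [7.0, 8.0]])
y = np.array([1, 1, -1, -1])
w = np.array([-0.5, -0.5])
b = 4.5
print(primal_objective(w, b, X, y, C=1.0))  # 0.25: all margins satisfied, only ||w||^2/2 remains
```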
When the data is separable into two non-overlapping groups by a hyperplane in the feature space, we call it the hard-margin scenario. When the classes overlap, we have a soft-margin scenario. The soft-margin scenario is more general and subsumes the hard-margin scenario.
Geometric intuition
Consider a two-dimensional feature space. In the following plot we show the two classes of data points - red data points belong to class 1 and blue data points belong to class -1 (see the colorbar on the edge of the plot).

In this soft-margin scenario, the data points at (3,3) (red) and (6,6) (blue) are the support vectors; they are annotated with an orange circular border. The data points at (4,4) (blue) and (6,5) (red) are margin violators; they are annotated with a purple rectangular border. One can see that they fall on the wrong side of the decision boundary (the orange line). The weight vector (also called the normal vector) uniquely identifies the boundary line: the weight vector is orthogonal (perpendicular) to it.
A linear SVM learns a boundary line whose distance from either class is maximal. As such, SVMs can be sensitive to outliers near the margin or support vectors; these outliers can significantly influence the position of the decision boundary. Concretely, the signed distance of a point $x$ from the boundary is $(w^\top x + b)/\lVert w \rVert$, and the margin width is $2/\lVert w \rVert$.
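A small sketch of these quantities, reusing the illustrative $w$ and $b$ from the snippet above:

```python
import numpy as np

# Illustrative parameters (same candidate separator as above)
w = np.array([-0.5, -0.5])
b = 4.5
norm_w = np.linalg.norm(w)

print("Margin width:", 2.0 / norm_w)  # distance between the two margin boundaries

for point in [np.array([3.0, 3.0]), np.array([6.0, 6.0]), np.array([4.0, 4.0])]:
    # Signed distance: positive on the class +1 side, negative on the class -1 side
    print(point, "->", round((np.dot(w, point) + b) / norm_w, 3))
```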
Primal Problem:
This optimization problem can be solved iteratively using stochastic gradient descent (SGD) when there are a large number of datapoints, the feature space is huge, or online learning of the weight vector is desirable.
Dual Problem:
More commonly, however, this problem is converted into its dual form, which can be solved using quadratic programming when the number of datapoints is limited. The dual problem corresponding to the above primal problem is:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$

subject to $0 \le \alpha_i \le C$ for all $i$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
An intuitive way to see this dual formulation is by thinking about the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$. When all of the $\alpha_i$ are zero, the constraint is trivially satisfied. However, if some $\alpha_i > 0$ for a sample with $y_i = +1$, then there must exist a compensating $\alpha_j > 0$ for some sample with $y_j = -1$. In fact, these non-zero $\alpha_i$ correspond to the support vectors and can be used to compute $w = \sum_{i=1}^{n} \alpha_i y_i x_i$. Note that the quadratic term in the dual problem's objective yields quadratic computational time complexity with respect to the number of datapoints.
A few key points to note about the dual problem include:
- The dual formulation allows for the kernel trick, enabling non-linear classification without explicitly mapping each data point to a higher-dimensional space: only inner products in that space are computed, via a kernel function evaluated in the original, lower-dimensional feature space
- The dual problem leads to sparse solutions, since it forces most of the $\alpha_i$ to be zero
- There exist sequential minimal optimization (SMO) techniques that make solving the dual efficient
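To see the dual solution in practice: scikit-learn's SVC exposes the $\alpha_i y_i$ values for the support vectors as dual_coef_, from which $w$ can be recovered; the toy data below is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 6], [7, 8], [8, 7]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

svc = SVC(kernel='linear', C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only;
# all other training points have alpha_i = 0 (the sparse solution)
print("Support vectors:\n", svc.support_vectors_)
print("alpha_i * y_i:", svc.dual_coef_)

# Recover the primal weights: w = sum_i alpha_i y_i x_i
w = svc.dual_coef_ @ svc.support_vectors_
print("w from dual:", w, "  w from sklearn:", svc.coef_)
```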
Algorithm
We will be learning the weight vector and bias term iteratively using stochastic gradient descent (SGD) for the primal form.
- Select an initial value for the adaptive learning rate, e.g. $\eta_0 = 0.1$
- Initialize the weight vector $w$ and bias term $b$ to small random values
- For each epoch (processing all $n$ input records once), compute $y_i(w^\top x_i + b)$ for each sample
- If this value is $\ge 1$, we update only the weight vector with the regularization step $w \leftarrow w - \eta w$ (this weight decay avoids overfitting)
- If the value is $< 1$, then we update both the weight vector and the bias term: $w \leftarrow w - \eta (w - C y_i x_i)$ and $b \leftarrow b + \eta C y_i$
- At the end of the epoch, adjust the learning rate according to the learning schedule using the decay rate ($\lambda$), typically an exponential reduction: $\eta \leftarrow \eta_0 e^{-\lambda \cdot \text{epoch}}$
- With the updated learning rate, perform another epoch of weight adjustment. At the start of each epoch shuffle the training data in order to alleviate bias in weight updates due to the ordering of input data.
- Repeat until the desired number of epochs have been completed or the weights have converged (according to previously defined criteria). Typical convergence criteria include the magnitude of the weight-vector update falling below a threshold, or the change in the objective function falling below a pre-established numerical value.
- For large datasets consider creating mini-batches so the updates are not performed one datapoint at a time
Serving
After performing the learning steps above, for any new datapoint $x$, we can compute the decision value $w^\top x + b$. The sign of this scalar is the predicted label for the datapoint. Referring back to the earlier plot, enhanced with new test data points (shown as asterisks below), the trained SVM classifies them all as class -1, but with varying certainty based on their decision values as shown here:
```text
Test point 1 at [5.  4.5]: Predicted class -1, Decision value -0.0417
Test point 2 at [5.  5.5]: Predicted class -1, Decision value -0.6250
Test point 3 at [4.5 5. ]: Predicted class -1, Decision value -0.2917
Test point 4 at [5.5 5. ]: Predicted class -1, Decision value -0.3750
```
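In code, serving reduces to a dot product and a sign check. A minimal sketch, using hand-picked parameters ($w = [-1/12, -7/12]$, $b = 3$, chosen here to reproduce the decision values above; the actual trained values come from svc.coef_ and svc.intercept_ in the code further below):

```python
import numpy as np

# Illustrative parameters consistent with the decision values printed above
w = np.array([-1/12, -7/12])
b = 3.0

X_test = np.array([[5, 4.5], [5, 5.5], [4.5, 5], [5.5, 5]])
for i, x in enumerate(X_test, start=1):
    decision = np.dot(w, x) + b          # w^T x + b
    label = 1 if decision >= 0 else -1   # sign gives the predicted class
    print(f"Test point {i} at {x}: Predicted class {label}, Decision value {decision:.4f}")
```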

Pseudocode
From the outlined algorithm, we can construct a pseudocode for linear SVM as follows:
```text
function TrainLinearSVM(X, y, C, initial_learning_rate, decay_rate, max_epochs, convergence_threshold, batch_size):
    // Initialize parameters
    n_features = number of features in X
    n_samples = number of samples in X
    w = small random values of size n_features
    b = small random value
    learning_rate = initial_learning_rate
    epoch = 0
    converged = false

    while epoch < max_epochs and not converged:
        // Shuffle the training data
        shuffle(X, y)

        // Create mini-batches
        mini_batches = create_mini_batches(X, y, batch_size)

        for mini_batch in mini_batches:
            for (xi, yi) in mini_batch:
                // Compute decision function
                decision = yi * (dot_product(w, xi) + b)

                if decision >= 1:
                    // Update only w (weight decay)
                    w = w - learning_rate * w
                else:
                    // Update both w and b
                    w = w - learning_rate * (w - C * yi * xi)
                    b = b + learning_rate * C * yi

        // Update learning rate
        learning_rate = initial_learning_rate * exp(-decay_rate * epoch)

        // Check for convergence
        if check_convergence(w, b, X, y, C, convergence_threshold):
            converged = true
        epoch = epoch + 1

    return w, b

function check_convergence(w, b, X, y, C, threshold):
    // Compute the primal objective; previous_objective persists between calls
    objective = 0.5 * dot_product(w, w) + C * sum_i max(0, 1 - y_i * (dot_product(w, x_i) + b))
    if abs(objective - previous_objective) < threshold:
        return true
    else:
        previous_objective = objective
        return false

function create_mini_batches(X, y, batch_size):
    // Split (X, y) into mini-batches of the specified size
    // Return list of (X_batch, y_batch) tuples

// Main program
X, y = load_data()
C = 1.0  // Regularization parameter
initial_learning_rate = 0.1
decay_rate = 0.01
max_epochs = 1000
convergence_threshold = 1e-5
batch_size = 32

w, b = TrainLinearSVM(X, y, C, initial_learning_rate, decay_rate, max_epochs, convergence_threshold, batch_size)
```
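The pseudocode translates almost directly into NumPy. Below is a minimal runnable sketch of the per-sample SGD loop (mini-batching and convergence checks omitted; hyperparameter values are illustrative):

```python
import numpy as np

def train_linear_svm_sgd(X, y, C=1.0, eta0=0.1, decay=0.01, n_epochs=200, seed=0):
    """Minimal SGD trainer for the primal linear SVM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = rng.normal(scale=0.01, size=n_features)  # small random initialization
    b = float(rng.normal(scale=0.01))
    for epoch in range(n_epochs):
        eta = eta0 * np.exp(-decay * epoch)       # exponential learning-rate decay
        for i in rng.permutation(n_samples):      # shuffle each epoch
            if y[i] * (X[i] @ w + b) >= 1:
                w -= eta * w                      # margin satisfied: weight decay only
            else:
                w -= eta * (w - C * y[i] * X[i])  # margin violated: hinge-loss step
                b += eta * C * y[i]
    return w, b

# Illustrative usage on two small clusters
X = np.array([[1, 2], [2, 3], [3, 3], [6, 6], [7, 8], [8, 7]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm_sgd(X, y)
print(np.sign(X @ w + b))  # should typically match y: [1, 1, 1, -1, -1, -1]
```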
Evaluation
As with most other machine learning techniques, a linear SVM's goodness of fit is measured using a validation dataset:
- Pick an initial set of hyperparameters and learn an SVM
- Validate the results on the validation dataset
- Repeat steps 1 and 2 with varying hyperparameters using a randomized or grid-based search. Select the best-performing model on the validation dataset as measured by precision-recall curves, ROC curves, or F1 scores, depending on the specific needs of the problem domain. This is for offline evaluation and selection of the model for productionisation (a sketch follows below)
- A standard approach for online evaluation is to use A/B testing, where the current linear SVM is pitched against a baseline or previous model on a small portion of users/data.
You may read more about supervised learning evaluation metrics here.
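As a sketch of the offline selection loop above, scikit-learn's GridSearchCV automates the hyperparameter sweep; the synthetic dataset, parameter grid, and F1 scoring below are illustrative choices:

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}  # illustrative grid
search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, scoring='f1', cv=5)
search.fit(X, y)
print("Best C:", search.best_params_, " best F1:", search.best_score_)
```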
Limitations
Limitation 1: Inability to deal with non-linearly separable data
Mitigation: Use a non-linear SVM: apply the kernel trick and perform dual-form learning
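As a sketch of this mitigation, scikit-learn's SVC with an RBF kernel handles XOR-style data that no straight line can separate (data illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no straight line separates the classes (illustrative)
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])

# The RBF kernel implicitly maps points to a higher-dimensional space
rbf_svm = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, y)
print(rbf_svm.predict(X))  # recovers the XOR pattern: [ 1  1 -1 -1]
```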

Limitation 2: Does not produce probabilistic outputs
Mitigation: Use calibration methods (such as Platt scaling) or use other ML algorithms such as logistic regression
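A sketch of this mitigation using scikit-learn's CalibratedClassifierCV, which fits Platt scaling (a sigmoid) on held-out folds; the synthetic dataset is illustrative:

```python
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

# Illustrative synthetic dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Wrap the SVM; method='sigmoid' performs Platt scaling
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000), method='sigmoid', cv=5)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:3]))  # per-class probabilities
```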
Employing linear SVM
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Create a small 2D dataset
X = np.array([
    [1, 2], [2, 3], [3, 3], [2, 1], [3, 2],   # Class 1
    [6, 6], [7, 8], [8, 7], [7, 6], [8, 6],   # Class 2
    [6, 5], [4, 4],                            # Points within the margin or on the wrong side
])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1])

# Train a linear SVM with a soft margin
svc = SVC(kernel='linear', C=0.5)
svc.fit(X, y)

# Extract weight vector (w) and bias (b)
w = svc.coef_[0]
b = svc.intercept_[0]

# Calculate decision function values for all points
y_decision = svc.decision_function(X)

# Identify margin violations (points strictly within the margin)
margin_violations = np.abs(y_decision) < 1 - 1e-10

# Create new test points
X_test = np.array([[5, 4.5], [5, 5.5], [4.5, 5], [5.5, 5]])
y_test_pred = svc.predict(X_test)
y_test_decision = svc.decision_function(X_test)

# Print test point classifications
for i, (x, y_pred, y_dec) in enumerate(zip(X_test, y_test_pred, y_test_decision)):
    print(f"Test point {i+1} at {x}: Predicted class {y_pred}, Decision value {y_dec:.4f}")
```
Common questions
Q (Junior Engineer): What is the margin in an SVM and why is it important?
A: The margin is the distance between the separating boundary and the nearest datapoints of either class in an SVM. The larger the margin, the more robust the linear SVM classifier, and the better it tends to generalize to unseen data.
Q (Mid-level Engineer): What are the trade-offs in the C parameter in a soft-margin SVM?
A: The C parameter (a hyperparameter) in an SVM balances maximizing the margin against minimizing classification errors. Larger C values enforce stricter classification and possibly narrower margins between the classes; smaller C values tolerate more violations in exchange for a wider margin.
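One way to see this trade-off concretely is to compare the margin width $2/\lVert w \rVert$ across C values; a sketch reusing the toy data from the earlier plot (exact numbers will vary):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data from the earlier plot, including margin violators
X = np.array([[1, 2], [2, 3], [3, 3], [6, 6], [7, 8], [8, 7], [6, 5], [4, 4]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, -1])

for C in [0.01, 1.0, 100.0]:
    svc = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(svc.coef_[0])  # margin width = 2 / ||w||
    print(f"C={C}: margin width = {margin:.3f}")  # typically shrinks as C grows
```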
Q (Mid-level Engineer): Discuss primal and dual formulations of linear SVMs
A: The primal formulation optimizes in the space of weight vectors and biases. In the dual formulation, the optimization is performed over the Lagrange multipliers (the $\alpha_i$ in this article). The primal form is preferred for large datasets, while the dual form is preferred in scenarios where non-linearity and kernel methods need to be applied.
Q (Senior Engineer): How would you handle class imbalance in SVMs?
A: Classic techniques such as oversampling or undersampling can be applied in such scenarios. In addition, class weights can be applied so that errors on the minority class are penalized more heavily. A further technique is to shift the decision boundary away from the majority class in proportion to its representation.
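In scikit-learn, class weighting is a one-line change; a sketch with class_weight='balanced', which scales each class's effective C inversely to its frequency (dataset illustrative):

```python
from collections import Counter
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# Imbalanced dataset: roughly 90% of samples in class 0 (illustrative)
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
print("Class counts:", Counter(y))

# 'balanced' sets each class weight to n_samples / (n_classes * class_count),
# so errors on the minority class are penalized more heavily
svm = LinearSVC(class_weight='balanced', max_iter=10000).fit(X, y)
```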
Q (Senior Engineer): What are the challenges of applying online learning to an SVM?
A: Stochastic gradient descent (SGD) can be applied for online learning of an SVM. The challenges include selecting a learning rate for weight adjustment, continued model validation, and detecting and adjusting for potential concept drift. There is always a tension between model stability and adaptability. If an SVM model is used for online serving, it is essential to perform rigorous online evaluation (A/B testing) before deployment.
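For online learning in practice, scikit-learn's SGDClassifier with hinge loss is a linear SVM trained by SGD, and it supports incremental updates via partial_fit; a sketch simulating a stream with mini-batches of synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Synthetic data standing in for an incoming stream (illustrative)
X, y = make_classification(n_samples=1000, random_state=0)
classes = np.unique(y)  # partial_fit needs all classes up front

# loss='hinge' makes SGDClassifier a linear SVM trained by SGD
svm = SGDClassifier(loss='hinge')
for start in range(0, len(X), 100):
    # Incremental update on each arriving mini-batch
    svm.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)
print("Accuracy on seen data:", svm.score(X, y))
```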
Extra: SVM vs. logistic regression
SVMs & logistic regression models are often compared because they are both widely used for binary classification. Here are some reasons you would use one over the other:
- With respect to outlier robustness, SVM is generally more robust to outliers than logistic regression. SVM focuses on the support vectors, which define the decision boundary, largely ignoring data points far from the margin. Logistic regression considers all data points in its optimization, making it more susceptible to outliers across the entire dataset.
- SVMs can be more sample efficient once the support vectors are determined, as new samples far from the margin have little impact on the model. However, SVMs still need a representative sample to accurately learn the support vectors initially.
- Both SVMs and logistic regression can be sensitive to features with widely varying magnitudes; and so feature scaling is important for both algorithms.
- SVMs are generally more tolerant of multicollinear features compared to logistic regression. Logistic regression can suffer from instability and interpretability issues in such cases.
- Both models can struggle with concept drift, where the underlying data distribution changes over time. Regular retraining or online learning variants are necessary to adapt to significant changes in data patterns.
- Logistic regression naturally provides probabilistic outputs. SVMs don't directly provide probabilities, though calibration methods (like Platt scaling) can be used to convert SVM outputs to probabilities.
- Logistic regression coefficients are more interpretable in terms of feature importance. SVM decisions, especially with non-linear kernels, are less interpretable.