The Motivation of Regularization: Overfitting

Intuition

Overfitting happens when a model performs very well on the training data but poorly on unseen data

Algebraic Perspective

Think of our model as solving a linear system

  • Equations (constraints): The training data points
  • Unknowns (parameters): The weights of the model

When the number of unknowns exceeds the number of equations, the linear system has infinitely many solutions (it is underdetermined)

Concretely, when we have more parameters than training data points, the model can fit the training data in many different ways. The solution the model ends up choosing may perform poorly on unseen data, and we call this phenomenon “overfitting”
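
This underdetermined setup can be seen directly in NumPy. The sketch below (toy numbers, purely illustrative) builds a system with 2 "training points" and 4 parameters, then constructs two different weight vectors that both fit the training data exactly:

```python
import numpy as np

# Hypothetical toy system: 2 equations (training points), 4 unknowns (weights)
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 0.0, 1.0]])   # "training data"
b = np.array([10.0, 4.0])              # targets

# One exact solution: the minimum-norm least-squares solution
w_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)

# Another exact solution: add any vector from the null space of A
Vt = np.linalg.svd(A)[2]
w_other = w_min_norm + 5.0 * Vt[2]     # Vt rows 2,3 span the null space

# Both fit the training data perfectly, yet the weights differ
print(np.allclose(A @ w_min_norm, b))  # True
print(np.allclose(A @ w_other, b))     # True
```

Both weight vectors are perfect on the training set, so "fit the training data" alone cannot decide between them; regularization supplies the missing preference.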


Regularization

Introduction

Regularization adds a penalty term to the loss function that depends only on the weights. It encodes a preference for how we want our model to behave.
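
A minimal sketch of this idea (function and variable names are illustrative): the total loss is the data-fit term plus a penalty on the weights, scaled by a strength hyperparameter λ.

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Total loss = data loss + lam * penalty(w)."""
    residual = X @ w - y
    data_loss = np.mean(residual ** 2)   # data term: mean squared error
    if penalty == "l1":
        reg = np.sum(np.abs(w))          # L1 penalty: sum of |w_i|
    else:
        reg = np.sum(w ** 2)             # L2 penalty: sum of w_i^2
    return data_loss + lam * reg
```

With `lam = 0` this reduces to the unregularized loss; larger `lam` trades training fit for smaller weights.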

Different Regularization Methods

L1 Regularization

Key Effect: Creates Sparse Models

  • Feature Selection: Drives many weights to exactly zero, automatically selecting the most important features
  • Interpretability: Results in simpler models that are easier to understand since irrelevant features are eliminated
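
The "exactly zero" behavior can be illustrated with the soft-thresholding (proximal) step that underlies many L1 solvers; this is a sketch of the mechanism, not a full training loop:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 penalty: shrinks weights toward
    zero and snaps any weight with |w_i| <= lam to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.9, 0.05, -0.4, -0.02])
print(soft_threshold(w, 0.1))   # [ 0.8  0.  -0.3  0. ]
```

Small weights (the "irrelevant features") land at exactly zero, which is the source of L1's sparsity and feature-selection effect.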

L2 Regularization

Key Effect: Shrinks All Weights Smoothly

  • Proportional Shrinkage: Reduces all weights gradually rather than eliminating them completely
  • Distributed Influence: Prefers using many features with small weights rather than few features with large weights
    • Example: for the same predictions, L2 favors w = (0.5, 0.5) over w = (1, 0), since ‖w‖² = 0.5 < 1
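
The smooth-shrinkage behavior is also visible in a single gradient step on the L2 penalty alone (toy numbers, illustrative only): since the gradient of λ‖w‖² is 2λw, the step scales every weight by the same factor instead of zeroing any of them.

```python
import numpy as np

w = np.array([1.0, 0.5, -0.25])
lr, lam = 0.1, 0.5

# Gradient step on the penalty lam * ||w||^2 alone; its gradient is 2*lam*w,
# so the update is multiplicative "weight decay" by (1 - 2*lr*lam) = 0.9
w_decayed = w - lr * (2 * lam * w)
print(w_decayed)   # [ 0.9    0.45  -0.225]
```

Every weight shrinks proportionally and none reaches exactly zero, which is why L2 spreads influence across many small weights rather than selecting a few.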

Quick Summary

  • L1: “Pick the best features” (sparse, interpretable)
  • L2: “Use all features, but gently” (smooth, distributed)