[UDL Study Notes] Ch 9 - Regularization

Overview

This series of posts is a set of study notes recording what I learn as I work through the book "Understanding Deep Learning".

This time, I will cover Chapter 9, Regularization.

https://udlbook.github.io/udlbook/

1. The Need for Regularization

In Chapter 8, we discussed methods for measuring model performance and saw that a performance gap between training and test data can occur.

This happens because the model may fit noise and other statistical quirks of the training data that do not reflect the true input/output relationship (overfitting), or because it is unconstrained in regions where no training data exists and can therefore make nonsensical predictions there.

Regularization refers to a family of methods that reduce this gap and improve generalization. The chapter covers three kinds: adding terms to the loss function (explicit regularization), effects induced by the learning algorithm itself (implicit regularization), and various heuristic methods.

2. Explicit Regularization

The most basic regularization method is to add a penalty term to the loss function.

For example, L2 regularization (often called weight decay) penalizes the sum of squared weights, which discourages excessively large weights, while L1 regularization penalizes the sum of absolute weights and encourages sparsity.
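To make the penalty term concrete for myself, here is a minimal sketch in plain Python. It fits a small linear model by gradient descent on the mean squared error plus `lam * sum(w_j^2)`. The function name `fit_linear`, the toy data, and the values of `lam`, `lr`, and `steps` are my own choices for illustration, not something taken from the book.

```python
import random

def fit_linear(xs, ys, lam=0.0, lr=0.05, steps=2000):
    """Fit y ~ w . x by gradient descent on MSE + lam * sum(w_j^2)."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residuals of the current linear prediction on every example.
        res = [sum(wj * xj for wj, xj in zip(w, x)) - y for x, y in zip(xs, ys)]
        for j in range(d):
            grad_data = 2.0 * sum(r * x[j] for r, x in zip(res, xs)) / n
            grad_penalty = 2.0 * lam * w[j]  # derivative of lam * w_j^2
            w[j] -= lr * (grad_data + grad_penalty)
    return w

def sq_norm(w):
    """Sum of squared weights, i.e. the quantity the L2 term penalizes."""
    return sum(wj * wj for wj in w)

# Hypothetical toy data: a noisy linear relationship.
random.seed(0)
true_w = [2.0, -1.0, 0.5]
xs = [[random.gauss(0, 1) for _ in true_w] for _ in range(40)]
ys = [sum(t * xj for t, xj in zip(true_w, x)) + random.gauss(0, 0.1) for x in xs]

w_plain = fit_linear(xs, ys, lam=0.0)  # no regularization
w_l2 = fit_linear(xs, ys, lam=1.0)     # with the L2 penalty

print("squared norm without penalty:", sq_norm(w_plain))
print("squared norm with penalty:   ", sq_norm(w_l2))
```

With the penalty switched on, the fitted weights are pulled toward zero, so their squared norm comes out smaller than in the unregularized fit. An L1 version would replace `grad_penalty` with `lam * sign(w[j])`, though plain gradient descent handles the kink at zero less gracefully.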

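From a probabilistic viewpoint, the penalty term corresponds to a prior distribution over the parameters, so minimizing the regularized loss can be interpreted as MAP (Maximum A Posteriori) estimation, which combines the likelihood of the data with our prior beliefs about the parameters.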
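Written out, the connection looks roughly as follows. This is my own summary of the standard argument, with the notation (parameters $\phi$, training pairs $\{x_i, y_i\}_{i=1}^{I}$, prior variance $\sigma_p^2$) chosen for illustration:

$$
\hat{\phi}
= \underset{\phi}{\operatorname{argmax}} \left[ \prod_{i=1}^{I} Pr(y_i \mid x_i, \phi) \, Pr(\phi) \right]
= \underset{\phi}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \log Pr(y_i \mid x_i, \phi) - \log Pr(\phi) \right]
$$

The first term is the usual negative log-likelihood loss and the second plays the role of the regularizer. Assuming a zero-mean Gaussian prior with variance $\sigma_p^2$ on each parameter,

$$
-\log Pr(\phi) = \frac{1}{2\sigma_p^2} \sum_j \phi_j^2 + \text{const},
$$

which is exactly an L2 penalty with coefficient $\lambda = 1/(2\sigma_p^2)$. A narrower prior (smaller $\sigma_p^2$) therefore means stronger regularization.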

3. Implicit Regularization

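An interesting point is that the optimization algorithm itself, gradient descent and especially SGD, produces a regularization effect.

Idealized gradient descent with an infinitesimal step size follows a continuous path downhill. With a finite (discrete) step size the trajectory deviates from that path, and the deviation makes the algorithm prefer certain solutions, namely those where the gradient norm is small. This is called implicit regularization.

SGD adds a further effect: it prefers solutions where the gradients of the individual batches agree with one another, that is, where all batches are fit about equally well rather than some well and others poorly. This may help explain why training with smaller batches often generalizes better.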
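As I understand the chapter, these effects can be written as extra terms in a "modified" loss that the discrete algorithm effectively follows. The expressions below are written from my reading, with $\alpha$ the learning rate, $L$ the full-batch loss, $L_b$ the loss on batch $b$, and $B$ the number of batches; the exact constants should be checked against the book.

$$
\tilde{L}_{GD}[\phi] = L[\phi] + \frac{\alpha}{4} \left\| \frac{\partial L}{\partial \phi} \right\|^2
$$

$$
\tilde{L}_{SGD}[\phi] = \tilde{L}_{GD}[\phi] + \frac{\alpha}{4B} \sum_{b=1}^{B} \left\| \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\|^2
$$

The first extra term penalizes steep regions of the loss surface, and the second penalizes disagreement between the batch gradients and the full gradient. Both terms scale with $\alpha$, so a larger learning rate means a stronger implicit effect.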

4. Heuristics

In addition to explicit and implicit regularization, the chapter introduces several heuristic techniques that have been found empirically to improve generalization:

Early Stopping: Stop training before full convergence, typically when validation performance stops improving, so the model does not go on to fit noise.

Ensembling: Average or vote over multiple models’ outputs to improve performance.

Dropout: Randomly set a subset of hidden units to zero at each training iteration to reduce dependency on specific units.

Adding Noise: Add noise to inputs, weights, or labels to train a more robust model.

Bayesian Approach: Treat parameters as random variables and account for uncertainty (though approximations are needed).

Transfer Learning / Multi-task Learning: Transfer knowledge from other datasets or related tasks.

Self-supervised Learning: Generate pseudo-labels from unlabeled data for pretraining.
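Two of the items above are simple enough to sketch in a few lines of plain Python. The code below is my own illustration, not the book's: `dropout` uses the common "inverted" variant, which rescales the surviving units during training (an assumption on my part; rescaling at test time is an equivalent alternative), and `early_stopping_epoch` applies a hypothetical patience rule to a made-up list of validation losses.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Dividing by `keep` leaves the expected activation unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) under a simple patience rule.

    Training stops at the first epoch that is `patience` or more epochs past
    the best validation loss seen so far without improving on it.
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Dropout on a hypothetical layer of 1000 units that all output 1.0.
rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, p_drop=0.5, rng=rng)
unchanged = dropout(hidden, p_drop=0.5, rng=rng, training=False)

# Made-up validation curve: improves, then starts to overfit.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice one would keep a copy of the parameters from `best_epoch` and restore them after stopping, since the final epochs before the stop are already slightly worse on the validation set.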

Ultimately, these various methods can be grouped into four principles: make the model smoother, increase the amount of data, combine multiple models, and converge to wider minima.

Thoughts

Regularization is not just a technique of adding terms to the loss function. It also includes:

Characteristics of the learning algorithm itself (implicit regularization)

The point at which training is stopped (early stopping)

How the model architecture and data are utilized (transfer learning, self-supervised learning)

I realized that regularization is a comprehensive strategy for protecting performance on unseen data, rather than only improving the fit to the training data. This makes it a crucial topic to keep in mind in every real-world project.

Reference

[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com