[UDL Study Notes] Ch 7 - Gradients and initialization

Overview

This series of posts is a set of study notes recording my progress through the book "Understanding Deep Learning".

This time, I will cover Chapter 7, Gradients and initialization.

https://udlbook.github.io/udlbook/

1. Backward pass

Chapter 7 introduces the backward pass, which computes gradients by moving through the network in the opposite direction to the forward pass, the computation a model normally performs at inference time.

At first, the idea of starting from the loss and working backward, computing the partial derivative needed to adjust each parameter, was very surprising to me.

I also wondered how the parameters could be differentiated at all, since a computer stores them as discrete values. I learned that the parameters are treated as real numbers and the loss as a function of them, and that the building blocks are easy to differentiate: a linear layer has a constant slope with respect to its input, and ReLU is piecewise linear, so its slope is either 0 or 1.

In addition, the derivatives with respect to each layer are obtained step by step from the loss back toward the input, and the derivatives with respect to each layer's parameters branch off from that layer's result. This structure reminded me of a tree data structure.
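To make the step-by-step idea concrete, here is a minimal sketch of a hand-written backward pass for a tiny scalar network (one linear unit, ReLU, one linear unit, squared-error loss). The network, the parameter values, and the function names are my own assumptions for illustration, not code from the book. A finite-difference check is included only to confirm that the hand-derived gradients agree.

```python
def forward(params, x, target):
    """Forward pass; returns the loss and the intermediate values."""
    w1, b1, w2, b2 = params
    f = w1 * x + b1            # pre-activation of the hidden unit
    h = max(f, 0.0)            # ReLU
    y = w2 * h + b2            # network output
    loss = (y - target) ** 2   # squared-error loss
    return loss, (f, h, y)


def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss toward the input."""
    w1, b1, w2, b2 = params
    _, (f, h, y) = forward(params, x, target)
    dl_dy = 2.0 * (y - target)               # loss -> output
    dl_dw2 = dl_dy * h                       # parameters of the last layer
    dl_db2 = dl_dy
    dl_dh = dl_dy * w2                       # output -> hidden activation
    dl_df = dl_dh * (1.0 if f > 0 else 0.0)  # through ReLU (slope 0 or 1)
    dl_dw1 = dl_df * x                       # parameters of the first layer
    dl_db1 = dl_df
    return [dl_dw1, dl_db1, dl_dw2, dl_db2]


def finite_difference(params, x, target, eps=1e-6):
    """Central-difference estimate of each partial derivative, for checking."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        diff = forward(up, x, target)[0] - forward(down, x, target)[0]
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical parameter values and data point, chosen only for illustration.
params = [0.5, 0.1, -1.5, 0.2]
x, target = 2.0, 1.0
analytic = backward(params, x, target)
numeric = finite_difference(params, x, target)
print(analytic)
print(numeric)
```

Each line in `backward` reuses the derivative computed on the line before it, which is the reason the pass runs from the loss back toward the input.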

2. Matrix Calculus

In the backward pass formulas and in the problems at the end of Chapter 7, I often saw notation for partial derivatives with respect to vectors or matrices, which I found difficult to follow while reading the book.

So I looked for more information and found a topic called matrix calculus.

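Matrix calculus is a notation for organizing the partial derivatives of multivariable calculus when the quantities involved are vectors or matrices.

The meaning of each notation depends on the shapes involved: differentiating a scalar by a vector gives a vector of partial derivatives (the gradient), differentiating a scalar by a matrix gives a matrix of partial derivatives, and differentiating a vector by a vector gives a matrix of partial derivatives.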

While reading Chapter 7, I was often confused by the notation for differentiating a vector by a vector, but I found out that it is simply the Jacobian matrix: the matrix holding the partial derivative of every component of one vector with respect to every component of the other.
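As a small check of this idea, the sketch below estimates the Jacobian of a linear layer y = W x + b numerically and compares it with W. The helper names and the example values are assumptions made for illustration. The code uses the convention J[i][j] = dy_i/dx_j; layout conventions differ between texts, so a given source may write the transpose of this matrix.

```python
def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]


def numerical_jacobian(func, x, eps=1e-6):
    """Estimate J[i][j] = d y_i / d x_j with central differences."""
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        y_up, y_down = func(up), func(down)
        for i in range(n_out):
            jac[i][j] = (y_up[i] - y_down[i]) / (2 * eps)
    return jac


# Hypothetical 2x3 weight matrix: 3 inputs, 2 outputs.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [0.5, -0.5]
x = [0.1, 0.2, 0.3]

J = numerical_jacobian(lambda v: linear(W, b, v), x)
for row in J:
    print(row)
```

Under this convention the Jacobian of a linear layer with respect to its input has one row per output and one column per input, and its entries match the weight matrix.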

3. Variance of initialization

When initializing the parameters of a model, I initially thought it would be fine to just set them randomly, but then I learned that a poor choice leads to the vanishing gradient problem or the exploding gradient problem.

When I first encountered these problems in the book, I inferred that the variance of the initial weights must be set to an appropriate value, because problems arise if it is either too small or too large.

In addition, the book explains that ReLU clips the negative half of the previous layer's pre-activations to zero, so with a poorly chosen variance the values shrink or grow from layer to layer until they vanish or diverge. From this I inferred that, even though ReLU clips half of the values, the problem can be solved by choosing the variance so that the spread of values stays the same from one layer to the next.

I was also able to confirm the actual result: if the previous and next layers have the same dimension D_h, He initialization sets the variance of the weights to 2/D_h, which keeps the variance the same from one layer to the next.
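The effect of the variance can be seen in a small experiment. The sketch below pushes a random input through a stack of ReLU layers with zero biases and weights drawn from a normal distribution with variance scale/D, where D is the layer width and a scale of 2 corresponds to He initialization. The width, depth, seed, and function names are my own assumptions, chosen only to keep the example fast in plain Python.

```python
import math
import random


def relu_layer(weights, h):
    """Apply a bias-free linear layer followed by ReLU."""
    return [max(sum(w * v for w, v in zip(row, h)), 0.0) for row in weights]


def second_moment(values):
    """Mean of the squared values, used here as the measure of spread."""
    return sum(v * v for v in values) / len(values)


def propagate(variance_scale, width=100, depth=20, seed=0):
    """Return (final second moment) / (input second moment).

    Weights are drawn from N(0, variance_scale / width).
    """
    rng = random.Random(seed)
    std = math.sqrt(variance_scale / width)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = second_moment(h)
    for _ in range(depth):
        weights = [[rng.gauss(0.0, std) for _ in range(width)]
                   for _ in range(width)]
        h = relu_layer(weights, h)
    return second_moment(h) / start


results = {
    "small": propagate(1.0),  # variance 1/D: too small
    "he": propagate(2.0),     # variance 2/D: He initialization
    "large": propagate(4.0),  # variance 4/D: too large
}
for name, ratio in results.items():
    print(name, ratio)
```

With a scale of 1 the signal is expected to halve at every layer and with a scale of 4 to double, while a scale of 2 keeps it at roughly the same order of magnitude. The same reasoning applies to the gradients flowing in the backward direction.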

4. Example code and framework

The second half of Chapter 7 presents example code that trains a model, bringing together the contents of the previous sections.

Previously, PyTorch code like this looked like a black box to me because I didn't know what each line did, but after reading the book up to Chapter 7, I could roughly guess what each line does.

It was also amazing that modern deep learning frameworks such as PyTorch or TensorFlow can compute exact derivatives, up to floating-point rounding, even though they only do numerical computation. They do this with automatic differentiation, applying the chain rule to each operation, rather than approximating the slope with finite differences.

I could see the convenience of the framework in that the difficult partial derivatives that appeared in Chapter 7 can all be computed with a single call, loss.backward(), in PyTorch.

And since I now have a sense of how many parameters go into one model and how much computation it takes to find the partial derivative for each of them, I could understand why deep learning models require so many computing resources.
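As a minimal sketch of what that single call does, the example below builds a small two-layer network in PyTorch, computes a loss on random data, and calls loss.backward(). The layer sizes, the random data, and the variable names are assumptions made for illustration and are not the book's example.

```python
import torch

torch.manual_seed(0)

# A small network: 3 inputs -> 4 hidden units -> 1 output (sizes are arbitrary).
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 1),
)

# Random batch of 8 examples, standing in for real training data.
x = torch.randn(8, 3)
target = torch.randn(8, 1)

# Forward pass and loss.
loss = torch.nn.functional.mse_loss(model(x), target)

# Backward pass: fills the .grad attribute of every parameter.
loss.backward()

# Each gradient has the same shape as the parameter it belongs to.
grad_shapes = {name: tuple(p.grad.shape) for name, p in model.named_parameters()}
print(grad_shapes)
```

After the call, every weight matrix and bias vector has a gradient of its own shape, which is what an optimizer then uses to update the parameters.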
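To get a feel for the numbers, this short sketch counts the weights and biases of a fully connected network. The layer sizes are a hypothetical example and the function name is my own.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    A layer mapping d_in inputs to d_out outputs has d_in * d_out weights
    and d_out biases.
    """
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical network: 784 inputs, two hidden layers of 256 units, 10 outputs.
layer_sizes = [784, 256, 256, 10]
total = count_parameters(layer_sizes)
print(total)
```

Even this small network has a few hundred thousand parameters, and each of them needs its own partial derivative at every training step.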

Reference

[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com