[UDL Study Notes] Ch 3 - Shallow neural networks

Overview

This series of posts is a set of study notes recording my progress through the book “Understanding Deep Learning”.

This post covers Chapter 3, Shallow neural networks.

https://udlbook.github.io/udlbook/

1. Linear region

When I first encountered the concept of a “linear region” in Chapter 3, I misunderstood it; this section records what I got wrong and the corrected understanding.

For a shallow neural network model like the one above, the book says the maximum number of linear regions is 4. At first I read this as: "Since there are 4 linear equations, there are 4 linear regions."

However, after working through the surrounding text and graphs, I realized that linear regions do not come from counting equations. They arise geometrically, from the boundaries where a hidden unit's ReLU switches between inactive and active.

As I now understand it, a linear region is a part of the input space on which the model's output is a linear function of the input, and the number of linear regions is the number of such parts that the boundaries divide the input space into.

With a one-dimensional input, the linear regions are line segments separated by a few points (the joints); with a two-dimensional input, they are planar regions (convex polygons) separated by a few lines.

In the 2D model above, you can see a total of 7 linear regions, each appearing as a flat planar piece.
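As a quick check of the counts 4 and 7, here is a minimal sketch. It assumes both models above have three hidden units (which is what those counts suggest) and uses the standard bound for hyperplane arrangements: D hidden units with a D_i-dimensional input give at most the sum of C(D, j) for j = 0..D_i regions. The function names and the example parameters `example_thetas` are hypothetical, chosen only for illustration.

```python
from math import comb


def max_linear_regions(num_hidden_units, input_dim):
    """Upper bound on the regions cut out by `num_hidden_units`
    hyperplanes in an `input_dim`-dimensional input space."""
    return sum(comb(num_hidden_units, j) for j in range(input_dim + 1))


def count_regions_1d(thetas, x_min, x_max, steps=2001):
    """Count distinct on/off patterns of ReLU hidden units along a 1D grid.

    Each hidden unit is (theta0, theta1) with pre-activation theta0 + theta1 * x.
    With distinct joints, each pattern corresponds to one interval between joints.
    """
    patterns = set()
    for i in range(steps):
        x = x_min + (x_max - x_min) * i / (steps - 1)
        patterns.add(tuple(t0 + t1 * x > 0 for t0, t1 in thetas))
    return len(patterns)


# Hypothetical parameters: three hidden units whose joints fall inside [0, 2].
example_thetas = [(-0.2, 0.4), (-0.9, 0.9), (1.1, -0.7)]

print(max_linear_regions(3, 1))                     # bound for a 1D input
print(max_linear_regions(3, 2))                     # bound for a 2D input
print(count_regions_1d(example_thetas, 0.0, 2.0))   # patterns seen on the grid
```

The bound is a maximum: a network only reaches it when the boundaries are in general position, and two neighbouring regions can still share the same slope if an output weight happens to be zero.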

2. Matrix Notation

Looking at the equations for the general shallow neural network model, I found them hard to take in systematically when written out term by term.

Since they closely resemble a matrix equation, I wanted to rewrite them in matrix form.

The equation for the shallow neural network model organized into a matrix is as follows:
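One way to write it, as a sketch under stated assumptions: the scalar parameters $\theta_{di}$ and $\phi_{jd}$, the activation $\mathrm{a}[\cdot]$, and the sizes $D_i$, $D$, $D_o$ follow the book's general shallow network, while the bold vectors and the matrix names $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ are an assumed grouping for illustration and may differ from the notation originally used here.

```latex
% Element-wise form of the general shallow network
h_d = \mathrm{a}\!\left[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di}\, x_i\right],
\qquad
y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd}\, h_d

% Matrix form: \mathbf{x} \in \mathbb{R}^{D_i},\ \mathbf{h} \in \mathbb{R}^{D},\ \mathbf{y} \in \mathbb{R}^{D_o}
% \boldsymbol{\Theta} is D \times D_i with entries \theta_{di}; \boldsymbol{\theta}_0 has entries \theta_{d0}
% \boldsymbol{\Phi} is D_o \times D with entries \phi_{jd}; \boldsymbol{\phi}_0 has entries \phi_{j0}
\begin{aligned}
\mathbf{h} &= \mathrm{a}\!\left[\boldsymbol{\theta}_0 + \boldsymbol{\Theta}\,\mathbf{x}\right], \\
\mathbf{y} &= \boldsymbol{\phi}_0 + \boldsymbol{\Phi}\,\mathbf{h}
\end{aligned}
```

Here the activation is applied to each element of its vector argument separately.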

3. Activation functions and ReLU

The notes at the end of Chapter 3 introduce various activation functions and trace the history of their development.

I learned that ReLU was already in use very early on, that logistic sigmoid and tanh functions then became the common choice, and that ReLU later regained attention because it makes training more efficient.

I learned that ReLU is very cheap to compute during training because its derivative is simple (1 for positive inputs, 0 for negative inputs). The downside is a problem called "dying ReLU": for negative inputs the derivative is 0, so a hidden unit whose input is negative for every training example receives no gradient and stops learning.

To address this, I also learned that variants of ReLU such as Leaky ReLU and Parametric ReLU were developed, which keep the derivative simple while giving negative inputs a small nonzero slope, along with smooth functions such as softplus and Swish.
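The difference between these functions is easiest to see in their gradients for a negative input. The following is a minimal sketch in plain Python, not a library implementation. The slope `alpha=0.1` is an assumed example value (Parametric ReLU would learn this slope instead of fixing it), and Swish is written in its simplest form, `z * sigmoid(z)`.

```python
import math


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Zero for negative inputs: this is what causes "dying ReLU".
    # The value at exactly z == 0 is set to 0 here by convention.
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.1):
    # alpha is an assumed example slope for negative inputs.
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.1):
    # Small but nonzero gradient for negative inputs.
    return 1.0 if z > 0 else alpha


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softplus(z):
    # Smooth approximation of ReLU; fine for moderate z (exp overflows for very large z).
    return math.log1p(math.exp(z))


def swish(z):
    return z * sigmoid(z)


for z in (-2.0, 0.5):
    print(z, relu_grad(z), leaky_relu_grad(z), softplus(z), swish(z))
```

For the negative input, `relu_grad` returns 0 while `leaky_relu_grad` still returns the small slope, so the unit can keep learning.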

Most of the deep learning material I had read before simply used ReLU as the activation function, so this was a good chance to think about why ReLU is used and what the alternatives are.

4. Linear and Affine

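The form $y = \theta_0 + \theta_1 x$ that often appears in neural networks looks familiar and linear, but it is not actually linear.

A linear transform $f$ must satisfy $f(x_1 + x_2) = f(x_1) + f(x_2)$ and $f(c\,x) = c\,f(x)$, but the above equation does not satisfy this when $\theta_0 \neq 0$.

For example, if we assume $f(x) = \theta_0 + \theta_1 x$, then $f(x_1 + x_2) = \theta_0 + \theta_1 (x_1 + x_2)$, whereas $f(x_1) + f(x_2) = 2\theta_0 + \theta_1 (x_1 + x_2)$.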

Therefore, these equations are not linear and are called Affine instead.

An affine transformation is a linear transformation followed by a translation (a constant offset), so it is still a distinct concept from a linear one.
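The distinction can be checked numerically. A linear map must satisfy additivity, f(x1 + x2) = f(x1) + f(x2), and homogeneity, f(c x) = c f(x); a function of the form w x + b with a nonzero offset b fails both. This is a small sketch with hypothetical values `w=2` and `b=1`, using integers so the comparisons are exact.

```python
def affine(x, w=2, b=1):
    # Linear part w * x plus a constant offset b (assumed example values).
    return w * x + b


def linear(x, w=2):
    # Same slope, no offset.
    return w * x


def is_additive(f, x1, x2):
    return f(x1 + x2) == f(x1) + f(x2)


def is_homogeneous(f, c, x):
    return f(c * x) == c * f(x)


x1, x2 = 1, 2
print(is_additive(linear, x1, x2), is_homogeneous(linear, 3, x2))
print(is_additive(affine, x1, x2), is_homogeneous(affine, 3, x2))

# The additivity gap of the affine function is exactly the offset b.
print(affine(x1) + affine(x2) - affine(x1 + x2))
```

The gap printed on the last line equals `b`, which shows that the offset alone is what breaks linearity.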

However, the UDL book states that it follows the convention in the machine learning field and calls both affine and linear functions simply "linear".

I read this as emphasizing the distinction between nonlinear functions and linear (including affine) ones.

Reference

[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com