Overview
This series of posts is a set of study notes recording my progress through the book “Understanding Deep Learning”. This post covers Chapter 4, Deep neural networks.
1. The relationship between two layers is a composite function relationship.
Until now I had only thought of the computation as simply propagating through the hidden units of each layer. Working through the formulas in detail, I saw that it is a composition of functions, the kind I often met in high school. When drawing the output as a function of the input, thinking in terms of composite functions made the graph easy to understand.
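The composite-function view can be checked numerically. The sketch below is only an illustration: the zigzag network `f`, its parameter values, and the helper `count_monotone_pieces` are made up for this note and are not taken from the book. It feeds the output of a three-unit shallow network back into a copy of itself and counts how many rising or falling pieces the result has.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(x):
    """Hypothetical shallow network with 3 hidden units: goes up, down, up on [0, 1]."""
    return 3.0 * relu(x) - 6.0 * relu(x - 1/3) + 6.0 * relu(x - 2/3)

def count_monotone_pieces(y):
    """Count rising/falling pieces of a sampled zigzag via sign changes of the slope.

    This only counts linear regions correctly when neighbouring regions
    have slopes of opposite sign, which is the case for this example.
    """
    slope_sign = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(np.diff(slope_sign)))

# Grid chosen so that the joints (multiples of 1/9) fall on grid points.
x = np.linspace(0.0, 1.0, 901)

pieces_single = count_monotone_pieces(f(x))       # one network
pieces_composed = count_monotone_pieces(f(f(x)))  # composite function f(f(x))
print(pieces_single, pieces_composed)
```

With these assumed parameters the single network has 3 pieces and the composition has 9, because each of the three pieces of the inner function sweeps across all three pieces of the outer one.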
2. Getting Familiar with the General Formulation
While looking at this equation, I was confused about whether each of the parameter symbols had its own meaning, but it made more sense once I saw the generalized equation below. In fact, none of the symbols carried a special meaning of its own; they all just represented parameters.
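For reference, a general formulation for a network with $K$ hidden layers can be written roughly as follows. The symbols here are an assumption about the book's notation ($\boldsymbol\beta_k$ for bias vectors, $\boldsymbol\Omega_k$ for weight matrices, $\mathrm{a}[\cdot]$ for the activation function), so the exact form should be checked against Chapter 4:

$$
\begin{aligned}
\mathbf{h}_1 &= \mathrm{a}[\boldsymbol\beta_0 + \boldsymbol\Omega_0 \mathbf{x}] \\
\mathbf{h}_{k+1} &= \mathrm{a}[\boldsymbol\beta_k + \boldsymbol\Omega_k \mathbf{h}_k], \quad k = 1, \dots, K-1 \\
\mathbf{y} &= \boldsymbol\beta_K + \boldsymbol\Omega_K \mathbf{h}_K
\end{aligned}
$$

Written this way, every layer uses the same two kinds of parameters, which is why the individual letters in the earlier, expanded equations carry no special meaning.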
3. Size of the Weight Matrix
The general formulation right above states that if the $k$-th layer has $D_k$ hidden units, then the weight matrix $\boldsymbol\Omega_k$ has size $D_{k+1}\times D_k$. I didn't understand this well at first, but on reflection it is natural. $\boldsymbol\Omega_k$ is multiplied by $\mathbf{h}_k$, which has $D_k$ entries, so for the matrix product to be defined the matrix must have $D_k$ columns. The product feeds the $D_{k+1}$ hidden units of the next layer, so the matrix must have $D_{k+1}$ rows.
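The shape argument is easy to confirm in NumPy. This is a minimal sketch with hypothetical layer widths (`D_k = 4`, `D_k_plus_1 = 3`) and random values; the variable names are chosen for this note only.

```python
import numpy as np

D_k, D_k_plus_1 = 4, 3            # hypothetical widths of layers k and k+1
rng = np.random.default_rng(0)

h_k = rng.normal(size=D_k)                      # activations of layer k, D_k entries
Omega_k = rng.normal(size=(D_k_plus_1, D_k))    # rows = next width, columns = current width
beta_k = rng.normal(size=D_k_plus_1)            # one bias per unit in layer k+1

# One layer of the network: ReLU applied to bias plus weighted input.
h_next = np.maximum(beta_k + Omega_k @ h_k, 0.0)
print(Omega_k.shape, h_k.shape, h_next.shape)

# A matrix with the transposed shape (D_k x D_{k+1}) cannot multiply h_k.
try:
    Omega_k.T @ h_k
    transposed_ok = True
except ValueError:
    transposed_ok = False
print(transposed_ok)
```

The product only works when the number of columns matches the length of `h_k`, and the number of rows fixes the length of the next layer's activations.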
4. Linear regions with 2 hidden units
Problem 4.9 asks whether a shallow network with two hidden units can produce three linear regions like those in the figure below. Given the formulas for the hidden units and the output, such a network can certainly have three linear regions; the real question is whether those regions can oscillate between 0 and 1 as in the figure. The answer sheet says this is impossible, but it does not clearly state the reason.

Experimenting with the interactive figure provided with the book showed me why.
As the figure below shows for the first and second hidden units, each unit contributes a nonzero slope only over the part of the input where it is active. To oscillate between 0 and 1, the output needs two positive slopes and one negative slope. Two hidden units supply only two independent slopes, though, so the third region cannot be given a slope of its own; in the configuration in the figure it is simply 0.

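To be more precise, two hidden units give only two joints, so the x-axis is divided into three regions: [0, j1], [j1, j2], and [j2, 1]. Each unit is active on only one side of its joint and contributes a single slope there, say $a$ for the first unit and $b$ for the second. If both units are active on the same side, or on opposite sides without overlapping, one of the three regions has every unit inactive; its slope is 0, so the output cannot oscillate. If they are active on opposite sides and overlap in the middle, the slopes are $a$, $a+b$, and $b$, and $a+b$ cannot have the opposite sign from both $a$ and $b$, so the output still cannot oscillate.
On the other hand, with three or more units the parameters can be chosen so that at least one unit is active in every region, which leaves enough freedom to set the slope of each of the three sections.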
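The argument can also be checked numerically. The sketch below assumes the usual shallow ReLU form $y = \phi_0 + \sum_d \phi_d\,\mathrm{ReLU}(\theta_{d0} + \theta_{d1}x)$; the function names, the random search, and the three-unit parameter values are all made up for this note. It samples many random two-unit networks and looks for an up-down-up or down-up-down slope pattern, then builds one three-unit network that does oscillate.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow(x, theta0, theta1, phi0, phi):
    """y = phi0 + sum_d phi[d] * relu(theta0[d] + theta1[d] * x)."""
    h = relu(theta0 + theta1 * x[:, None])   # shape (len(x), D)
    return phi0 + h @ phi

def region_slopes(theta0, theta1, phi):
    """Slopes of a 2-unit network left of, between, and right of its two joints."""
    joints = np.sort(-theta0 / theta1)
    probes = np.array([joints[0] - 1.0, joints.mean(), joints[1] + 1.0])
    active = (theta0 + theta1 * probes[:, None]) > 0   # (3, 2) activation pattern
    return active.astype(float) @ (phi * theta1)

def is_zigzag(s):
    """True if neighbouring regions have slopes of strictly opposite sign."""
    return bool(s[0] * s[1] < 0 and s[1] * s[2] < 0)

# Random search over 2-unit networks: no zigzag should ever appear.
rng = np.random.default_rng(0)
zigzags_found = 0
for _ in range(10_000):
    theta0, theta1, phi = rng.normal(size=(3, 2))
    if is_zigzag(region_slopes(theta0, theta1, phi)):
        zigzags_found += 1

# A 3-unit network (assumed parameters) that oscillates 0 -> 1 -> 0 -> 1 on [0, 1].
theta0_3 = np.array([0.0, -1/3, -2/3])
theta1_3 = np.ones(3)
phi_3 = np.array([3.0, -6.0, 6.0])
x = np.array([0.0, 1/3, 2/3, 1.0])
y3 = shallow(x, theta0_3, theta1_3, 0.0, phi_3)
print(zigzags_found, y3)
```

The random search is not a proof, but it agrees with the reasoning: no two-unit sample zigzags, while the three-unit network takes values of approximately 0, 1, 0, 1 at the region boundaries. In the three-unit example the first unit is active over the whole interval, which supplies the nonzero starting slope that two units alone cannot provide.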
Reference
[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com