Overview
This series of posts is a set of study notes recording my progress through the book “Understanding Deep Learning”. This post covers Chapter 5, Loss functions.
1. Why approach with a probability distribution?
Up to the previous chapter, the book assumed that the model maps an input directly to an output. In this chapter the perspective changes abruptly: the model is now asked to compute a conditional probability distribution $Pr(y \mid x)$ over the possible outputs $y$ given the input $x$. The book does not explain the motivation in much detail, so I looked it up.
The first reason is that having the model compute a probability distribution gives a principled way to choose the loss function. In univariate regression, for example, least squares is the usual loss, but the absolute value of the difference or some other error measure could be used instead, and nothing tells us which one is right. From the probabilistic viewpoint, it is reasonable to model the output of a univariate regression with a univariate normal distribution, and if we take the negative log likelihood of that distribution (treating the variance as a constant), the result is exactly least squares. The loss function therefore has a mathematical justification rather than being an arbitrary choice.
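The link between the normal distribution and least squares can be checked numerically. The following is a minimal sketch, assuming a fixed standard deviation `sigma` and a toy model that predicts one constant value for every example; the function names and the data are made up for illustration and do not come from the book.

```python
import math


def gaussian_nll(y_true, y_pred, sigma):
    """Negative log likelihood of the targets under N(y_pred, sigma^2)."""
    n = len(y_true)
    # This term does not depend on the predictions when sigma is fixed.
    const = 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
    squared = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return const + squared / (2.0 * sigma ** 2)


def sum_squared_error(y_true, y_pred):
    """Plain least squares criterion."""
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred))


# Toy targets (hypothetical) and a grid of constant predictions 0.00 .. 5.00.
y_true = [1.0, 2.0, 4.0]
candidates = [i / 100 for i in range(501)]
sigma = 2.0  # assumed constant; any positive value gives the same minimizer

best_by_nll = min(
    candidates, key=lambda c: gaussian_nll(y_true, [c] * len(y_true), sigma)
)
best_by_sse = min(
    candidates, key=lambda c: sum_squared_error(y_true, [c] * len(y_true))
)

# Both criteria select the same prediction, the grid point nearest the mean.
print(best_by_nll, best_by_sse)
```

Because the negative log likelihood is just the sum of squared errors scaled by `1 / (2 * sigma ** 2)` plus a constant, minimizing one is the same as minimizing the other.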
The second reason is that a probability distribution lets the model express uncertainty, so we can tell how confident it is in a prediction. A model that predicts a value directly says nothing about how plausible that value is compared with the alternatives, but if the model also computes the variance, we can quantify how reliable the prediction is.
In short, the book takes the probabilistic approach for two reasons: it gives a clear basis for choosing the loss function according to the type of data and the problem being solved, and it makes it possible to obtain uncertainty.
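To make the role of the variance concrete, here is a small sketch, assuming the model outputs both a mean and a variance for a single target. The function name and the numbers are hypothetical and only illustrate how the negative log likelihood treats confident and unconfident predictions.

```python
import math


def gaussian_nll_with_variance(y, mean, variance):
    """Negative log likelihood of one target y under N(mean, variance)."""
    return 0.5 * math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / (
        2.0 * variance
    )


y = 0.0  # observed target (made-up value)

# The same wrong mean, predicted with high and with low confidence.
confident_wrong = gaussian_nll_with_variance(y, mean=2.0, variance=0.1)
uncertain_wrong = gaussian_nll_with_variance(y, mean=2.0, variance=4.0)

# The same correct mean, predicted with high and with low confidence.
confident_right = gaussian_nll_with_variance(y, mean=0.0, variance=0.1)
uncertain_right = gaussian_nll_with_variance(y, mean=0.0, variance=4.0)

print(confident_wrong > uncertain_wrong)  # overconfidence in a miss costs more
print(confident_right < uncertain_right)  # confidence in a hit is rewarded
```

The loss rewards a small variance only when the mean is close to the target, which is why the predicted variance can be read as a measure of how much to trust the prediction.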
2. Cross-entropy and Dirac delta function
In section 5.7 of the book, there is a part showing that cross-entropy is actually equivalent to the negative log likelihood. I did not understand this part well at first because the Dirac delta function appears out of nowhere.

The Dirac delta function is zero everywhere except at a single point, where it diverges to infinity, and it integrates to 1.
In the book's derivation it appears as $\delta[y - y_i]$ inside the empirical data distribution $q(y) = \frac{1}{I}\sum_{i=1}^{I}\delta[y - y_i]$, but I did not understand why it suddenly showed up.
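One way to build intuition for the delta function is to approximate it with a very narrow normal density and integrate numerically. The sketch below assumes this approximation; `narrow_gaussian`, `integrate`, the test function `g`, and the point `y_i = 2.0` are all illustrative choices, not something taken from the book.

```python
import math


def narrow_gaussian(y, center, width):
    """Normal density with a small standard deviation, standing in for the delta."""
    z = (y - center) / width
    return math.exp(-0.5 * z * z) / (width * math.sqrt(2.0 * math.pi))


def integrate(func, lo, hi, steps):
    """Midpoint Riemann sum of func over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * h) for k in range(steps)) * h


def g(y):
    """Arbitrary smooth function to be 'picked out' by the delta."""
    return y ** 2


y_i = 2.0     # location of the spike (made-up value)
width = 0.01  # the smaller this is, the closer the bump is to a true delta

# Integrate over +/- 10 widths around the spike.
lo, hi = y_i - 0.1, y_i + 0.1
total_mass = integrate(lambda y: narrow_gaussian(y, y_i, width), lo, hi, 20000)
sifted = integrate(lambda y: narrow_gaussian(y, y_i, width) * g(y), lo, hi, 20000)

print(total_mass)       # close to 1: the spike integrates to one
print(sifted, g(y_i))   # close to each other: the integral picks out g(y_i)
```

The second result is the property that matters here: integrating any function against a delta centered at `y_i` returns the function's value at `y_i`.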

After rereading it several times, I could see why it appears. The cross-entropy loss is a loss that minimizes the distance between two probability distributions; concretely, it is the Kullback-Leibler (KL) divergence that is minimized. Since the KL divergence measures how far apart two probability distributions are, there must be two distributions to compare. Earlier, we changed our perspective so that the model computes a probability distribution. What we need to compare, then, is the distribution predicted by the model and the distribution of the actual correct answers. But a correct answer is not a probability distribution, just a number. The Dirac delta function is introduced to turn each correct answer into a probability distribution: it places all of its probability mass at the observed value and none anywhere else. Computing the KL divergence with this distribution shows that the cross-entropy loss is the same as the negative log likelihood.
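The argument can be written out step by step. The notation here is my own reconstruction and should be read as an assumption: $q(y)$ is the empirical distribution of the correct answers $y_1, \dots, y_I$, $Pr(y \mid \theta)$ is the model's distribution with parameters $\theta$, and the first line is the KL divergence between the two.

$$
\begin{aligned}
\hat{\theta}
&= \underset{\theta}{\mathrm{argmin}} \left[ \int q(y) \log q(y)\, dy - \int q(y) \log Pr(y \mid \theta)\, dy \right] \\
&= \underset{\theta}{\mathrm{argmin}} \left[ -\int q(y) \log Pr(y \mid \theta)\, dy \right] \\
&= \underset{\theta}{\mathrm{argmin}} \left[ -\int \left( \frac{1}{I} \sum_{i=1}^{I} \delta[y - y_i] \right) \log Pr(y \mid \theta)\, dy \right] \\
&= \underset{\theta}{\mathrm{argmin}} \left[ -\frac{1}{I} \sum_{i=1}^{I} \log Pr(y_i \mid \theta) \right] \\
&= \underset{\theta}{\mathrm{argmin}} \left[ -\sum_{i=1}^{I} \log Pr(y_i \mid \theta) \right]
\end{aligned}
$$

The first term is dropped in the second line because it does not depend on $\theta$. The integral disappears in the fourth line because integrating against $\delta[y - y_i]$ simply evaluates the integrand at $y = y_i$. The constant factor $1/I$ does not change the minimizer, so the last line is the negative log likelihood.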
Reference
[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com