[UDL Study Notes] Chapter 12 - Transformers

Jzahnny

Overview

This series of posts is a set of study notes recording what I learn from the book “Understanding Deep Learning” [1]. This post covers Chapter 12, Transformers.

Positional Encoding

[Image: Figure 12.5 from the book]
Figure 12.5 in the book looks like this. It accompanies the explanation that self-attention on its own ignores the order of its inputs, and that positional encoding is what lets the model take that order into account. I knew why positional encoding was used, but I didn't understand what the graph itself meant. After looking into it more and thinking it over, I realized that it is actually very simple.
[Image: figure from the book]
The book also has this figure, which shows the tokens laid out in sequence. Figure 12.5 lays out the token positions in the same way, but with every embedding dimension visible. The values follow a sine-wave-like pattern, so each position gets a different encoding, and that is how a permuted input ends up looking different to the model.
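
To make the "sine-wave-like pattern" concrete, here is a minimal sketch in plain Python. It assumes the sinusoidal formula from the original Transformer paper, which may not be exactly what was used to draw Figure 12.5. The function names and the toy embeddings `a` and `b` are made up for illustration.

```python
import math

def sinusoidal_encoding(num_positions, dim):
    """Return one encoding vector of length `dim` per position.

    Assumed formula (original Transformer paper, not necessarily the one
    behind Figure 12.5): even dimensions use sine, odd dimensions use
    cosine, and the wavelength grows geometrically across dimensions.
    """
    table = []
    for pos in range(num_positions):
        vec = []
        for d in range(dim):
            angle = pos / (10000 ** (2 * (d // 2) / dim))
            vec.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        table.append(vec)
    return table

def add_position(tokens, table):
    """Add the encoding for position n to the n-th token embedding."""
    return [[x + p for x, p in zip(tok, table[n])]
            for n, tok in enumerate(tokens)]

# Toy embeddings for two tokens (hypothetical values).
a, b = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
table = sinusoidal_encoding(num_positions=8, dim=4)

# Every position gets a different encoding vector.
all_unique = len({tuple(v) for v in table}) == len(table)

# The vector the model sees for token `a` depends on where `a` sits.
a_first = add_position([a, b], table)[0]
a_second = add_position([b, a], table)[1]
order_matters = a_first != a_second

print(all_unique, order_matters)
```

Both checks should come out true: the eight positions all receive distinct vectors, and swapping the two tokens changes what the model sees for `a`.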
 

Absolute vs Relative Position Encoding

The part on positional encoding compares absolute and relative encodings, and at first I didn't know exactly what that meant. A simple example made it clear at once. An absolute encoding describes where a word sits in the whole input, for example by telling the model that this is the 157th word. A relative encoding instead describes the offset between two words, for example -2 for a word that sits two positions earlier. In language, the relative position of two words usually matters more than the absolute position of a single word, which is the motivation for the relative form.
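
The difference can be sketched with a toy sentence. This is only an illustration of the two notions of position, not how a transformer actually stores them. The helper names and the example words are hypothetical.

```python
def absolute_positions(words):
    """Index of each word, counted from the start of the input."""
    return {w: i for i, w in enumerate(words)}

def relative_offset(words, query, key):
    """Signed distance from `query` to `key`; negative means `key` is earlier."""
    return words.index(key) - words.index(query)

sentence = ["the", "cat", "sat", "down"]
# The same sentence appearing later in a longer input (made-up prefix words).
shifted = ["well", "then"] + sentence

abs_before = absolute_positions(sentence)["sat"]
abs_after = absolute_positions(shifted)["sat"]
rel_before = relative_offset(sentence, "sat", "the")
rel_after = relative_offset(shifted, "sat", "the")

print(abs_before, abs_after)  # absolute index of "sat" changes
print(rel_before, rel_after)  # offset from "sat" to "the" stays at -2
```

Shifting the sentence changes the absolute index of "sat" from 2 to 4, while the offset from "sat" back to "the" is -2 in both cases.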
 

One-hot vector

[Image: Figure 12.9 from the book]
Figure 12.9 shows that the input embedding is created by multiplying the vocabulary embedding matrix by a matrix of one-hot vectors. I didn't know exactly what a one-hot vector was, but after looking it up it turned out to be a simple concept. A one-hot vector has a single entry equal to 1 and all other entries equal to 0, so multiplying a matrix by it picks out one specific column of that matrix. In Figure 12.9, this is how the embedding of each input word is pulled out of the vocabulary embeddings.
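
The column-selection effect is easy to check by hand. The sketch below uses plain Python lists and a made-up 2 x 3 embedding matrix, so the numbers and helper names are assumptions for illustration only.

```python
def one_hot(index, size):
    """Vector of length `size` with a 1 at `index` and 0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical vocabulary embeddings: 2 embedding dimensions (rows) and
# 3 vocabulary entries (columns), so column k is the embedding of entry k.
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

token_id = 2
selected = matvec(embeddings, one_hot(token_id, size=3))
column = [row[token_id] for row in embeddings]

print(selected == column)
```

The product equals column 2 of the matrix. Stacking one such one-hot vector per input token gives the one-hot matrix described above, and the same multiplication then returns one embedding per token.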

Reference

[1] Prince, S. J. D. (2023). Understanding Deep Learning. The MIT Press. Retrieved from http://udlbook.com