Transformer Model: Self Attention - Implementation with In-Depth Details
Medium Post - [ Link ]
In this tutorial, we'll implement the self-attention module using PyTorch, as discussed in the previous video.
Specifically, we'll implement the following steps-
1. Create Query, Key, and Value using input vectors.
2. Compute attention scores using Query and the transpose of Key (Kᵀ).
3. Convert the attention scores to a probability distribution using softmax.
4. Compute weighted values by multiplying each attention score with its corresponding value vector.
[Q₁ x K₁] * V₁, [Q₁ x K₂] * V₂ … [Q₁ x Kₙ] * Vₙ
[Q₂ x K₁] * V₁, [Q₂ x K₂] * V₂ … [Q₂ x Kₙ] * Vₙ
…
[Qₙ x K₁] * V₁, [Qₙ x K₂] * V₂ … [Qₙ x Kₙ] * Vₙ
Where "x" is the dot product and "*" is the pointwise matrix multiplication. Also, Qₙ is defined as-
Q = [
[0, 1, 1], # Q₁
[4, 6, 0], # Q₂
[2, 3, 1], # Q₃
]
Similarly, Vₙ is a row of the Value matrix, and Kₙ is a row of the Key matrix (i.e., a column of Kᵀ).
5. Add up the weighted values computed from the scores of a particular query.
[Q₁ x K₁] * V₁ + [Q₁ x K₂] * V₂ + … + [Q₁ x Kₙ] * Vₙ (R₁)
[Q₂ x K₁] * V₁ + [Q₂ x K₂] * V₂ + … + [Q₂ x Kₙ] * Vₙ (R₂)
…
[Qₙ x K₁] * V₁ + [Qₙ x K₂] * V₂ + … + [Qₙ x Kₙ] * Vₙ (Rₙ)
To make this easier to follow, we detail the output of each of these steps in the video.
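As a rough sketch of how these five steps might map onto a PyTorch module (the class name, the head dimension, and the bias-free nn.Linear projections are assumptions, not details taken from the video; the 1/sqrt(d_k) scaling used in the original Transformer is also left out to mirror the steps listed above):

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Minimal self-attention for a single (non-batched) input of shape (seq_len, embed_dim).
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        # Step 1: linear projections that produce Query, Key, and Value from the input vectors.
        self.q_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        q = self.q_proj(x)              # (seq_len, head_dim)
        k = self.k_proj(x)              # (seq_len, head_dim)
        v = self.v_proj(x)              # (seq_len, head_dim)

        # Step 2: scores[i, j] = Q_i x K_j (dot product of query i with key j).
        scores = q @ k.T                # (seq_len, seq_len)

        # Step 3: softmax over each row turns one query's scores into a probability distribution.
        weights = torch.softmax(scores, dim=-1)

        # Steps 4 and 5: scale each value V_j by its weight and sum per query;
        # the matrix product does both at once, giving one output row R_i per query.
        return weights @ v              # (seq_len, head_dim)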
The current implementation handles only a single input. In the next video, we'll extend it to a batched input. Stay tuned.
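For example, a hypothetical single (non-batched) input pass through the sketch above could look like this (the sizes used here are made up for illustration):

import torch

torch.manual_seed(0)                            # reproducible random input
x = torch.randn(4, 8)                           # one sequence: 4 tokens, embedding size 8
attn = SelfAttention(embed_dim=8, head_dim=3)   # class from the sketch above
out = attn(x)
print(out.shape)                                # torch.Size([4, 3]) -- one output row per query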
The code used in this tutorial is available here-
[ Link ]
Tutorial on Indexing and Slicing-
[ Link ]
Chapters-
0:00 - Background
0:09 - 5 steps of self attention implementation
2:17 - Implement __init__ method of self attention class
5:09 - Implement forward method of self attention class - compute query, key and value
6:12 - Compute attention scores
7:24 - Convert attention scores to a probability distribution
8:24 - Compute weighted values
17:11 - Compute output
18:13 - Update the weights of the linear layers for query, key, and value and verify the output
20:24 - Next video
#pytorch #tutorial #self #attention #transformer #selfattention