Introduction to Transformer

Ramjee Rajasekaran
10 min read · Dec 25, 2020


This is my second post, and this time around I'm excited to write about the Transformer architecture. I learnt the Transformer architecture a few months back by reading Jay Alammar's famous post "The Illustrated Transformer" and was very much inspired by it. What follows is my understanding of the Transformer architecture. Please also go through my earlier post for a quick view of RNN, LSTM, GRU, and attention.

Background

Self-attention is the core concept needed to understand the Transformer. Before we get into Transformers and self-attention, I wanted to quickly go through the earlier methods and their limitations.

Context Vector in Encoder Decoder Model

In traditional encoder-decoder models, a context vector is created at the end of the encoder section, and this context vector is used at every decoder step, along with the decoder outputs up to time step t-1, to predict the decoder output at time step t. The key limitation here is the reliance on a single context vector to decode and generate output. These models used to work well on shorter sequences, but they had serious limitations when it came to longer sequences. On a lighter note, it is like you studying for several hours the night before your exam but not being able to remember everything when it comes to writing, and eventually writing some stories based on whatever context you recall :).

Attention in Neural Machine Translation Models

Attention as a concept already existed in NMT models. Here, in addition to the context vector from the last step of the encoder, one hidden state per encoder time step is sent across to the decoder. To predict an output at time step t, the decoder focuses on, or attends to, the inputs at the relevant time steps of the encoder, along with the context vector and the decoder outputs up to t-1. Basically, some words are more relevant than others when it comes to decoding an output at a given time step t. There is math behind how the relevance of input tokens is calculated, and this concept is called attention. We will see the math in detail when we discuss self-attention in the Transformer. Before we move on to the Transformer, we need to know the key limitations of NMT: performance drops on longer sequences, and training lacks parallelization.

Transformer:

The Transformer was first introduced in the paper "Attention Is All You Need" in 2017. To begin, let us consider the Transformer as a black box that takes a series of tokens as input and produces a series of tokens as output. Here we can consider the Transformer to be performing translation as a task: input your sentence in English, and the Transformer translates it to the equivalent Spanish sentence. The same can be accomplished by encoder-decoder models and NMT as well, but the Transformer brings two key advancements over the earlier approaches that make it the de facto standard for the majority of NLP tasks.

  1. Self-attention: The Transformer can take longer sequences while attending to and retaining context better than earlier approaches.
  2. Parallelization: The Transformer inherently supports parallelization and is hence much faster.

Let's now uncover the Transformer and see what's inside.

Transformer architecture as explained in the "Attention Is All You Need" paper.

As you can see in the Transformer architecture, there are two major blocks, the smaller block on the left is the encoder and the one on the right is the decoder.

The encoder block consists of the following (let's ignore the Add & Norm layers for now):

  1. Multi-Head Attention or self-attention
  2. Feed Forward Layer

The decoder block consists of the following:

  1. Masked Multi-Head Attention or self-attention
  2. Multi-Head Attention or Encoder-Decoder Attention
  3. Feed Forward Layer

Encoder

Encoder from Attention is All you Need

As mentioned above, the encoder has two major layers: self-attention and the feed-forward layer. In reality, though, there are 6 such combinations of self-attention and feed-forward layers stacked on top of each other. Now, let's go through the functionality of the first block in detail. All the words, or tokens as you may call them, are represented in vector form through embeddings. There are several types of word embeddings available for NLP tasks. A word embedding is a learned representation for text where words that are synonymous or similar in meaning have a similar representation in a multi-dimensional vector space.
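To make this concrete, here is a minimal sketch of an embedding lookup table, with a hypothetical toy vocabulary and randomly initialised vectors standing in for a real trained embedding:

```python
import numpy as np

# Hypothetical toy vocabulary; a real model would have tens of thousands of tokens.
vocab = {"do": 0, "you": 1, "like": 2, "sandwiches": 3}
d_model = 512                        # embedding size used in the paper

# A learned embedding is just a (vocab_size x d_model) matrix; random here.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up one 512-dimensional vector per token."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]      # shape: (seq_len, d_model)

x = embed(["do", "you", "like", "sandwiches"])
print(x.shape)                       # (4, 512)
```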

Input tokens with Dimension 512

The input tokens in embedded form are fed to the encoder. Each word is represented as a 512-dimensional vector, to which the positional encoding of the word is added. Positional encodings are a way to tell the encoder the position of each token in the input. Earlier network architectures such as RNNs, which are sequential per se, are already aware of the position of a word in a sentence (RNNs are inherently recurrent), but Transformers do not use recurrence, so they needed a way to incorporate the order of the words in a sentence. There are many ways to feed positional embeddings, but the authors of the paper suggested using combinations of sine (even dimensions) and cosine (odd dimensions). These two functions provide the necessary linear properties for the model to learn the relative positions of tokens.

Sine and cosine functions for positional embeddings, as given in the "Attention Is All You Need" paper:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
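Here is a minimal NumPy sketch of these sinusoidal positional encodings (the sequence length and variable names are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# The encoder input is simply: token embedding + positional encoding
# x = embed(tokens) + positional_encoding(len(tokens))
```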

So, what happens inside the Self-Attention layer?

This is the right time to introduce the concept of self-attention. Before we get to the math, I want to give you an analogy for self-attention. Picture your favourite beach with hundreds of vacationers basking under the sun, kids playing in the water, surfers riding the waves, and everybody having fun, but there is a lifeguard on duty whose primary job is to save people in danger or in any unlikely scenarios. The lifeguard will pay more attention to the people who are more likely to be in danger than to those in safer places. Similarly, the role of the self-attention layer is to ensure that each token looks at every other token in the sequence and identifies which ones to pay more attention to than others. Let's spend some more time and see how the attention is calculated.

Let me introduce the formula for self-attention:

Attention(Q, K, V) = softmax((Q · Kᵀ) / √d_k) · V

We need to look at the R.H.S., where the major components are Q, K, and V.

Q is the query vector, K is the key vector, and V is the value vector.

Every token in the sequence has its own set of Q, K, and V vectors. Each of these vectors is formed by matrix multiplication of the token's input embedding with the corresponding weight matrices Wq, Wk, and Wv. These are the weights that get updated during training and back-propagation.
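A minimal sketch of these projections, with randomly initialised matrices standing in for the learned Wq, Wk, and Wv:

```python
import numpy as np

d_model, d_k = 512, 64                       # sizes used in the paper
rng = np.random.default_rng(0)

# Learned in a trained model; random placeholders here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

x = rng.normal(size=(4, d_model))            # 4 token embeddings (+ positional encodings)

Q = x @ W_q                                  # (4, 64): one query vector per token
K = x @ W_k                                  # (4, 64): one key vector per token
V = x @ W_v                                  # (4, 64): one value vector per token
```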

Let’s break the above equation in parts.

Part 1: First — the numerator.

The query vector of every token is multiplied with the key vectors of every token in the sequence. If I were to draw an analogy: you go to a street full of restaurants serving a variety of cuisines. Your interest in eating a particular type of meal is your Q vector, and what each of the restaurants serves is its own K vector. You will be more interested in the restaurants that serve the cuisine of your liking for that day. Let's say you want to eat Italian; you will be more likely to eat in the restaurant that serves pastas and pizzas :). Your Q values are likely to give higher attention to the K vectors of the restaurants that serve Italian cuisine.

Conceptual multiplication of the Q and K values of each token if your input sequence is "Do you like sandwiches". The values in the second table are the products of the Q and K values of every pair of tokens.
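In code, this numerator is just a matrix product of the queries with the transposed keys. A minimal self-contained sketch for a hypothetical 4-token sequence like "Do you like sandwiches" (random stand-in vectors):

```python
import numpy as np

# One 64-dimensional Q and K vector per token of "Do you like sandwiches".
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(4, 64))

scores = Q @ K.T   # (4, 4): scores[i, j] = raw attention of token i toward token j
```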

Part 2: Denominator — Scaling the Q*K Values down

Here the dot products of Q and K are scaled down by the number 8. Eight is used because every token, though of length 512, is broken down into eight pieces, each of length 64. So dk is 64, and the square root of 64 is 8. This is an architectural choice made by the authors.

Divide by the square root of 64, which is 8.

Part 3: Softmax (Part1 and Part 2)

Softmax is applied to suppress the lower values while heightening the higher values. It also keeps the values between 0 and 1, so that each token's attention weights sum to 1.

softmax
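Parts 2 and 3 as a minimal sketch: scale the raw scores by √d_k = 8 and apply a row-wise softmax (placeholder score values):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))             # raw Q·Kᵀ scores from Part 1 (placeholders)

scaled = scores / np.sqrt(64)                # Part 2: divide by sqrt(d_k) = 8

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scaled)                    # Part 3: each row is positive and sums to 1
```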

Part 4: Multiply with Value Vector

The softmax scores are then multiplied with the value vectors to get the self-attention output for each of the tokens.

If the embedding of a token is of length 512, it is broken down into 8 equal pieces, each of length 64. Self-attention is calculated for each of the 8 pieces with its own trained Q, K, and V vectors. This is called multi-head self-attention. It improves the model's performance by improving the representation of token subspaces.
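Putting Parts 1 to 4 together, here is a minimal multi-head self-attention sketch: 8 heads of size 64, each computing softmax(QKᵀ/√d_k)·V with its own (random stand-in) projection matrices, with the head outputs concatenated back to 512 dimensions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=8, d_model=512, seed=0):
    """x: (seq_len, d_model) token embeddings (+ positional encodings)."""
    d_k = d_model // num_heads                      # 512 / 8 = 64
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Learned in a real model; random placeholders here.
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_k))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len) attention weights
        heads.append(weights @ V)                   # (seq_len, d_k) per head
    W_o = rng.normal(size=(d_model, d_model))       # output projection
    return np.concatenate(heads, axis=-1) @ W_o     # back to (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(4, 512))
print(multi_head_self_attention(x).shape)           # (4, 512)
```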

Image from the "Attention Is All You Need" paper showing the input embedding projected three times into Q, K, and V vectors, and describing the process of the self-attention calculation.

Image source: "Attention Is All You Need", describing the self-attention flow.

Residual Connection, Layer Norm and Feed forward Layer:

The outputs of the attention heads are concatenated, added back to the layer's input (the embeddings with positional encodings) as a residual connection, passed through layer normalization, and then fed to the feed-forward layer. These steps further enrich the representational capacity of the input while stabilizing the network and reducing the training time.
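A minimal sketch of the Add & Norm and feed-forward steps that follow the attention layer (random weights as placeholders; the inner feed-forward size of 2048 is the value from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, d_model=512, d_ff=2048, seed=0):
    """Two linear layers with a ReLU in between (d_ff = 2048 in the paper)."""
    rng = np.random.default_rng(seed)
    W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
    return np.maximum(0, x @ W1) @ W2

def add_norm_and_ffn(x, attn_out):
    """x: block input, attn_out: concatenated multi-head attention output."""
    h = layer_norm(x + attn_out)             # residual connection + layer norm
    return layer_norm(h + feed_forward(h))   # residual connection + layer norm

rng = np.random.default_rng(1)
x, attn_out = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
print(add_norm_and_ffn(x, attn_out).shape)   # (4, 512)
```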

So, as we just saw, the self-attention and feed-forward layers together form one encoder block. You can stack as many encoder blocks as you want. In the "Attention Is All You Need" paper, the authors used 6 such blocks.

Parallelization:

The Transformer encoder allows the entire input sequence to be processed in parallel. Although there are dependencies between token paths in the self-attention layer, they can still be executed in parallel as matrix operations. There are no such dependencies at all within the feed-forward layer.

Image source: "Attention Is All You Need" paper. The bolder the green lines, the higher the attention; the lighter the green lines, the lower the attention.

In short, context-unaware and independently trained word embeddings go through the encoder and come out as context-aware vectors. Each token in the input knows how much attention or focus should be placed on every other token in the input sequence.

Decoder

As we saw earlier, the decoder block consists of the following 3 sub-layers:

  1. Masked Multi-Head Attention or self-attention
  2. Multi-Head Attention or Encoder-Decoder Attention
  3. Feed Forward Layer

The functioning of the 1st and 3rd sub-layers is the same as that of the self-attention and feed-forward layers of the encoder, with one exception: at every time step, the decoder will not have access to any of the future tokens of the target sequence (they are masked out). The decoder is auto-regressive in nature. The second sub-layer within the decoder is the encoder-decoder attention layer. In the encoder-decoder attention layer, the Q vectors are taken from the decoder's self-attention layer, and the K and V vectors are taken from the output of the encoder stack. This is one of the most crucial steps, as it is where the decoder learns which words of the input sequence to attend to for each token of the target sequence.
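A minimal sketch of how future tokens can be masked out in the decoder's self-attention: positions to the right of the current token receive a score of negative infinity before the softmax, so their attention weights become zero. (This particular mask construction is an illustrative implementation choice, not code from the paper.)

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions <= i."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
    return np.where(future == 1, -np.inf, 0.0)

# Applied inside the decoder's self-attention before the softmax:
#   weights = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len))
print(causal_mask(4))
```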

Representation of the flow from the encoder to the decoder, and of the decoder steps.

At every time step, the following activities happen, and the entire sequence is repeated in an auto-regressive manner.

a. The input to the decoder is the target sequence, passed one token per step. The tokens go through the decoder self-attention layer without access to future tokens, and the output of the decoder self-attention layer goes through the encoder-decoder attention layer.

b. Inside the encoder-decoder attention layer, the K and V vectors from the encoder output are used with the Q vectors of the decoder self-attention block. This ensures that the decoder learns to build attention between the input sequence and the output sequence.

c. The output from the encoder-decoder attention layer is passed on to the residual connection, layer norm, and feed-forward layers, just as in the encoder block.

As per the architecture, there are six such decoder blocks stacked one on top of the other.

d. At the end, there is a linear layer followed by a softmax to select the right output token from the vocabulary. If our vocabulary is 30K tokens long, the output layer selects one of those 30K tokens based on probability.
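A minimal sketch of this final step, with a placeholder 30K vocabulary and a random projection matrix standing in for the learned one; greedy decoding simply picks the highest-probability token:

```python
import numpy as np

d_model, vocab_size = 512, 30_000
rng = np.random.default_rng(0)
W_vocab = rng.normal(size=(d_model, vocab_size))    # learned in a real model

def predict_next_token(decoder_output_last_step):
    """decoder_output_last_step: (d_model,) vector for the current position."""
    logits = decoder_output_last_step @ W_vocab     # (vocab_size,) scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the 30K vocabulary
    return int(np.argmax(probs))                    # greedy choice of the next token id

next_id = predict_next_token(rng.normal(size=(d_model,)))
```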

Training and Loss Function

During the training phase, the model's prediction at every time step is compared with the ground-truth word from the labelled data. Based on this comparison, the loss is calculated using a loss function (typically cross-entropy), and the necessary gradients are back-propagated to update the model weights.
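A minimal sketch of a per-step cross-entropy loss, with toy numbers for a hypothetical 5-token vocabulary:

```python
import numpy as np

def cross_entropy(predicted_probs, target_id):
    """Negative log-probability that the model assigned to the correct token."""
    return -np.log(predicted_probs[target_id] + 1e-12)

# Toy example: 5-token vocabulary, ground-truth token id is 2.
predicted = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
print(cross_entropy(predicted, target_id=2))   # about 0.51
```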

This is a very simplified description of the Transformer architecture. Thanks for your time, and please leave your comments below. I'm planning to write more on Transformer variants such as BERT, GPT, Reformer, etc., so please stay connected.

Special thanks to the following sources for their detailed content. I definitely recommend that you refer to them for a more detailed understanding:

The Illustrated Transformer by Jay Alammar, and Illustrated Guide to Transformers Neural Network: A Step by Step Explanation by Michael Phi.
