Transformer Learning Notes

Posted by Tianle on July 14, 2025

Transformer

Transformer architecture: encoder and decoder stacks built on self-attention

Why Transformer

  • Parallel Processing of Words (unlike RNNs, which process tokens one at a time)
  • Capturing Long-Range Dependencies
  • Scalability and Performance
  • Improved Contextual Representation (Self Attention)
  • Facilitation of Transfer Learning

Building Blocks

Core Components

  • Input Embedding: Convert words into numerical vectors
  • Positional Encoding: Add sequence-order information to the input embeddings (see the sketch after this list)
  • Multi-Head Self-Attention: Allows the model to weigh input parts for context
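
A minimal NumPy sketch of the sinusoidal positional encoding used in the original Transformer paper; the sequence length and model dimension below are arbitrary toy values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions shares one frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

# Example: add position information to toy word embeddings
seq_len, d_model = 6, 16
embeddings = np.random.randn(seq_len, d_model)   # stand-in for learned input embeddings
encoder_input = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```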

Encoder and Decoder Layers

  • Encoders: Process the input text
  • Decoders: Generate the output sequence
  • Interconnected through attention mechanisms

Output Generation

  • Linear Layer: Projects the decoder output to a logits vector with one score for every token in the vocabulary (see the sketch after this list)
  • Softmax Layer: Converts the logits into probabilities for each token
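
A minimal NumPy sketch of these two output steps, assuming toy sizes and randomly initialized projection weights in place of the learned ones:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 16, 1000                     # toy sizes for illustration
decoder_output = np.random.randn(d_model)          # stand-in for the decoder's final hidden state
W = np.random.randn(d_model, vocab_size) * 0.01    # learned projection in a real model
b = np.zeros(vocab_size)

logits = decoder_output @ W + b          # linear layer: one score per vocabulary token
probs = softmax(logits)                  # softmax layer: probabilities that sum to 1
next_token_id = int(np.argmax(probs))    # greedy choice of the next token
```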

LLM Emerging Properties

  • Contextual Understanding
  • Zero-Shot Learning
  • Complex Problem-Solving
  • Artistic Creation
  • Technical Coding

The Attention Mechanism

  • Weighted Importance: Determines the relevance of each part of the input to the task at hand
  • Contextual Awareness: Helps in understanding the context by considering the entire input sequence
  • Sequence Learning: Enables the capture of relations and dependencies between elements in the sequence, regardless of their position

Self-attention is a mechanism in a neural network that allows each part of the input data, like words in a sentence, to look at other parts of the sentence to better understand the context.
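
In matrix form, this is the scaled dot-product attention formula from the original Transformer paper, which the steps below walk through:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the key vectors.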

How Does Self Attention Work

Step 1: Create three vectors from the encoder's input embedding for each word:

  • Query Vector: what this word is looking for in the other words (its relevant inbound value)
  • Key Vector: what this word offers for other words to match against (its relevant outbound value)
  • Value Vector: the actual content of the word that gets passed along

Step 2: Compute attention scores, which determine how much each word in the sentence should focus on every other word. This is done by taking the dot product of the word's Query vector with each word's Key vector.

Step 3: Scale down the attention scores by dividing them by the square root of the dimension of the Key vectors; this keeps the dot products from growing so large that softmax is pushed into regions with vanishingly small gradients.

Step 4: Apply the softmax function to the scaled scores to get the attention weights, ensuring they are positive and sum up to 1.

Step 5: Multiply each V vector by its corresponding attention weight, preparing them to be summed.

Step 6: Sum the weighted V vectors to get the output for the current word, which then passes to the next layer in the network.
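
Putting steps 1 through 6 together, here is a minimal, self-contained NumPy sketch of single-head self-attention; the projection matrices are randomly initialized stand-ins for what a trained model would learn:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, d_k = 4, 16, 8        # toy sizes for illustration
x = np.random.randn(seq_len, d_model)   # embeddings (plus positional encoding) for 4 words

# Step 1: create Query, Key, and Value vectors for every word
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # each is (seq_len, d_k)

# Step 2: attention scores via dot products between Queries and Keys
scores = Q @ K.T                        # (seq_len, seq_len)

# Step 3: scale by the square root of the Key dimension
scores = scores / np.sqrt(d_k)

# Step 4: softmax turns the scores into weights that sum to 1 per word
weights = softmax(scores)

# Steps 5 and 6: weight the Value vectors and sum them per word
output = weights @ V                    # (seq_len, d_k), passed to the next layer
```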

Multi-Head Attention

The self-attention process is split into multiple “heads”. Each head independently attends to information from a different representation subspace at different positions, and the head outputs are then combined (a minimal sketch follows the list below).

  • Diversified Focus
  • Richer Representations
  • Improved Learning
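
A minimal NumPy sketch of multi-head attention under the same toy assumptions as before: each head gets its own randomly initialized Q/K/V projections, and the concatenated head outputs are mixed by a final output projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

seq_len, d_model, num_heads = 4, 16, 4
d_head = d_model // num_heads            # each head works in a smaller subspace
x = np.random.randn(seq_len, d_model)

# One set of Q/K/V projections per head (randomly initialized stand-ins)
heads = []
for _ in range(num_heads):
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))

# Concatenate the heads and mix them with a final output projection
W_o = np.random.randn(num_heads * d_head, d_model)
multi_head_output = np.concatenate(heads, axis=-1) @ W_o   # (seq_len, d_model)
```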
