Transformer
Encoder & Decoder stacks + Self Attention
Why Transformer
- Parallel Processing of Words (vs. RNNs, which must process tokens one at a time)
- Capturing Long-Range Dependencies
- Scalability and Performance
- Improved Contextual Representation (Self Attention)
- Facilitation of Transfer Learning
Building Blocks
Core Components
- Input Embedding: Converts each word into a numerical vector
- Positional Encoding: Adds word-order information to the embeddings (sketched below)
- Multi-Head Self-Attention: Lets the model weigh how relevant each part of the input is to every other part, for context
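As a minimal sketch of the first two components, here is the sinusoidal positional encoding from the original Transformer paper added on top of embeddings, using plain NumPy (the embedding values are random stand-ins and the dimensions are toy sizes chosen for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dims: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dims: cosine
    return encoding

# Toy example: 4 words, 8-dimensional embeddings (random stand-ins)
embeddings = np.random.randn(4, 8)
inputs = embeddings + sinusoidal_positional_encoding(4, 8)    # position-aware input
```

Because the encoding is simply added to the embeddings, the same word gets a slightly different input vector at each position, which is how the model distinguishes word order.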
Encoder and Decoder Layers
- Encoders: Process the input text
- Decoders: Generate the output sequence
- Connected through cross-attention: each decoder layer attends to the encoder's output
Output Generation
- Linear Layer: Projects each decoder output vector to a vector of logits, one per vocabulary token
- Softmax Layer: Normalizes the logits into a probability distribution over the vocabulary, from which the next token is chosen
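A minimal sketch of this final step, with a random matrix standing in for the learned projection and toy sizes chosen for illustration:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 8, 100                  # toy sizes (real vocabularies are much larger)
W = np.random.randn(d_model, vocab_size)      # learned projection (random stand-in)
b = np.zeros(vocab_size)

decoder_output = np.random.randn(d_model)     # hidden state for the current position
logits = decoder_output @ W + b               # linear layer: one score per vocabulary token
probs = softmax(logits)                       # probabilities that sum to 1
next_token = probs.argmax()                   # greedy decoding picks the most likely token
```

In practice the next token can also be sampled from the distribution rather than taken greedily.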
LLM Emerging Properties
- Contextual Understanding
- Zero-Shot Learning
- Complex Problem-Solving
- Artistic Creation
- Technical Coding
The Attention Mechanism
- Weighted Importance: Determines the relevance of each part of the input to the task at hand
- Contextual Awareness: Helps in understanding the context by considering the entire input sequence
- Sequence Learning: Enables the capture of relations and dependencies between elements in the sequence, regardless of their position
Self-attention is a mechanism in a neural network that lets each part of the input, such as each word in a sentence, look at every other part of the sentence to better understand its context.
How Does Self-Attention Work?
Step 1: For each word, create three vectors from the encoder's input (the word's embedding):
- Query vector (Q): what this word is looking for in the other words
- Key vector (K): what this word offers for other words to match against
- Value vector (V): the actual content this word contributes
Step 2: Compute attention scores, which determine how much each word in the sentence should focus on every other word. Each score is the dot product of one word's Query vector with another word's Key vector.
Step 3: Scale the attention scores down by dividing them by the square root of the Key-vector dimension (√d_k), which keeps the dot products from growing too large.
Step 4: Apply the softmax function to the scaled scores to get the attention weights, ensuring they are positive and sum up to 1.
Step 5: Multiply each V vector by its corresponding attention weight, preparing them to be summed.
Step 6: Sum the weighted V vectors to get the output for the current word, which then passes to the next layer in the network.
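The six steps above map almost line-for-line onto code. Below is a minimal NumPy sketch of scaled dot-product self-attention; the projection matrices are random stand-ins for what would be learned parameters, and the tiny dimensions are toy values chosen for illustration:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, d_k = 4, 8, 8               # toy dimensions
X = np.random.randn(seq_len, d_model)         # input embeddings, one row per word

# Step 1: project the input into Q, K, V with learned matrices (random stand-ins here)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T                              # Step 2: dot-product attention scores
scores = scores / np.sqrt(d_k)                # Step 3: scale by sqrt(d_k)
weights = softmax(scores)                     # Step 4: each row is positive, sums to 1

# Steps 5-6: weight each V vector and sum; one output row per word
output = weights @ V                          # (seq_len, d_k)
```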
Multi-Head Attention
The self-attention process is split into multiple “heads”. Each head independently attends to information from a different representation subspace, at different positions. This yields (see the sketch after this list):
- Diversified Focus
- Richer Representations
- Improved Learning
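Building on the single-head sketch above, the following minimal sketch illustrates the idea: each head applies its own projections (random stand-ins here) into a smaller subspace, the heads run independently, and their outputs are concatenated and projected back to the model dimension:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single scaled dot-product attention head (see the sketch above)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)       # softmax over each row
    return weights @ V

seq_len, d_model, num_heads = 4, 8, 2                        # toy sizes
d_head = d_model // num_heads                                # each head gets a smaller subspace
X = np.random.randn(seq_len, d_model)

# Run each head with its own projections (random stand-ins for learned weights)
heads = []
for _ in range(num_heads):
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(attention(X, W_q, W_k, W_v))

concat = np.concatenate(heads, axis=-1)                      # (seq_len, d_model)
W_o = np.random.randn(d_model, d_model)                      # final output projection
output = concat @ W_o
```

Because each head attends in its own subspace, one head can track, say, syntactic relations while another tracks positional proximity, and the final projection mixes these views together.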