//pragmatic leaders

Lesson 2.1: Transformer Architecture Deep Dive

Section A, Course 2: LLM Architectures, Ethics, and Governance

Imagine This Scenario

You’re building a chatbot. Asked “Is he banking on the river bank?”, the old version just answers “I don’t know.” You’ve been told transformers can resolve ambiguity, but how?

By the end of this lesson, you’ll understand how transformers process language by “connecting dots” between words and tracking their positions. You’ll also learn why this beats older models.

---

1. Key Concepts Explained

```python
import torch

# Let's say we have 2 words: "bank", "river"
embeddings = torch.tensor([[0.8, -0.2, 0.1],   # "bank"
                           [0.1, 0.7, -0.3]])  # "river"

# Step 1: Create Q, K, V (use learnable weights in reality)
Q = embeddings * 1.2  # "What am I looking for?"
K = embeddings * 0.9  # "What do I know?"
V = embeddings * 1.0

# Step 2: Score every word against every other word.
# Row 0 holds the scores for "bank"; column 1 is its score against "river".
scores = torch.matmul(Q, K.T) / (3**0.5)  # key size (d_k) = 3
# scores ≈ [[ 0.43, -0.06],
#           [-0.06,  0.37]]

# Step 3: Softmax on row for "bank"
weights = torch.softmax(scores[0], dim=-1)  # ≈ [0.62, 0.38]

# Final "bank" vector blends about 62% of its own V with 38% of "river":
output_bank = weights[0] * V[0] + weights[1] * V[1]
print(output_bank)
# [0.62*0.8 + 0.38*0.1, ...] ≈ [0.53, 0.14, -0.05]
```

Why This Matters:

- In a real model, Q, K, and V come from trainable weight matrices that learn relevance patterns (e.g., pronoun resolution, idioms).
- Parallel computation allows processing all words at once.

---
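The walkthrough above computes only the row for "bank". As a minimal sketch of the "all words at once" point, the same three steps can be written in matrix form so that every word's output comes out of a single pass. The helper name `attend_all_tokens` is an assumption made for illustration, and the scalar multipliers again stand in for learned weight matrices.

```python
import math
import torch

def attend_all_tokens(Q, K, V):
    """Scaled dot-product attention for every token in one pass (hypothetical helper)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)        # (n_tokens, n_tokens)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights              # blended vectors, attention weights

embeddings = torch.tensor([[0.8, -0.2, 0.1],   # "bank"
                           [0.1, 0.7, -0.3]])  # "river"

# Scalar multipliers stand in for learned projections, as in the lesson.
Q, K, V = embeddings * 1.2, embeddings * 0.9, embeddings * 1.0

output, weights = attend_all_tokens(Q, K, V)
print(weights)  # row 0: how "bank" splits its attention; row 1: "river"
print(output)   # one context-aware vector per word
```

Row 0 of `weights` is the same distribution the step-by-step version computes for "bank"; row 1 comes at no extra cost, which is the practical meaning of parallel computation here.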

2. Positional Encodings: The Word GPS

- Concept:
  - Without positional encodings, transformers see words as a bag of terms (order doesn’t matter).
  - Encodings add position info (e.g., “dog bites man” ≠ “man bites dog”).
- Analogy:
  - Imagine Netflix adding timestamps to subtitle frames. Even if frames are processed out of order, timestamps restore the sequence.
- Technical Breakdown:

Option 1: Fixed (Sinusoidal) Encodings

- Use math functions (sine for even embedding dimensions, cosine for odd ones) to generate unique position "IDs."
- Example Formula (embedding size 512): \( \text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/512}}\right) \), \( \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/512}}\right) \)
  - \( pos \): Word position (0, 1, 2, ...).
  - \( i \): Index of the sine/cosine dimension pair (0 to 255).
- Intuition:
  - Think of it as assigning latitude/longitude to words. High \( i \) = large geographical regions (broad positions), low \( i \) = street addresses.

Option 2: Learned Position Embeddings

- Treat positions as vocabulary. Learn an embedding for pos=0, pos=1, etc.
- Example:
  - Position 5 → `[0.3, -0.1, 0.9]`.

Code Comparison:

```python

# Fixed Encoding
import math

def get_position_encoding(pos, dim):
    # dim // 2 is the pair index i; 512 is the embedding size
    angle = pos / (10000 ** (2 * (dim // 2) / 512))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

# Learned Embedding (PyTorch)
import torch
import torch.nn as nn

position_embed = nn.Embedding(100, 512)  # 100 positions, 512-dim vectors
positions = torch.tensor([0, 1, 2])      # First 3 words
position_vectors = position_embed(positions)
```

Critical Insight:

- Once positions are encoded, attention can take into account where a word sits (for example, next to which neighbors), not only which words are present.

---
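To make the fixed option more concrete, here is a minimal sketch that builds the whole table of sinusoidal encodings for a few positions. The function name `sinusoidal_table` and the small sizes (4 positions, 8 dimensions instead of 512) are assumptions chosen so the output is easy to read.

```python
import math

def sinusoidal_table(num_positions, d_model):
    """Build a num_positions x d_model table of fixed position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            pair = dim // 2  # the i in the formula
            angle = pos / (10000 ** (2 * pair / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_table(num_positions=4, d_model=8)
for pos, row in enumerate(table):
    print(pos, [round(v, 3) for v in row])
```

Reading the printed rows left to right, the first columns change quickly from one position to the next while the last columns barely move, which matches the street-address versus region intuition. In a model, each row would be added to the word embedding at that position.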

3. Hardware Optimization: The GPU Speed Hack

- Concept:
  - FlashAttention reorganizes the attention computation to minimize reads and writes to GPU memory.
- Analogy:
  - Without FlashAttention: Like a chef running to the pantry (slow main GPU memory) for every ingredient (data chunk).
  - With FlashAttention: The chef stages ingredients on the kitchen counter (fast on-chip GPU memory) one batch at a time → cooks faster.
- Engineer Details:
  - Issue: The attention matrix (N x N) grows quadratically (e.g., 10K tokens = 100M entries).
  - Solution: Tiling (split the matrix into blocks) + recomputing intermediates instead of storing them.
  - Impact: Dao et al. (2022) report a 15% end-to-end training speedup on BERT-large (sequence length 512), a 3x speedup on GPT-2 (sequence length 1K), and attention memory that grows linearly rather than quadratically with sequence length.

---
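The tiling idea can be sketched in plain Python for a single query row: the softmax is accumulated block by block with a running maximum and a running normalizer, so the full row of scores never has to be held at once. This is only an illustration of the arithmetic under simplifying assumptions (scalar values instead of vectors, hypothetical function names); it does not model GPU memory movement or the recomputation used in the backward pass.

```python
import math

def naive_attention_row(scores, values):
    """Softmax over all scores at once, then a weighted sum of the values."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def tiled_attention_row(scores, values, block_size):
    """Same result, but only one block of scores is examined at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # softmax denominator accumulated so far
    running_out = 0.0  # unnormalized weighted sum accumulated so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

demo_scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 4.0]
demo_values = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.25]
print(naive_attention_row(demo_scores, demo_values))
print(tiled_attention_row(demo_scores, demo_values, block_size=3))

# Size of the full attention matrix for the 10K-token example,
# assuming 2 bytes per entry (fp16) for one head of one sequence.
n_tokens = 10_000
entries = n_tokens * n_tokens
print(entries, "entries,", entries * 2 / 1e6, "MB at 2 bytes each")
```

Both functions print the same number up to floating-point rounding, which is why this kind of tiling gives exact attention rather than an approximation.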

4. Example System Design

- Building a Netflix Subtitle Model:
  1. Convert Words to Vectors: Use `BERT` embeddings (pre-trained).
  2. Add Positional Encodings: Fixed for translation (generalized language rules).
  3. Multi-Head Attention: Detect wordplay (e.g., puns in “The trial left him sentenced”).
  4. Optimize with FlashAttention: Deploy on A100 GPUs.

---

5. Quiz: Check Your Clarity

1. Input embeddings represent words as:
   a) Random numbers
   b) Numerical vectors capturing meaning
   c) Single integers
2. Self-attention helps models:
   a) Resolve ambiguous word meanings
   b) Count syllables
3. FlashAttention optimizes:
   a) Training cost and speed
   b) Memory usage
   c) Both a and b

---

6. Homework: Context Detective

Task for Non-Technical Learners:

1. Visit this interactive attention map tool.
2. Input: “The bank is next to the river bank.”
3. Observe which words “bank” attends to.

Task for Engineers:

1. Install PyTorch and build a 2-head attention model for 3-word sentences:

```python
import torch

class SelfAttention(torch.nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.heads = heads
        self.head_dim = embed_size // heads
        # One learnable projection each for queries, keys, and values
        self.Q = torch.nn.Linear(embed_size, embed_size)
        self.K = torch.nn.Linear(embed_size, embed_size)
        self.V = torch.nn.Linear(embed_size, embed_size)

    def forward(self, x):
        Q = self.Q(x)
        K = self.K(x)
        V = self.V(x)
        # Now split into heads and compute attention (optional)
        return Q, K, V

# Test
model = SelfAttention(embed_size=6, heads=2)
inputs = torch.randn(1, 3, 6)  # Batch 1, 3 words, 6-dim embeddings
Q, K, V = model(inputs)
print("Q shape:", Q.shape)  # Should be [1, 3, 6]
```

Reflect:

- How does changing the number of attention heads affect word relationships?
- Could a model without positional encodings understand poetry (e.g., line breaks)?

---
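For the optional step, one possible way to split the projections into heads and attend within each head is sketched below. The helper names `split_heads` and `attend_per_head` are assumptions, random tensors stand in for the outputs of the `SelfAttention` projections, and the final output projection that full multi-head attention applies is left out.

```python
import math
import torch

def split_heads(x, heads):
    """(batch, seq, embed) -> (batch, heads, seq, head_dim)."""
    batch, seq_len, embed_size = x.shape
    head_dim = embed_size // heads
    return x.view(batch, seq_len, heads, head_dim).transpose(1, 2)

def attend_per_head(Q, K, V, heads):
    """Run scaled dot-product attention separately in each head."""
    q, k, v = (split_heads(t, heads) for t in (Q, K, V))
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)  # (batch, heads, seq, seq)
    out = weights @ v                        # (batch, heads, seq, head_dim)
    batch, _, seq_len, head_dim = out.shape
    # Stitch the heads back together into one vector per word.
    out = out.transpose(1, 2).reshape(batch, seq_len, heads * head_dim)
    return out, weights

torch.manual_seed(0)
# Stand-ins for the Q, K, V returned by the homework model.
Q = torch.randn(1, 3, 6)
K = torch.randn(1, 3, 6)
V = torch.randn(1, 3, 6)

out, weights = attend_per_head(Q, K, V, heads=2)
print("output shape:", out.shape)
print("weights shape:", weights.shape)
```

Each head gets its own 3 x 3 attention map over the three words, so comparing `weights[0, 0]` with `weights[0, 1]` is a quick way to explore the first reflection question.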

Key Takeaways

1. Embeddings Encode Meaning: Words are mapped to numerical vectors (e.g., `"bank"` → `[0.8, -0.2, 0.1]`), enabling nuanced semantic understanding beyond literal dictionaries.
2. Self-Attention Solves Ambiguity: By dynamically linking words (e.g., connecting "bank" to "river" or "finance"), transformers resolve context-dependent meanings.
3. Positional Encodings Matter: Without positional data, transformers treat text as a "bag of words." Encodings (fixed or learned) restore sequence logic (e.g., "dog bites man" ≠ "man bites dog").
4. Hardware Optimizations Scale: Techniques like FlashAttention avoid storing the full N x N attention matrix, so attention memory grows linearly rather than quadratically with sequence length, enabling efficient processing of long documents.

---

Notes

- Critical Tools:
  - BERT Embeddings: Pre-trained vectors for initializing input embeddings.
  - FlashAttention: Optimizes attention computation for speed/memory.
  - PyTorch/NN Modules: Build custom attention heads (e.g., `SelfAttention` class).
- Red Flags:
  - Missing positional encodings? Models fail to distinguish word order (e.g., poetry or legal clauses).
  - Poor GPU utilization? Implement tiling (FlashAttention) for long sequences.
  - Using random embeddings? Train or use pre-trained vectors for meaningful representations.

---

Alignment with Curriculum

- Prior Knowledge:
  - Lesson 1.4 (Fine-Tuning/RAG): Transformer architecture underpins RAG’s retrieval and generation steps.
  - Lesson 1.5 (Monitoring): Attention maps help debug model decisions (e.g., bias detection).
- Future Links:
  - Lesson 2.2 (Prompt Engineering): Understanding attention mechanisms improves prompt design.
  - Lesson 2.3 (LLM Scaling): FlashAttention principles extend to optimizing large models.
  - Lesson 5.3 (Hands-On Labs): Debugging transformer models using tools like `bertviz`.

---

What’s Next?

In Lesson 2.2, you’ll explore model families:

- Closed Models: GPT-4, Gemini Ultra.
- Open Models: LLaMA 2, CodeLlama.
- Specialized Models: Med-PaLM 2 for healthcare.

---

References

1. Vaswani, A., et al. (2017). Attention Is All You Need. arXiv.
2. Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv.
3. Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. Hugging Face.

---

Ready to decode transformers like a pro? Let’s rewire your AI intuition!