Paper·arxiv.org
transformerarchitecturefoundationalattentionnlpllm
Attention Is All You Need
Learn the foundational Transformer architecture introduced in "Attention Is All You Need." This pack distills the core concept of self-attention, enabling you to grasp how modern LLMs process sequences efficiently without recurrence or convolutions.
intermediate30 min6 steps
The play
- Understand Self-Attention's RoleRecognize that self-attention allows a model to weigh the importance of different words in an input sequence when processing each word, establishing relationships within the sequence itself.
- Define Query, Key, ValueConceptualize Query (Q), Key (K), and Value (V) vectors. Q asks 'what am I looking for?', K answers 'what do I have?', and V provides 'what information do I give if matched?'.
- Calculate Raw Attention ScoresCompute the dot product between the Query and Key matrices (Q * K^T). This yields a matrix where each entry indicates the 'compatibility' or 'relevance' between a query item and a key item.
- Scale and Normalize ScoresDivide the raw scores by the square root of the key's dimension (dk) to stabilize gradients, then apply a softmax function row-wise to obtain attention weights, ensuring they sum to 1.
- Compute Weighted SumMultiply the attention weights matrix by the Value matrix. Each row in the output represents a weighted sum of the Value vectors, with weights determined by the attention scores.
- Implement Scaled Dot-Product AttentionWrite a Python function using NumPy to perform the scaled dot-product attention mechanism from scratch.
Starter code
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Computes scaled dot-product attention.
Args:
Q (np.ndarray): Query matrix (batch_size, seq_len_q, d_k).
K (np.ndarray): Key matrix (batch_size, seq_len_k, d_k).
V (np.ndarray): Value matrix (batch_size, seq_len_v, d_v).
mask (np.ndarray, optional): Mask for attention scores. Defaults to None.
Returns:
np.ndarray: Output matrix (batch_size, seq_len_q, d_v).
np.ndarray: Attention weights (batch_size, seq_len_q, seq_len_k).
"""
d_k = Q.shape[-1]
scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
if mask is not None:
scores = scores + (mask * -1e9) # Apply mask by setting masked values to a very small number
attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
output = np.matmul(attention_weights, V)
return output, attention_weights
# Example usage:
# batch_size, seq_len, d_model = 1, 3, 4
# Q = np.random.rand(batch_size, seq_len, d_model)
# K = np.random.rand(batch_size, seq_len, d_model)
# V = np.random.rand(batch_size, seq_len, d_model)
# output, weights = scaled_dot_product_attention(Q, K, V)
# print("Output shape:", output.shape)
# print("Weights shape:", weights.shape)Source