Attention Is All You Need

Learn the foundational Transformer architecture introduced in "Attention Is All You Need." This pack distills the core concept of self-attention, enabling you to grasp how modern LLMs process sequences efficiently without recurrence or convolutions.

intermediate · 30 min · 6 steps

The play

Understand Self-Attention's Role
Recognize that self-attention allows a model to weigh the importance of different words in an input sequence when processing each word, establishing relationships within the sequence itself.
Define Query, Key, Value
Conceptualize Query (Q), Key (K), and Value (V) vectors. Q asks 'what am I looking for?', K answers 'what do I have?', and V provides 'what information do I give if matched?'.
Calculate Raw Attention Scores
Compute the matrix product of the Query matrix with the transpose of the Key matrix (QK^T). Entry (i, j) of the result is the dot product of query i with key j, and indicates the 'compatibility' or 'relevance' between that query item and that key item.
Scale and Normalize Scores
Divide the raw scores by the square root of the key dimension (d_k). Without this scaling, large dot products push the softmax into regions where its gradients are extremely small. Then apply a softmax function row-wise to obtain attention weights, so that each row sums to 1.
Compute Weighted Sum
Multiply the attention weights matrix by the Value matrix. Each row in the output represents a weighted sum of the Value vectors, with weights determined by the attention scores.
Implement Scaled Dot-Product Attention
Write a Python function using NumPy to perform the scaled dot-product attention mechanism from scratch.
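
Steps 3 to 5 above can be traced on a tiny input before writing the full function. The following is a minimal sketch, not the paper's implementation: the toy values are arbitrary, and Q is reused as K only to keep the arithmetic easy to check by hand (a real model would derive Q, K, and V from learned projections).

```python
import numpy as np

# Toy self-attention input: 3 tokens with d_k = 4 and d_v = 2.
# These numbers are illustrative assumptions, not taken from the paper.
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
K = Q.copy()
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

d_k = Q.shape[-1]

# Step 3: raw scores. Entry (i, j) is the dot product of query i with key j.
raw_scores = Q @ K.T
# raw_scores is [[2, 0, 1], [0, 2, 1], [1, 1, 2]]

# Step 4a: scale by sqrt(d_k), which is 2 here.
scaled_scores = raw_scores / np.sqrt(d_k)

# Step 4b: row-wise softmax (the row max is subtracted first for numerical stability).
exp_scores = np.exp(scaled_scores - scaled_scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Step 5: each output row is a weighted sum of the rows of V.
output = weights @ V

print("raw scores:\n", raw_scores)
print("weights (each row sums to 1):\n", weights.round(3))
print("output:\n", output.round(3))
```

Because each token's query matches its own key best in this toy input, the largest weight in every row falls on the diagonal.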

Starter code

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Computes scaled dot-product attention.

    Args:
        Q (np.ndarray): Query matrix (batch_size, seq_len_q, d_k).
        K (np.ndarray): Key matrix (batch_size, seq_len_k, d_k).
        V (np.ndarray): Value matrix (batch_size, seq_len_k, d_v); one value row per key.
        mask (np.ndarray, optional): Broadcastable to (batch_size, seq_len_q, seq_len_k);
            entries equal to 1 are masked out. Defaults to None.

    Returns:
        np.ndarray: Output matrix (batch_size, seq_len_q, d_v).
        np.ndarray: Attention weights (batch_size, seq_len_q, seq_len_k).
    """
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)

    if mask is not None:
        # Masked positions (mask == 1) receive a large negative score, so softmax drives their weight to ~0
        scores = scores + (mask * -1e9)

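    # Row-wise softmax; subtracting the row max avoids overflow and leaves the result unchanged
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)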
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example usage:
# batch_size, seq_len, d_model = 1, 3, 4
# Q = np.random.rand(batch_size, seq_len, d_model)
# K = np.random.rand(batch_size, seq_len, d_model)
# V = np.random.rand(batch_size, seq_len, d_model)
# output, weights = scaled_dot_product_attention(Q, K, V)
# print("Output shape:", output.shape)
# print("Weights shape:", weights.shape)
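
The `mask` argument is not exercised by the example usage, so here is a hedged sketch of one common use: a causal mask that stops each position from attending to later positions. It assumes the same convention as the starter code (mask entries equal to 1 are blocked by adding -1e9 to their scores). `masked_attention` is a hypothetical, condensed restatement of the starter function, included only so the snippet runs on its own.

```python
import numpy as np

def masked_attention(Q, K, V, mask=None):
    """Condensed restatement of the starter function (hypothetical name)."""
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9  # same convention: 1 means masked out
    # Numerically stable row-wise softmax
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
batch_size, seq_len, d_model = 1, 4, 8
Q = rng.random((batch_size, seq_len, d_model))
K = rng.random((batch_size, seq_len, d_model))
V = rng.random((batch_size, seq_len, d_model))

# Causal mask: 1 above the diagonal (future positions), 0 elsewhere.
# Its shape (seq_len, seq_len) broadcasts across the batch dimension.
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

output, weights = masked_attention(Q, K, V, mask=causal_mask)

print("Output shape:", output.shape)
print("Weights for the first batch item:\n", weights[0].round(3))
```

The printed weight matrix should be lower-triangular: the first position can only attend to itself, so its output row equals the first row of V.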

Source

Paper · arxiv.org