What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

This research investigates the internal mechanisms of representation steering in LLMs, explaining how steering vectors produce behaviors such as refusal. The aim is a mechanistic understanding that supports more precise LLM alignment and control, rather than treating steering as a black box.

advanced · 1-2 hours · 6 steps

The play

Grasp Steering Vector Theory
Understand the foundational concept of representation steering: how specific vectors are designed to alter an LLM's internal representations to influence its output behavior.
Identify Target Behaviors for Steering
Select a specific LLM behavior (e.g., refusal, helpfulness, tone, bias mitigation) that you aim to influence or understand mechanistically through steering.
Set Up an LLM Environment with Hooking Capabilities
Prepare an LLM instance (e.g., using Hugging Face Transformers) where you can access and modify intermediate layer activations. Use PyTorch forward hooks (e.g., `register_forward_hook`, called on the specific layer module whose output you want to read or change) to inspect activations and inject vectors.
Generate or Acquire Steering Vectors
Obtain or create steering vectors relevant to your target behavior. These vectors are typically derived from contrastive prompting, fine-tuning, or specific interpretability techniques.
Inject Steering Vectors and Observe Internal Changes
Apply the generated steering vector by adding it to the hidden states of specific intermediate layers within the LLM. Monitor and analyze how these injections alter subsequent layer activations using your hooking setup.
Evaluate Behavioral Impact
Run a set of test prompts through the steered LLM and quantitatively assess how the injected vector changes the model's output behavior, specifically regarding your identified target (e.g., measure refusal rates, sentiment scores).
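
The hooking and injection steps above can be sketched on a toy network. This is a minimal sketch, not the paper's setup: the small `nn.Sequential` model stands in for an LLM's layer stack, and `steering_vector`, `alpha`, and the choice of layers are hypothetical placeholders. It uses PyTorch's `register_forward_hook` to add a vector to one layer's output and to capture a later layer's output, so the downstream effect of the injection can be measured.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a stack of transformer blocks (assumption: not a real LLM).
hidden_size = 16
model = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, hidden_size),
)
model.eval()

# Placeholder steering vector; in practice it would be derived from data.
steering_vector = torch.randn(hidden_size)
alpha = 4.0  # hypothetical steering strength

captured = {}

def make_capture_hook(name):
    # Stores a copy of the hooked module's output under `name`.
    def hook(module, inputs, output):
        captured[name] = output.detach().clone()
    return hook

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * steering_vector

x = torch.randn(1, 8, hidden_size)  # (batch_size, sequence_length, hidden_size)

# Capture a later layer's output on every forward pass.
capture_handle = model[2].register_forward_hook(make_capture_hook("layer2"))

# Baseline pass: no steering.
with torch.no_grad():
    baseline_out = model(x)
baseline_act = captured["layer2"]

# Steered pass: inject at an earlier layer, observe the later layer.
steer_handle = model[0].register_forward_hook(steering_hook)
with torch.no_grad():
    steered_out = model(x)
steered_act = captured["layer2"]

# Remove hooks so the model returns to its unmodified behavior.
steer_handle.remove()
capture_handle.remove()

downstream_shift = (steered_act - baseline_act).norm().item()
print(f"Downstream activation shift (L2 norm): {downstream_shift:.4f}")
```

With a Hugging Face model the same pattern would apply to a chosen transformer block, though block outputs there are often tuples rather than a single tensor, in which case the hook would need to modify the hidden-state element and return the rebuilt tuple.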

Starter code

import torch

# This snippet conceptually demonstrates applying a steering vector.
# In a real scenario, 'model_hidden_state' would come from an LLM layer output,
# and 'steering_vector' would be a pre-calculated vector for a specific behavior.

# Assume a dummy hidden state from an LLM layer (e.g., a transformer block output)
# Shape: (batch_size, sequence_length, hidden_size)
model_hidden_state = torch.randn(1, 128, 768) 

# A hypothetical steering vector, designed to influence a behavior (e.g., to increase 'refusal')
# This vector should have the same hidden_size as the layer it's applied to.
# For simplicity, we'll broadcast it across sequence_length.
steering_vector = torch.randn(1, 1, 768) * 0.8 # Scale factor (0.8) adjusts steering strength

# Apply the steering vector by adding it to the model's internal representation
# This modified state would then be passed to the next layer of the LLM.
steered_hidden_state = model_hidden_state + steering_vector

print("Original Hidden State (first 5 values of first token):", model_hidden_state[0, 0, :5])
print("Steering Vector (first 5 values):", steering_vector[0, 0, :5])
print("Steered Hidden State (first 5 values of first token):", steered_hidden_state[0, 0, :5])
print("\nConceptual: This 'steered_hidden_state' would then feed into the next LLM layer.")
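
The snippet above uses a random steering vector. One common way to obtain a meaningful one from contrastive prompts is a difference of means over layer activations; whether the paper uses exactly this method is not stated here, so treat it as an illustrative assumption. In this sketch, `pos_acts` and `neg_acts` are synthetic stand-ins for activations collected on prompts that do and do not elicit the target behavior, and `true_direction` and `alpha` are hypothetical.

```python
import torch

torch.manual_seed(0)
hidden_size = 768

# Synthetic stand-ins for activations at one layer (assumption: real activations
# would be collected from an LLM on two contrasting prompt sets).
true_direction = torch.zeros(hidden_size)
true_direction[0] = 1.0
neg_acts = torch.randn(200, hidden_size)                         # behavior absent
pos_acts = torch.randn(200, hidden_size) + 3.0 * true_direction  # behavior present

# Difference of means between the two sets, normalized to unit length.
diff_of_means = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
unit_direction = diff_of_means / diff_of_means.norm()

# Scale and reshape so the vector broadcasts over batch and sequence dimensions.
alpha = 0.8  # hypothetical steering strength
steering_vector = (alpha * unit_direction).view(1, 1, hidden_size)

model_hidden_state = torch.randn(1, 128, hidden_size)
steered_hidden_state = model_hidden_state + steering_vector

# Mean projection onto the steering direction before and after the addition.
proj_before = (model_hidden_state @ unit_direction).mean().item()
proj_after = (steered_hidden_state @ unit_direction).mean().item()
print(f"Mean projection before: {proj_before:.4f}")
print(f"Mean projection after:  {proj_after:.4f}")
```

Because the direction has unit norm, adding `alpha` times it raises every token's projection onto that direction by `alpha`, which gives a simple check that the injection did what was intended.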

Source

Paper: arxiv.org