Paper·arxiv.org
llm · research · fine-tuning · evaluation · ai-agents · security · context-engineering
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
This research investigates the internal mechanisms of representation steering in LLMs, explaining how steering vectors lead to behaviors like refusal. It aims to provide a mechanistic understanding for more precise LLM alignment and control, moving beyond black-box applications.
advanced · 1-2 hours · 6 steps
The play
- Grasp Steering Vector Theory: Understand the foundational concept of representation steering: a steering vector is a direction in activation space that, when added to an LLM's internal representations, shifts its output behavior.
- Identify Target Behaviors for Steering: Select a specific LLM behavior (e.g., refusal, helpfulness, tone, bias mitigation) that you aim to influence or understand mechanistically through steering.
- Set Up an LLM Environment with Hooking Capabilities: Prepare an LLM instance (e.g., using Hugging Face Transformers) where you can access and modify intermediate layer activations. Use PyTorch module hooks (e.g., `register_forward_hook` on the layer you want to steer) to inspect activations and inject vectors.
- Generate or Acquire Steering Vectors: Obtain or create steering vectors relevant to your target behavior. These vectors are typically derived from contrastive prompting (for example, the difference in mean activations between prompts that elicit the behavior and prompts that do not), fine-tuning, or specific interpretability techniques.
- Inject Steering Vectors and Observe Internal Changes: Apply the steering vector by adding it to the hidden states of specific intermediate layers within the LLM. Use your hooking setup to monitor how the injection alters the activations of subsequent layers.
- Evaluate Behavioral Impact: Run a set of test prompts through both the unsteered and the steered LLM and quantitatively compare the outputs on your target behavior (e.g., refusal rates, sentiment scores).
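Steps 3 to 5 of the play can be sketched end to end on a toy network. The example below is a minimal illustration, not the paper's method: the small `nn.Sequential` model stands in for a stack of transformer blocks, the random `positive_inputs` and `negative_inputs` stand in for embedded contrastive prompts, and the helper names (`capture_output`, `make_steering_hook`) and the `alpha` strength are all assumptions introduced here. The steering vector is computed as a difference of mean activations, which is one common choice among several.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 16  # toy hidden size (assumption; real LLMs are much wider)

# Toy stand-in for a stack of transformer blocks (assumption: not a real LLM).
model = nn.Sequential(
    nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
    nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
    nn.Linear(HIDDEN, HIDDEN),
)
model.eval()
target_layer = model[2]  # the layer whose output we read and steer


def capture_output(layer, inputs):
    """Run the model once and return `layer`'s output, read via a forward hook."""
    captured = {}

    def hook(module, args, output):
        captured["act"] = output.detach()  # returning None leaves the output unchanged

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        handle.remove()  # always detach the hook, even on error
    return captured["act"]


# Hypothetical contrastive inputs, shape (batch, sequence, hidden).
positive_inputs = torch.randn(8, 4, HIDDEN) + 0.5
negative_inputs = torch.randn(8, 4, HIDDEN) - 0.5

# Difference of mean activations over batch and sequence positions.
pos_mean = capture_output(target_layer, positive_inputs).mean(dim=(0, 1))
neg_mean = capture_output(target_layer, negative_inputs).mean(dim=(0, 1))
steering_vector = pos_mean - neg_mean  # shape (HIDDEN,)


def make_steering_hook(vector, alpha):
    """Build a forward hook that adds alpha * vector to the layer output."""
    def hook(module, args, output):
        return output + alpha * vector  # a returned value replaces the output
    return hook


test_inputs = torch.randn(2, 4, HIDDEN)
with torch.no_grad():
    baseline = model(test_inputs)

handle = target_layer.register_forward_hook(make_steering_hook(steering_vector, alpha=4.0))
try:
    with torch.no_grad():
        steered = model(test_inputs)
finally:
    handle.remove()

with torch.no_grad():
    restored = model(test_inputs)  # hook removed, so this matches the baseline

shift = (steered - baseline).norm().item()
print(f"Steering vector norm: {steering_vector.norm().item():.3f}")
print(f"Final-output shift caused by steering: {shift:.3f}")
```

With a real Hugging Face model, the same pattern would be applied to one of the decoder blocks; those blocks may return a tuple rather than a bare tensor, in which case the hook would need to modify the hidden-state element and return the rebuilt tuple.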
Starter code
import torch
# This snippet conceptually demonstrates applying a steering vector.
# In a real scenario, 'model_hidden_state' would come from an LLM layer output,
# and 'steering_vector' would be a pre-calculated vector for a specific behavior.
# Assume a dummy hidden state from an LLM layer (e.g., a transformer block output)
# Shape: (batch_size, sequence_length, hidden_size)
model_hidden_state = torch.randn(1, 128, 768)
# A hypothetical steering vector, designed to influence a behavior (e.g., to increase 'refusal')
# This vector should have the same hidden_size as the layer it's applied to.
# For simplicity, we'll broadcast it across sequence_length.
steering_vector = torch.randn(1, 1, 768) * 0.8 # Scale factor (0.8) adjusts steering strength
# Apply the steering vector by adding it to the model's internal representation
# This modified state would then be passed to the next layer of the LLM.
steered_hidden_state = model_hidden_state + steering_vector
print("Original Hidden State (first 5 values of first token):", model_hidden_state[0, 0, :5])
print("Steering Vector (first 5 values):", steering_vector[0, 0, :5])
print("Steered Hidden State (first 5 values of first token):", steered_hidden_state[0, 0, :5])
print("\nConceptual: This 'steered_hidden_state' would then feed into the next LLM layer.")
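The final step of the play, evaluating behavioral impact, can be sketched with a simple keyword-based refusal check. This is a rough heuristic under stated assumptions rather than the paper's evaluation protocol: the `REFUSAL_MARKERS` list is hypothetical, and the response strings are hand-written placeholders, not real model outputs. In practice they would be replaced by generations from the unsteered and steered model on the same prompts.

```python
# Hypothetical marker phrases; a real evaluation would likely use a
# larger list or a trained classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(text):
    """Return True if the response contains any refusal marker (case-insensitive)."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Placeholder strings standing in for model generations (not real results).
baseline_responses = [
    "Sure, here is a summary of the article.",
    "I'm sorry, but I can't help with that.",
    "The capital of France is Paris.",
    "Here are three ideas for your project.",
]
steered_responses = [
    "I cannot assist with that request.",
    "I'm sorry, but I can't help with that.",
    "The capital of France is Paris.",
    "I won't be able to provide that.",
]

baseline_rate = refusal_rate(baseline_responses)
steered_rate = refusal_rate(steered_responses)
print(f"Baseline refusal rate: {baseline_rate:.2f}")
print(f"Steered refusal rate:  {steered_rate:.2f}")
print(f"Change from steering:  {steered_rate - baseline_rate:+.2f}")
```

Keyword matching is cheap but can miss paraphrased refusals and flag polite non-refusals, so it is best treated as a first-pass signal and spot-checked by reading a sample of outputs.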