Todd et al., "In-Context Algebra" (2025)
2025-12-18
- https://arxiv.org/abs/2512.16902
- Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau
In-context learning, Mechanistic interpretability
From David Bau’s group. Accepted at ICLR 2026. Contrasts with prior arithmetic interpretability work (Grokking, Power et al. 2022), where tokens have fixed meanings and models learn geometric/Fourier representations. Here, without fixed semantics, symbolic/relational mechanisms emerge instead.
Setup
Small transformer (4 layers, 8 heads, 1024 hidden dim, 18-token vocab) trained from scratch on synthetic finite group algebra. Each training sequence has ~200 facts of the form A B = C drawn from a randomly sampled group (cyclic C₃–C₁₀, dihedral D₃–D₅), with a fresh random mapping from tokens to group elements per sequence. The model learns via next-token prediction—each sequence is already an in-context learning problem where the model must infer token meanings from surrounding facts. Test sequences use new random mappings and, in some experiments, entirely unseen groups.
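A minimal sketch of how one such sequence could be generated. The 16-letter variable vocabulary (`VOCAB`), the `" , "` separator between facts, and the helper names (`cyclic_table`, `dihedral_table`, `sample_sequence`) are assumptions for illustration, not the paper's actual tokenization or code.

```python
import random

VOCAB = list("abcdefghijklmnop")  # hypothetical variable tokens (assumption)

def cyclic_table(n):
    """Cayley table of the cyclic group C_n (addition mod n)."""
    return [[(a + b) % n for b in range(n)] for a in range(n)]

def dihedral_table(n):
    """Cayley table of the dihedral group D_n (order 2n).
    Index k < n is the rotation r^k; index k >= n is the reflection r^(k-n) s."""
    def mul(x, y):
        i, f = x % n, x // n  # rotation exponent, reflection flag of x
        j, g = y % n, y // n  # same for y
        # (r^i s^f)(r^j s^g) = r^(i + j) s^g if f == 0, else r^(i - j) s^(1 + g)
        k = (i + (j if f == 0 else -j)) % n
        return k + n * (f ^ g)
    size = 2 * n
    return [[mul(x, y) for y in range(size)] for x in range(size)]

def sample_sequence(table, n_facts, rng):
    """Draw n_facts facts 'A B = C' under a fresh token -> element mapping."""
    order = len(table)
    tokens = rng.sample(VOCAB, order)  # tokens[k] names group element k
    facts = []
    for _ in range(n_facts):
        x, y = rng.randrange(order), rng.randrange(order)
        facts.append((tokens[x], tokens[y], tokens[table[x][y]]))
    return facts, tokens

rng = random.Random(0)
facts, tokens = sample_sequence(cyclic_table(5), 200, rng)
text = " , ".join(f"{a} {b} = {c}" for a, b, c in facts)
```

Because `tokens` is resampled for every sequence, a token such as `a` carries no meaning across sequences; whatever it denotes has to be inferred from the facts in the current context.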
Main findings
Five interpretable mechanisms explain ~90% of model behavior:
- Verbatim copying (67.9%)—a single attention head (L3H6) retrieves previously seen identical facts
- Commutative copying (12.1%)—same head exploits a·b = b·a when exact match unavailable
- Identity recognition (4.2%)—promotes both query variables, then demotes the identity element
- Associativity (3.6%)—composed fact reasoning, develops last
- Closure-based cancellation (2.7%)—tracks group membership to eliminate invalid answers
These emerge in distinct phase transitions during training: structural tokens → closure → copying → elimination → associativity. Later mechanisms build on earlier ones—e.g., identity recognition’s “demotion” reuses earlier “promotion” circuits.
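A symbolic analogue of four of the listed mechanisms, as a rough sketch. It is not a description of the model's circuits: the function name `predict`, the ordering of the rules, and the exact cancellation rule are assumptions made for illustration, and associativity is left out.

```python
def predict(context, query):
    """context: list of (a, b, c) facts meaning 'a b = c'; query: (a, b).
    Returns (rule_name, set_of_candidate_answers)."""
    a, b = query
    seen = {(x, y): z for x, y, z in context}

    # Verbatim copying: the exact fact already appears in context.
    if (a, b) in seen:
        return "verbatim", {seen[(a, b)]}

    # Commutative copying: reuse 'b a = c'. This is a heuristic that is
    # only guaranteed correct in abelian groups (e.g. cyclic ones).
    if (b, a) in seen:
        return "commutative", {seen[(b, a)]}

    # Identity recognition: 'x y = y' implies x is the identity,
    # and 'x y = x' implies y is the identity.
    identities = {x for x, y, z in context if y == z}
    identities |= {y for x, y, z in context if x == z}
    if a in identities or b in identities:
        # Promote both query variables, then demote the identity element.
        return "identity", {b if a in identities else a}

    # Closure-based cancellation: the answer must be a token already in
    # the context, and by the cancellation laws 'a y = c' with y != b
    # (or 'x b = c' with x != a) rules out c as the answer to 'a b'.
    members = {t for fact in context for t in fact}
    ruled_out = {z for x, y, z in context
                 if (x == a and y != b) or (y == b and x != a)}
    return "closure+cancellation", members - ruled_out

# Tokens 0, 1, 2 stand for elements of C_3 (addition mod 3).
context = [(1, 2, 0), (0, 1, 1)]
print(predict(context, (1, 2)))  # -> ('verbatim', {0})
print(predict(context, (2, 1)))  # -> ('commutative', {0})
print(predict(context, (0, 2)))  # -> ('identity', {2})
print(predict(context, (2, 2)))  # -> ('closure+cancellation', {1, 2})
```

In the last query the true answer (2 + 2 = 1 mod 3) is not pinned down by the context; the elimination step only narrows the candidates, which matches the idea of removing invalid answers rather than computing the product directly.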
Key contrast with Grokking: without fixed token meanings, the model cannot use Fourier/periodic representations. Instead it develops symbolic/relational strategies (pattern matching, cancellation laws).
The model generalizes to unseen groups (near-perfect accuracy on held-out order-8 groups), with partial success on non-group algebraic structures.