Todd et al., "In-Context Algebra" (2025)

2025-12-18

https://arxiv.org/abs/2512.16902
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

In-context learning, Mechanistic interpretability

From David Bau’s group. Accepted at ICLR 2026. Contrasts with prior arithmetic interpretability work (Grokking, Power et al. 2022), where tokens have fixed meanings and models learn geometric/Fourier representations. Here, without fixed semantics, symbolic/relational mechanisms emerge instead.

Setup

Small transformer (4 layers, 8 heads, 1024 hidden dim, 18-token vocab) trained from scratch on synthetic finite group algebra. Each training sequence has ~200 facts of the form A B = C drawn from a randomly sampled group (cyclic C₃–C₁₀, dihedral D₃–D₅), with a fresh random mapping from tokens to group elements per sequence. The model learns via next-token prediction—each sequence is already an in-context learning problem where the model must infer token meanings from surrounding facts. Test sequences use new random mappings and, in some experiments, entirely unseen groups.
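To make the data format concrete, here is a minimal sketch of how one such sequence could be generated. It is an illustration under assumptions, not the paper's actual pipeline: the function names, the 16-letter `VOCAB`, the uniform sampling of operands, and the tuple representation of a fact are all made up for this example.

```python
import random

# Hypothetical variable tokens; the paper's actual vocabulary is not reproduced here.
VOCAB = [chr(ord("a") + i) for i in range(16)]

def cyclic_table(n):
    """Cayley table of C_n: elements 0..n-1 under addition mod n."""
    return {(a, b): (a + b) % n for a in range(n) for b in range(n)}

def dihedral_table(n):
    """Cayley table of D_n (order 2n). Element (k, f) stands for r^k s^f."""
    elems = [(k, f) for k in range(n) for f in (0, 1)]

    def mul(x, y):
        (k1, f1), (k2, f2) = x, y
        # s r^k = r^(-k) s, so a flip on the left negates the next rotation.
        k = (k1 + (k2 if f1 == 0 else -k2)) % n
        return (k, f1 ^ f2)

    return {(x, y): mul(x, y) for x in elems for y in elems}

def sample_sequence(table, vocab, n_facts, rng):
    """Draw n_facts facts (A, B, C) meaning `A B = C` under a fresh token mapping."""
    elems = sorted({a for a, _ in table})
    tokens = rng.sample(vocab, len(elems))  # new random token -> element assignment
    name = dict(zip(elems, tokens))
    facts = []
    for _ in range(n_facts):
        a, b = rng.choice(elems), rng.choice(elems)
        facts.append((name[a], name[b], name[table[(a, b)]]))
    return facts

rng = random.Random(0)
facts = sample_sequence(dihedral_table(3), VOCAB, 200, rng)
```

Because the mapping is resampled on every call, a token such as `a` carries no meaning across sequences; only the facts inside one sequence constrain it.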

Main findings

Five interpretable mechanisms explain ~90% of model behavior:

Verbatim copying (67.9%): a single attention head (L3H6) retrieves a previously seen identical fact
Commutative copying (12.1%): the same head exploits a·b = b·a when no exact match is available
Identity recognition (4.2%): promotes both query variables, then demotes the identity element
Associativity (3.6%): reasons over compositions of known facts; the last mechanism to develop
Closure-based cancellation (2.7%): tracks group membership to eliminate invalid answers

These emerge in distinct phase transitions during training: structural tokens → closure → copying → elimination → associativity. Later mechanisms build on earlier ones—e.g., identity recognition’s “demotion” reuses earlier “promotion” circuits.
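The copying, identity, and cancellation mechanisms listed above can be read as symbolic rules over the facts in context. The sketch below is a hand-written stand-in for those rules, not the model's circuits: `predict` and its fact format are hypothetical, associativity is omitted, and the order in which the rules fire is an assumption. It also treats distinct tokens as distinct group elements.

```python
def predict(context, a, b):
    """Candidate answers for the query `a b = ?` given context facts (x, y, z)."""
    table = {(x, y): z for x, y, z in context}

    # Verbatim copying: the exact fact already appears in context.
    if (a, b) in table:
        return {table[(a, b)]}

    # Commutative copying: reuse b a = c (only sound in a commutative group).
    if (b, a) in table:
        return {table[(b, a)]}

    # Identity recognition: x y = y implies x is the identity; x y = x implies y is.
    identities = {x for x, y, z in context if z == y}
    identities |= {y for x, y, z in context if z == x}
    if a in identities:
        return {b}
    if b in identities:
        return {a}

    # Closure: the answer must be a token already seen in this sequence.
    candidates = {t for fact in context for t in fact}

    # Cancellation: a y = z with y != b, or x b = z with x != a, rules out z,
    # because multiplication by a fixed element is a bijection in a group.
    for (x, y), z in table.items():
        if (x == a and y != b) or (y == b and x != a):
            candidates.discard(z)
    return candidates

# C_3 with tokens x = e, y = g, z = g^2.
context = [("x", "y", "y"), ("y", "y", "z"), ("y", "z", "x")]
print(predict(context, "z", "y"))          # commutative copy of `y z = x`
print(sorted(predict(context, "z", "z")))  # cancellation leaves two candidates
```

In the last query the true answer is `y` (g² · g² = g), which remains among the surviving candidates: cancellation narrows the answer set rather than pinning it down.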

Key contrast with Grokking: without fixed token meanings, the model cannot use Fourier/periodic representations. Instead it develops symbolic/relational strategies (pattern matching, cancellation laws).

Generalizes to unseen groups (near-perfect on held-out order-8 groups) with partial success on non-group algebraic structures.
