Everything filed under interpretability



Gradient-Diff Steering for Behavior Editing in Small LMs

A very early research update on two experiments that use gradient- and weight-based methods to localize, within the diff, behaviors acquired by finetuning.

Understanding the Parameter Decomposition papers

Understanding attribution-based and stochastic parameter decomposition methods.

What's different about a Matryoshka SAE?

Brief notes from the Matryoshka SAEs paper.

10 Autoencoders in a Trenchcoat, part 1

Notes on the core sections of Anthropic's Toy Models of Superposition.

Notes on "A Mathematical Framework for Transformer Circuits"

Close-reading a classic interpretability paper and trying to make sense of it.