Everything filed under interpretability



Understanding the Parameter Decomposition papers

Understanding attribution-based and stochastic parameter decomposition methods.

What's different about a Matryoshka SAE?

Brief notes on the Matryoshka SAEs paper.

10 Autoencoders in a Trenchcoat, part 1

Notes on the core sections of Anthropic's Toy Models of Superposition.

Notes on "A Mathematical Framework for Transformer Circuits"

Close-reading a classic interpretability paper and trying to make sense of it.