Everything filed under interpretability

Better gradient attributions from Integrated Gradients to RelP

Explaining integrated gradients and RelP, an alternative method

#notes #technical

Nov 26, 2025

Gradient-Diff Steering for Behavior Editing in Small LMs

A very early research update describing two experiments I've run, using gradient- and weight-based methods to localize finetuning-acquired behaviors within the diff.

#technical

Sep 07, 2025

Understanding the Parameter Decomposition papers

Understanding attribution-based and stochastic parameter decomposition methods

#notes #technical

Jul 06, 2025

What's different about a Matryoshka SAE?

Brief notes from the Matryoshka SAEs paper.

#ml #notes #technical

Jun 30, 2025

10 Autoencoders in a Trenchcoat, part 1

Notes on the core sections of Anthropic's Toy Models of Superposition.

#ml #notes #technical

Jun 25, 2025

Notes on "A Mathematical Framework for Transformer Circuits"

Close-reading a classic interpretability paper and trying to make sense of it

#ml #notes #technical

Jun 14, 2025