Convergent representations would reanimate arguments from the philosophy of mathematics
A snippet connecting a question in philosophy of math to philosophy of AI (interpretability)
Sep 19, 2025 • 553 words • #snippet #philosophy #ai
A snippet is a mini post that I do not develop fully/consider deeply but which I want to get out into the world.
There are arguments that neural networks across domains seem to converge on the same internal representations. Putting aside the accuracy of the convergent linear representation idea, would it mean anything if it were true?
I think if it were, it would connect importantly to a central problem in the philosophy of mathematics (perhaps the central problem?) and the philosophy of science. More specifically, to the philosophy of applied mathematics: if math, with its strange a priori nature, is the language of science, our best1 language for refining descriptions of the workings of the world, and is indispensable for describing it, does that mean that mathematical objects like sets and numbers really exist?
W.V.O. Quine and Hilary Putnam put forward an argument in the affirmative, a derivative of which is articulated by the SEP (good article, you should read the whole thing if interested) as follows:
(P1) We ought to have ontological commitment to all and only the entities that are indispensable to our best scientific theories.
(P2) Mathematical entities are indispensable to our best scientific theories.
(C) We ought to have ontological commitment to mathematical entities.
The Quine-Putnam indispensability argument (which the above approximates) led to all sorts of interesting work, in particular Field’s nominalisation program, in which Field tries to show that you can do physics (in particular, "a large fragment of Newtonian gravitational theory") without quantification over mathematical entities, and that the theories we would be left with would be reasonably attractive.
This is very interesting, but not exactly the point of this post. What I want to consider is the following:
Suppose the convergent linear representation hypothesis is true. Neural networks tend to converge on the same linear representations of the world. Does this mean that they are converging on something 'real', some 'true' description of the world? Are they representations of 'things' that really 'exist'? What if this occurs for really really large multimodal models that are really really intelligent?
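As a rough aside on what 'converging on the same representation' could even mean operationally: one common move in the interpretability literature is to compare two models' activations on the same inputs with a similarity measure that ignores "uninteresting" transformations of the feature space, such as linear CKA (Kornblith et al., 2019). The sketch below is purely illustrative, one possible operationalization rather than anything from the arguments above; the "activations" are random stand-ins, not real model activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment (Kornblith et al., 2019).

    X, Y: activations of two models on the same n inputs,
    shapes (n, d1) and (n, d2). Returns a score in [0, 1];
    higher means the two representations are more alike,
    up to rotations and isotropic scaling of the feature space.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Fake "activations" standing in for two networks run on the same 2048 inputs.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(2048, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # a random rotation of the feature space
acts_b = acts_a @ Q                             # same representation, rotated basis
acts_c = rng.normal(size=(2048, 64))            # an unrelated representation

print(linear_cka(acts_a, acts_b))  # ~1.0: "the same" representation up to rotation
print(linear_cka(acts_a, acts_c))  # small, roughly chance-level: unrelated representations
```

Note that a measure like this only counts representations as "the same" up to a chosen class of transformations; whether that is the right equivalence class for saying two models have converged on the same objects is itself part of the question.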
I think that this question might tie in to those arguments about indispensability and the models of science, but now in a kind of alien form, because your 'models' are strange linear combinations of semi-ineffable features in a high-dimensional feature space. It’s a different kind of 'model', a probabilistic one, which requires different consideration than a formal deterministic model. But I feel as though there is some underlying similarity here in the question: “Here’s some useful description of the world that runs on this set of objects. Are these objects real? Or, more precisely, should we believe that these objects exist?”
Footnotes
- Ok, this is a complicated claim that I don't consider entirely true. In domains where science can't reach very well right now, e.g. "how does history work?" or "how does the mind work?", the comprehensive models that science aims to build are not really capable of doing a ton; economics, for example, cannot describe all of history. Each of these models is but an interface for the complex phenomenon. I think your position on this is perhaps parallel to your position on whether psychoanalysis, French theory, and various other kinds of continental philosophy are useful for understanding the world. But that is not the point of this post. ↩