Musings about language and literacy and learning

The Algebra of Language: Unveiling the Statistical Tapestry of Form and Meaning

A statistical tapestry

“. . . the fact, as suggested by these findings, that semantic properties can be extracted from the formal manipulation of pure syntactic properties – that meaning can emerge from pure form – is undoubtedly one of the most stimulating ideas of our time.”

The Structure of Meaning in Language: Parallel Narratives in Linear Algebra and Category Theory

In our last post, we began exploring what Large Language Models (LLMs) and their uncanny abilities might tell us about language itself. I posited that the power of LLMs stems from the statistical nature of language.

But what is that statistical nature of language?

A couple of years ago, I happened to listen to a podcast conversation between physicist Sean Carroll and mathematician Tai-Danae Bradley that touched on this topic, and I found it quite fascinating. So it came back to mind as I was pondering all of this. In the conversation, Bradley describes the algebraic nature of language that arises from the concatenation of words. She notes that the statistics and probabilities of word co-occurrences can serve as a proxy for grammar rules in modeling language, which is why LLMs can generate coherent text without any explicit grammar rules.
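To make that concrete, here is a minimal sketch of the idea (my own toy example, not anything from the conversation): a bigram model in Python that, given nothing but counts of which words follow which in a tiny invented corpus, can generate word sequences that respect English word order without a single explicit grammar rule. Real language models work over vastly larger corpora and much longer contexts, but the principle is the same.

```python
# A toy bigram model: the only "grammar" is the set of co-occurrence
# counts observed in a tiny, invented corpus.
import random
from collections import defaultdict

corpus = (
    "the fire truck raced down the street "
    "the red truck stopped at the station "
    "the fire fighters climbed onto the truck"
).split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def generate(start="the", length=8):
    """Sample a word sequence using only the observed co-occurrence statistics."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = bigram_counts.get(word)
        if not followers:
            break
        words, weights = zip(*followers.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate())  # output varies, but it follows the corpus's word order
```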

She also shares a result from category theory called the Yoneda Lemma:

“The Yoneda Lemma says if you want to understand an object, a mathematical object, like a group or a space, or a set, the Yoneda Lemma says that all of the information about that object is contained in the totality of relationships that object has with all other objects in its environment.”

She then links that mathematical concept to linguistics:

“. . . there’s a linguist, John Firth, I think in a 1957 paper, he says, “You shall know a word by the company it keeps. . . So what’s the meaning of fire truck? Well, it’s kind of like all of the contexts in which the word fire truck appears in the English language. . . everything I need to know about this word, the meaning of the word fire truck, is contained in the network of ways that word fits into the language.”

Since this interview, frontier LLMs have demonstrated just how much meaning can be derived from the contexts and co-occurrences in which words appear in a body of language.
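Here is a toy illustration of that distributional idea (again my own sketch, built on an invented mini-corpus, not the method behind any particular model): represent each word by the counts of the words that appear near it, and words that keep similar company end up with similar vectors.

```python
# Represent each word by the words that appear within a small window
# around it, then compare those context vectors with cosine similarity.
# The sentences are invented purely for illustration.
import numpy as np

sentences = [
    "the fire truck raced to the fire".split(),
    "the ambulance raced to the hospital".split(),
    "she ate a ripe apple".split(),
    "he ate a ripe pear".split(),
]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
window = 2  # how many neighbors on each side count as "company"

# Build a word-by-word co-occurrence matrix.
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if i != j:
                counts[index[w], index[s[j]]] += 1

def similarity(a, b):
    """Cosine similarity of two words' context vectors."""
    u, v = counts[index[a]], counts[index[b]]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(similarity("truck", "ambulance"))  # higher: these words keep similar company
print(similarity("truck", "pear"))       # lower: they keep different company
```

Even at this miniature scale, no meanings were supplied anywhere; the similarity falls out of nothing but the company each word keeps.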

In a more recent paper, Bradley and co-authors Gastaldi and Terilla make the statement that I opened this post with, which I will repeat here, as it's worth pondering:

“. . . the fact, as suggested by these findings, that semantic properties can be extracted from the formal manipulation of pure syntactic properties – that meaning can emerge from pure form – is undoubtedly one of the most stimulating ideas of our time.” [bold added]

They go on to further state:

“Therefore, the surprising properties exhibited by embeddings are less the consequence of some magical attribute of neural models than the algebraic structure underlying linguistic data found in corpora of text.”

In other words: LLMs (a type of artificial neural network) derive their generative linguistic capabilities from the algebraic and statistical properties of the texts they are trained upon. And the fact that they can do so suggests that the form and structure of language are intimately intertwined with its meaning.
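To get a feel for what that algebraic structure can look like, here is a sketch in the spirit of classical methods such as latent semantic analysis (not the authors' specific construction): factor a co-occurrence matrix with a truncated singular value decomposition, and the resulting low-dimensional vectors already behave like miniature word embeddings. The counts below are invented purely for illustration.

```python
# Hypothetical co-occurrence counts for six words (rows and columns in
# the same order as `vocab`); the numbers are made up for illustration.
import numpy as np

vocab = ["truck", "ambulance", "siren", "apple", "pear", "ripe"]
counts = np.array([
    [0, 2, 5, 0, 0, 0],   # truck
    [2, 0, 4, 0, 0, 0],   # ambulance
    [5, 4, 0, 0, 0, 0],   # siren
    [0, 0, 0, 0, 3, 4],   # apple
    [0, 0, 0, 3, 0, 5],   # pear
    [0, 0, 0, 4, 5, 0],   # ripe
], dtype=float)

# A truncated SVD compresses the raw counts into low-dimensional word
# vectors: pure linear algebra over corpus statistics.
U, S, _ = np.linalg.svd(counts)
embeddings = U[:, :2] * S[:2]   # a 2-dimensional "embedding" for each word

def sim(a, b):
    """Cosine similarity of two words' embedding vectors."""
    u, v = embeddings[vocab.index(a)], embeddings[vocab.index(b)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(sim("truck", "ambulance"))  # higher: similar co-occurrence patterns
print(sim("truck", "apple"))      # lower: different patterns
```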

In a previous post, I referred to a Bloom and Lahey model from 1978, which delineates three components of language (form, meaning, and use):

Bloom and Lahey's model of language

Over the past few decades of linguistic research and language teaching, there have been trends that favor one of those components over the others. In the past, teachers of English as a second language, for example, may have put a stronger emphasis on the teaching of grammar, while more recent TESOL teachers may put a stronger focus on meaning over form. A more current strand of linguistics research focuses on “usage-based” theories.

There is a parallel in the education sphere related to reading: research and practice have placed varying emphases on code-based (form) versus meaning-based skills (as in the Simple View of Reading), with a more recent shift back toward a code-based emphasis. The field now seems defined by a perpetual tug-of-war between the two.

The Simple View of Reading

Rarely made explicit in any of these shifts in focus has been the assumption that form and meaning can be completely disentangled. After all, a writing system is a somewhat arbitrary pairing of spoken sounds with symbols. This is, according to a 1980 account by Gough and Hillinger, one of the reasons that learning to decode can be so very difficult: there isn't meaning in those symbols in and of themselves. It is rather the abstraction of what they represent that we need to learn.

Yet what if form and meaning are much more closely interwoven than we may have assumed? What if, in fact, a large quantity of meaning can be derived merely from an accumulated volume of statistical associations of words in sentences?

That LLMs have the abilities they do, given that they have not acquired language in the way that humans have (via social and physical interaction in the world) and that they operate without cognition, would seem to suggest that the “mere” form and structure of a language possess far more information about our world than we would have assumed, and that meaning is deeply and fundamentally interwoven with form.

More to ponder!

Some additional interesting sources on these topics to further explore (thanks to Copilot for the suggestions):

#AI #language #learning #statistics #mathematics #cognition #machinelearning