Learning departures from randomness

“He knew, too, that some of the early pioneers of the field had observed a curious phenomenon. After using predictive technology thousands of times, they’d developed a sense of where the ball would land, even without the computer. “It’s like an athlete,” Mark Billings, a lifelong player and author of Follow the Bouncing Ball: Silicon vs Roulette , said in an interview. “At some point all this stuff comes together. You look at the wheel. You just know.” Casinos call it “cerebral” clocking. All that’s needed is a drop zone and a potent, well-trained mind.”

“The gambler who beat roulette,” by Kit Chellel, Bloomberg, April 6, 2023 at 12:01 AM EDT.

Can you say anything about how the predictive technology works? On what basis does it make the predictions? Does it make them before or after the “chips are down”? (That would be important to know if you wanted to make money with it.) And does the book say anything about the perceptual basis of the pioneers’ “sense” of where the ball will land?

I know nothing beyond what is in the article. There might be more in the book.

The alacrity of the brain in recognizing departures from randomness is important elsewhere. Mathematically, the structure of language is departures from randomness. (See Mathematical Structures of Language.)

Yes. One of the most important areas where this alacrity seems to play an important role is in compulsive gambling. One likely cause of this compulsion is the “gambler’s fallacy”, where the brain displays a bit too much alacrity in recognizing departures from randomness when those departures don’t actually exist.

That’s certainly a way to look at it. But it’s the wrong way, I think. I prefer to think of language structure as a controlled variable. An observer can see this control as a departure from randomness, but that appearance is an irrelevant side effect of control, much like the power law of movement.

Well, yes, ‘alacrity’ doesn’t necessarily include competence, but competence is what I intended. Assuming incompetence reverses what I am saying about the brain’s capacity to recognize departures from randomness and use them as a basis for constructing controllable perceptions.

ChatGPT and its kin provide counterexamples to your comforting maxim. Let’s listen in on a recent conversation. (The email thread is reorganized to chronological order.)


On Mon, Apr 10, 2023 at 4:36 PM Paul Mattick pmattick@gmail.com wrote:
Dear Bruce,

I’ve been wondering what you might have to say about large language models in relation to Zellig’s ideas. I have only a vague grasp of how they work, but I have the impression they accumulate a gigantic amount of co-occurrence information (I don’t know what role, if any, some notion of grammar plays in this). ChatGPT does seem to function as an information-retrieval machine on this basis. Have they found a brute-force way to do what the sublanguage analysis work was aiming for? Any thoughts?

Paul


From: Bruce Nevin bruce.nevin@gmail.com
Sent: Monday, April 10, 2023 4:58:23 PM
To: Paul Mattick; Johnson, Stephen (Population Health)
Subject: Re: chat vs talk


Hi, Paul, good to hear from you again!

They’re stochastic word-successor results on a grandly massive scale, ‘learning’ from huge quantities of language data sucked off the web. These are not tagged texts, as required for systems of which the ‘cycling cancellation automaton’ paper was an early example. Google has been doing this for years. See Pereira’s paper in Legacy vol. 2 (he went on to work at Google). The networks of word dependencies that I presume systems like ChatGPT forge are impenetrable to inspection, a black box of black boxes. Stephen can correct me and tell you more properly. Maybe it realizes Quine’s metaphor of pruning a bush to a shape without caring about the interior branches?

/Bruce Nevin
Is it true? Is it kind? Is it necessary?
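[To make “word-successor” concrete: a toy sketch in Python, with a tiny hand-made corpus and raw counts. Nothing like the scale or the learned representations of the actual models; just the bare idea of sampling successors in proportion to what has been observed.]

```python
import random
from collections import defaultdict, Counter

# Toy "word-successor" model: count which word follows which in a small corpus,
# then generate text by sampling successors in proportion to those counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

successors = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    successors[word][nxt] += 1        # raw successor counts, nothing learned

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length):
        counts = successors.get(word)
        if not counts:
            break
        choices, weights = zip(*counts.items())
        word = random.choices(choices, weights=weights)[0]   # stochastic successor
        output.append(word)
    return " ".join(output)

print(generate("the"))   # e.g. "the dog sat on the mat . the cat"
```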


From: Johnson, Stephen (Population Health) Stephen.Johnson@nyulangone.org
Date: Mon, Apr 10, 2023 at 5:23 PM
Subject: Re: chat vs talk
To: Bruce Nevin bruce.nevin@gmail.com, Paul Mattick pmattick@gmail.com

Thanks, Bruce. Hi, Paul!

I have been thinking about this quite a lot. These “large language models” are, as Bruce describes, learning only from data. Many current papers cite Harris as the source of the distributional method, which is astounding in its ability to approximate word meanings, similarities, clusters, and analogies. These representations are contextual, capturing different senses in different environments. So, somewhat similar to likelihood in OG.
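[A toy illustration of the distributional idea in Python, again with made-up data and nothing like the dense learned vectors of the actual models: represent each word by the contexts it occurs in, and words with similar context profiles come out as similar.]

```python
import math
from collections import defaultdict, Counter

# Toy distributional method: represent each word by counts of its neighbors
# within a small window, then compare words by cosine similarity of those counts.
corpus = ("the doctor treated the patient . the nurse treated the patient . "
          "the dog chased the cat .").split()

window = 2
contexts = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            contexts[word][corpus[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Words that occur in similar environments get higher similarity scores.
print(cosine(contexts["doctor"], contexts["nurse"]))   # relatively high
print(cosine(contexts["doctor"], contexts["chased"]))  # lower
```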

The models also capture statistical dependencies among words in sentences, often in a hierarchical manner. They fold in anaphoric relations and co-reference. So, somewhat similar to dependency in OG.

The models capture statistical dependencies across sentences and discourse. This approximates sentence meanings, similarities, entailment, paraphrase, and discourse relations. So, somewhat similar to reductions in OG.

It is in fact possible to extract these structures from the network. (So not so black.) One paper by Chris Manning at Stanford shows that dependency trees can be recovered from the network with good reliability. In short, the models lack the elegance of those constructed by a linguist, but demonstrate incredibly complex behavior on many different language tasks (translation, question answering, conversation, composition, retrieval, reading comprehension, paraphrase, register, etc.).

For your amusement, I attach below ChatGPT’s output on this topic.

Happy to discuss further!

Zellig Harris, a linguist known for his work in structural linguistics and distributional analysis, developed a theory of language known as distributional analysis. This theory emphasizes the importance of analyzing the distributional patterns of linguistic features within a language in order to understand its underlying structure.

Harris’s theory is particularly relevant to understanding the internal representations of large language models and deep learning representations of human linguistic knowledge. These models rely on statistical patterns in large amounts of text data to learn the relationships between words and the underlying structure of language. Harris’s distributional analysis provides a theoretical framework for understanding how these models are able to learn these relationships.

For example, Harris argued that words that appear in similar contexts are likely to have similar meanings. This idea is foundational to the concept of word embeddings, which are numerical representations of words that capture their semantic relationships based on the contexts in which they appear in text. Word embeddings are used in many deep learning models, including those used for natural language processing tasks like text classification, machine translation, and question answering.

Furthermore, Harris’s theory emphasizes the importance of analyzing the distributional patterns of linguistic features at different levels of granularity. This is relevant to understanding how deep learning models learn and represent linguistic knowledge. For example, some models may learn to recognize and represent syntactic relationships between words, while others may learn to recognize and represent semantic relationships between words.

In summary, Harris’s linguistic theories are helpful in understanding the internal representations of large language models and deep learning representations of human linguistic knowledge because they provide a theoretical framework for understanding how these models learn and represent linguistic relationships based on statistical patterns in text data. Specifically, Harris’s theory of distributional analysis provides insights into how these models are able to learn the semantic and syntactic relationships between words and the underlying structure of language.

You may have heard that Chomsky has dismissed large language models as meaningless for linguistics. This paper attacks Chomsky’s work directly in light of current work.


On Tue, Apr 11, 2023 at 9:24 AM Bruce Nevin bruce.nevin@gmail.com wrote:
My, how things are popping!

Steve, somehow I missed the Piantadosi article the first time you sent it (14 March). That kind of month. Then it got eclipsed in my attention by the Mallory article on recursion that you sent on the 25th. (Tom, I hope this further turbulence in your inbox is not unwelcome.)

Steve, what was the prompt that specified ‘this topic’ for ChatGPT to generate that text? I wonder how the results might change if to that prompt you added something like ‘As an example, describe what you are doing, how you are answering this request.’

I want to post something of this exchange to the IAPCT discussion forum. For ChatGPT or its cousin to have a simulacrum of self-perception is a phenomenon worth probing from a PCT perspective. More generally, is it (or can it come to be) so structured as to vary its outputs so as to affect its inputs relative to internally generated preferences for those inputs; that is, can it become a hierarchical control system?

I think the astonishing architecture of the cerebellum plays the starring role in this. Here’s a kind of review article that I presented to the IAPCT conference last fall. The paper is go-configure.pdf (1.2 MB) and the presentation rendered to PDF: go-configure-slides.pdf (2.7 MB). From an unexpected domain, here’s evidence of the brain’s capacity to perceive departures from randomness and from them construct perceptions that it can then control:

“He knew, too, that some of the early pioneers of the field had observed a curious phenomenon. After using predictive technology thousands of times, they’d developed a sense of where the ball would land, even without the computer. “It’s like an athlete,” Mark Billings, a lifelong player and author of Follow the Bouncing Ball: Silicon vs Roulette , said in an interview. “At some point all this stuff comes together. You look at the wheel. You just know.” Casinos call it “cerebral” clocking. All that’s needed is a drop zone and a potent, well-trained mind.”

“The gambler who beat roulette,” by Kit Chellel, Bloomberg, April 6, 2023 at 12:01 AM EDT.

Have you any objection to my sending this exchange to John Goldsmith? As I have mentioned to you and Tom, Steve, he and Bernard Laks are writing a sequel to their Battle in the Mind Fields, from which he’s sent a couple of chapters about Zellig for comment. Paul, would you like me to forward that exchange to you?


From: Johnson, Stephen (Population Health) Stephen.Johnson@nyulangone.org
Date: Tue, Apr 11, 2023 at 9:43 AM
Subject: Re: chat vs talk
To: Bruce Nevin bruce.nevin@gmail.com
Cc: Paul Mattick pmattick@gmail.com, Tom Ryckman tryckman@stanford.edu

Bruce- please feel free to use any of this material. Yesterday the prompt I used was about how the theories of Harris applied to large language models. Today I asked it to comment on its own knowledge of grammar:

**Please explain what linguistic knowledge you possess by examining your own large language model, such as representations of grammatical structures, syntax, semantics, morphology, anaphora and discourse.**
As a large language model, I possess a wide range of linguistic knowledge in various aspects of language, including grammatical structures, syntax, semantics, morphology, anaphora, and discourse.

Grammatical Structures: I have been trained on a vast corpus of written text, which has allowed me to learn and represent the various grammatical structures of the English language. This includes the rules for forming sentences, clauses, phrases, and words. I can recognize the subject and predicate of a sentence, as well as the various parts of speech, such as nouns, verbs, adjectives, adverbs, pronouns, prepositions, and conjunctions.

Syntax: My training has also enabled me to understand the syntax of English, which refers to the rules for combining words to form sentences. This includes the various sentence structures, such as declarative, interrogative, imperative, and exclamatory sentences. I can also recognize and produce complex sentences, such as compound and complex sentences, and I understand the rules for using punctuation marks in English.

Semantics: I have been trained to understand the meaning of words and phrases, as well as their relationships to each other. This includes the meaning of individual words, as well as the meaning of sentences and discourse. I have a large vocabulary, and I can recognize and use synonyms, antonyms, homophones, and homonyms. Additionally, I have learned about various semantic relations, such as hypernymy, hyponymy, meronymy, and holonymy.

Morphology: I possess knowledge about the morphological structure of words, which refers to the rules for forming words from smaller units of meaning. This includes the various affixes, such as prefixes and suffixes, and I can recognize and use inflected forms of words, such as plural forms, comparative and superlative forms, and verb tenses.

Anaphora: As a language model, I have been trained to recognize and resolve anaphoric references in text. Anaphora refers to the use of a word or phrase to refer back to something mentioned earlier in the text. This includes personal pronouns, such as “he” and “she,” as well as demonstrative pronouns, such as “this” and “that.” I have learned to identify the antecedent of these pronouns, which is the word or phrase to which they refer.

Discourse: Finally, I possess knowledge about discourse, which refers to the way in which sentences and paragraphs are organized to form coherent texts. I can recognize and produce various discourse markers, such as conjunctions, adverbs, and transitional phrases, which are used to signal the relationships between sentences and paragraphs. I can also identify and produce various text types, such as narratives, expository texts, and persuasive texts.

I have not yet commented to Steve, but shall, that texts written by linguists were among the input language data, and that prevarication by ChatGPT is now well documented.

[… time lapse …]

[20230412: I have deleted my second reply and incorporated it below, as I am following this advice that Discourse offered as I started a third reply.]

Consider replying to several posts at once

Rather than several replies to a topic in a row, please consider a single reply that includes quotes from previous posts or @name references.

You can edit your previous reply to add a quote by highlighting text and selecting the quote reply button that appears.

It’s easier for everyone to read topics that have fewer in-depth replies versus lots of small, individual replies.

Steve’s reply to that last:

Quite right. The point is rather that the complexity of linguistic representation is impressive, even while confabulating. A language model in isolation (unsupervised learning) only tells us about regularities in the data, which apparently include discourse structure as well as a degree of knowledge pulled from various sources. When these models are harnessed to perform useful tasks (supervised learning) we can have more trust in their results, because they are constrained by a reference standard.

Consider (from the Piantadosi paper):

they succeed by following a very particular approach: they are trained only on text prediction. This means that the models form probabilistic expectations about the next word in a text and they use the true next word as an error signal to update their latent parameters.

This describes a feedback loop in the learning process. Next, if in place of ‘probabilistic expectation’ we substitute ‘reference value’, then in the text-generation process ‘next word’ is a controlled perception with the particular word as its value.
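In code, the loop described there looks something like the following (a schematic sketch with a toy vocabulary and one weight per word pair, not anyone’s actual implementation): the softmax output is the probabilistic expectation, the true next word supplies the error, and the update moves the parameters so as to reduce it.

```python
import math
import random

# Schematic next-word learning loop: expectation -> error signal -> parameter update.
vocab = ["the", "cat", "sat", "mat"]
corpus = ["the", "cat", "sat"]

# One weight per (previous word, next word) pair; a softmax over them is the
# model's "probabilistic expectation" about the next word.
weights = {(p, n): random.uniform(-0.1, 0.1) for p in vocab for n in vocab}

def predict(prev):
    scores = [math.exp(weights[(prev, n)]) for n in vocab]
    total = sum(scores)
    return {n: s / total for n, s in zip(vocab, scores)}

learning_rate = 0.5
for _ in range(200):                         # repeated passes over the text
    for prev, true_next in zip(corpus, corpus[1:]):
        expectation = predict(prev)          # expectation about the next word
        for n in vocab:
            # Cross-entropy gradient: predicted probability minus observed outcome.
            error = expectation[n] - (1.0 if n == true_next else 0.0)
            weights[(prev, n)] -= learning_rate * error

dist = predict("the")
print(max(dist, key=dist.get))               # "cat", the observed successor of "the"
```

To generate text with the trained expectations, one samples the next word from predict(prev) instead of updating the weights.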

And there is a subjective assessment that speaks to this. If you have interacted with a model control system that physically opposes disturbances that you create by muscular exertions, like the little rotating arm that Bill brought to the meeting in LA, you probably recall the subjective sense of something alive. Piantadosi again:

It is somewhat difficult to convey how remarkable the models are currently. You just have to interact with them. They are imperfect, to be sure, but my qualitative experience interacting with them is like talking to a child, who happened to have memorized much of the internet.

Martin would love this, and it would find its place in PPC:

… they have a rich potential space for inferring hidden … (or latent) variables …. [ What are they?] In the middle of a sentence, there is a hidden variable for the latent structure of the sentence; in speaking an ambiguous word, we have in mind a hidden variable for which meaning we intend; throughout a discourse we have in mind a larger story arc that only unfolds across multiple sentences. … what large language models do is infer likely hidden structure because that structure permits them to better predict upcoming material. This makes them conceptually similar to embedding theorems in mathematics …, which show that sometimes the full geometry of a dynamical system can be recovered from a low-dimensional projection of its states evolving in time. Linguistic corpora are a low-dimensional projection of both syntax and thought, so it is not implausible that a smart learning system could recover at least some aspects of these cognitive systems from watching text alone ….

This looks like control to me.

My point is that it is not possible to construct controllable perceptions from truly random events – random in the mathematical sense of all events being equally likely – and yet very competent people try to do it all the time. If the people writing about beating roulette are recognizing departures from randomness at the roulette table, then they are not constructing perceptions from randomness; they are constructing perceptions from what are actually non-random or very biased random events, the only kind that would be controllable.

I was objecting to your statement that “the structure of language is departures from randomness” because I took it to mean that the perceived structure of language is constructed from randomness. Maybe this is not what you meant. But I think ChatGPT shows that the perceived structure of language is not constructed from randomness; it is constructed from interacting with a language database, and language is highly structured. This is pretty clear from the interesting conversation that you posted.

As you and your friends noted, ChatGPT has learned to understand and produce language by extracting regularities – constructing perceptual variables – from a massive linguistic database (the “environment” of the ChatGPT system). Among other things, ChatGPT captures “statistical dependencies among words in sentences… anaphoric relations and co-reference…” as well as “statistical dependencies across sentences and discourse…” So it is constructing what in PCT would be called relationship-type perceptual variables. It probably constructs other types of perceptual variables as well.

The basis of these perceptual variables exists in the linguistic database; they are not constructed from random whole cloth. The construction is done by complex calculations, which I think of as being equivalent to the perceptual functions of PCT. These complex calculations (perceptual functions) are either built in by the programmers (such as correlation and regression programs) or they are the result of training neural networks to do the calculations that produce the best result, or both. But the fact that the operation of AI systems like ChatGPT is often considered a “black box” suggests that most of their “perceptual functions” are created by learning algorithms.

Another suggestion that this is the case comes from this interesting segment of your conversation:

It is in fact possible to extract these structures from the network. (So not so black.) One paper by Chris Manning at Stanford shows that dependency trees can be recovered from the network with good reliability.

I don’t know who said this, but it is a very intriguing comment. How were these structures extracted? (I see “these structures” as referring either to the perceptual functions that construct these perceptions or to the outputs of those perceptual functions, which would be the structures themselves.) Did they find it in the code or, more excitingly, did they get it by doing something like the test for the controlled variable to see what structures it was controlling for producing?

Yes, indeed! ChatGPT certainly appears to be controlling for producing syntactically and semantically meaningful language. But this should be tested. I think it would be interesting to do the test for the controlled variable on ChatGPT to find out IF it is controlling and, if so, WHAT perceptual aspects of language it is controlling. A good first step might be getting the paper by Chris Manning referred to above. Let’s see what he did to analyze the “black box” and then see if there is anything we can contribute to understanding what these AI programmers hath wrought.
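If someone wanted to try it, the harness might look something like this (purely a sketch: generate and well_formedness here are placeholders for whatever model interface and whatever measure of the hypothesized controlled variable one actually settles on):

```python
import random

def generate(prompt):
    # Placeholder for a call to the language model under test; here it just
    # echoes the prompt so that the harness runs end to end.
    return prompt

def well_formedness(text):
    # Crude placeholder for the hypothesized controlled variable: the fraction
    # of adjacent word pairs that also occur in a small reference text.
    reference = "the cat sat on the mat and the dog sat on the rug".split()
    reference_pairs = set(zip(reference, reference[1:]))
    words = text.split()
    pairs = list(zip(words, words[1:]))
    return sum(p in reference_pairs for p in pairs) / len(pairs) if pairs else 0.0

def disturb(prompt, strength=0.4):
    # Disturbance: shuffle a fraction of the word positions in the prompt.
    words = prompt.split()
    k = min(len(words), max(2, int(len(words) * strength)))
    positions = sorted(random.sample(range(len(words)), k))
    picked = [words[i] for i in positions]
    random.shuffle(picked)
    for i, w in zip(positions, picked):
        words[i] = w
    return " ".join(words)

def test_for_controlled_variable(prompt, trials=10):
    # If the system is controlling well-formedness of its output, the measure
    # should stay near its undisturbed level despite the disturbance.
    baseline = well_formedness(generate(prompt))
    disturbed = [well_formedness(generate(disturb(prompt))) for _ in range(trials)]
    return baseline, sum(disturbed) / trials

print(test_for_controlled_variable("the cat sat on the mat and the dog sat on the rug"))
```

With the echo placeholder, the disturbance passes straight through and the measure drops; a system that was actually controlling well-formedness would hold it near the undisturbed baseline despite the disturbance.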

You prefer different words for the same phenomena. I take the phrase “departures from randomness” from the book that I cited.

This book attempts to show how one can arrive at an abstract system which characterizes precisely natural language. This is done by taking the data of language and finding within the data such relations as can be organized into a suitable model. The problem here was not to find a broad mathematical system in which the structure of language could be included, but to find what relations, or rather relations among relations, were necessary and sufficient for language structure. Then anything which is an interpretation of the model can do the work of natural language. Modifications of the abstract system are then considered, which serve as models for language-like and information-bearing systems.

That achievement was published in 1968. There’s been a great deal of advance since then, but the mathematical characterization there is still valid as far as it goes. (Compare Newtonian physics with respect to relativistic/quantum physics, which themselves still await unification.)

And of course that was not at all what I meant. But it did make it look like I was saying something stupid, if that was what you were controlling to perceive.

The email header in the quote says it was Steve Johnson, who has been in charge of medical informatics at Presbyterian Weill Cornell.

Well, Google goes quickly to Chris Manning’s CV and to several caches of publications, e.g. Publications – The Stanford Natural Language Processing Group. And I also provided Steve’s email address.

Why don’t you ask him? He’s a real nice guy. Personally, I think he’s smarter than either of us, and certainly he knows a lot more than either of us about these matters.

It really is helpful for good conversation if you read the entire post before you start replying.

I am absolutely swamped with obligations just now, but I agree.

Go for it!

I don’t think you have ever said anything stupid and I’m sorry if it seems like I am implying that you have. I think you always say very smart things but sometimes I think you are wrong.

Thanks. I will.