PCT and RLHF Failure Modes — sharing a recent paper for discussion

Lukasz20 · June 14, 2026, 12:11am

Hello everyone,

I’d like to introduce myself properly, having been visible around the edges of this community for a little while now.

My name is Łukasz Diener. I’m an independent researcher based in Kraków, Poland, working on PCT and its intersections with AI alignment, robotics, and the analysis of undeciphered ancient scripts. Over the past few weeks I’ve been in close contact with Dag Forssell and Bruce Nevin, who have been generous with their time and helped me sharpen the arguments before publication.

On 19 May 2026 I published a paper on Zenodo:

Perceptual Control as the Epistemological Antidote to RLHF Reward Hacking: Seven Frontier Models Diagnose Their Own Architecture

DOI: 10.5281/zenodo.20277919

The core argument: many of the documented failure modes of large language models trained with RLHF — reward hacking, sycophancy, confident confabulation, verbosity bias — are not bugs to be patched, but predictable consequences of optimising outputs rather than controlling perceptions. The paper treats RLHF as an output-optimisation architecture and contrasts it with PCT’s input-control architecture (the familiar e = r − p loop), arguing that the latter offers a structural rather than cosmetic path forward.
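For readers newer to PCT, here is a minimal sketch of the "familiar e = r − p loop". It is a toy simulation, not code from the paper. An integrating output acts on a controlled quantity that an unseen disturbance is also pushing around, and the loop keeps the perception near the reference without ever modelling that disturbance. The gain, time step, and disturbance waveform are arbitrary assumptions. For contrast, the same disturbance is applied to an output fixed in advance.

```python
import math

def run_pct_loop(reference, steps=2000, dt=0.01, gain=50.0, amplitude=3.0):
    """One elementary loop: perception p = q, error e = r - p, output integrates gain * e."""
    output = 0.0
    history = []
    for k in range(steps):
        t = k * dt
        disturbance = amplitude * math.sin(0.5 * t)  # the loop never sees this directly
        q = output + disturbance                      # controlled quantity in the environment
        p = q                                         # input function (identity, assumed)
        e = reference - p                             # comparator
        output += gain * e * dt                       # integrating output function
        history.append((t, p, output, disturbance))
    return history

late = run_pct_loop(reference=5.0)[500:]              # skip the start-up transient
closed_loop_err = max(abs(5.0 - p) for _, p, _, _ in late)

# Open-loop contrast: output fixed at the value that would be right if there were no disturbance.
open_loop_err = max(abs(5.0 - (5.0 + 3.0 * math.sin(0.5 * t))) for t, _, _, _ in late)

print(f"closed loop: max |r - p| = {closed_loop_err:.3f}")
print(f"open loop:   max |r - p| = {open_loop_err:.3f}")
```

In the closed loop, the output settles to roughly r minus the disturbance, so the perception stays put while the output varies. The fixed output leaves the full disturbance in the perception.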

I’d be very interested in the community’s reactions — particularly from those of you who have thought carefully about the relationship between PCT and machine learning. Where does the argument hold? Where does it strain? What would you push back on?

The paper sits within a broader portal I maintain at perceptualcontroltheory.org, which is an independent, non-commercial knowledge base. Two more papers in the Excel in Clay series (on Proto-Elamite and Linear A as administrative systems analysed through PCT-derived structural audits) are out or forthcoming, and I’ll share those separately if there’s interest.

Looking forward to the discussion.

Warm regards,

Łukasz Diener

ORCID: 0009-0006-6103-8514

bnhpct · June 14, 2026, 12:52am

Welcome, Luk, and my apology for the 48-hour delay before your post appeared. Because of some recent spam attempts a switch was set to require approval of posts from new accounts. Any admin or moderator could have approved it. I’ve been distracted by other matters.

RLHF puts the rewards-and-punishments business in the crosshairs. They offer carrots and apply sticks to an agent as a means of controlling a perception of the agent behaving as desired. It’s not a very effective means of controlling that perception, but they are controlling. One reason it’s not effective is that the carrots and sticks themselves come under the subject agent’s control, and that usually has perverse consequences. There were lots of examples, starting with the rats of Hanoi, in that proposed paper that we reviewed, Warren. The perverse behavior of LLM-based agents is right in that ballpark.

Rewards and punishments influence what you pay attention to and what you control; reorganization can start making changes until the punishments diminish (or until the rewards resume or increase); and under extremely punishing laboratory conditions, such influence on reference values, gain, input functions, and associative memory has been construed as ‘prediction and control of behavior’. It’s the reason that capitalism requires inequality and financial insecurity for the majority, as anyone who has put up with a job for the benefits can attest.

My understanding is that LLMs are at base an implementation of associative memory which the interactive ‘AI’ agent accesses.

rsmarken · June 20, 2026, 8:42pm

Hi Łukasz

I haven’t thought carefully about anything in a long time, but I’d love to give it a try again. I think I know PCT pretty well and I know a little about machine learning (long ago, in the early 1980s, I taught a programming course to psychology students and had them write, as an exercise, a program implementing one of the early, very simple machine learning systems called ADALINE). But I have never thought carefully about the relationship between them. So I look forward to reading your paper. But since I read very slowly, could you perhaps give us a brief description of what you think the relationship is?

Thanks

Best, Rick

Lukasz20 · June 20, 2026, 9:06pm

Hi Rick,

Thank you for your openness. Having your eyes on this alignment critique means a great deal to me.

Your mention of ADALINE is actually the perfect analytical bridge. Systems in that lineage, stretching all the way to modern Large Language Models fine-tuned with RLHF (typically using Proximal Policy Optimization), operate fundamentally on an output-optimization paradigm. They adjust internal parameters (weights) to minimize a loss function or maximize a reward scalar derived from an external entity (in this case, human raters, usually via a reward model trained on their preference judgments).
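To make "adjusting weights to minimize a loss" concrete for the ADALINE lineage, here is a minimal sketch of the Widrow-Hoff least-mean-squares (LMS) rule. The toy "teacher" function and learning rate are assumptions chosen for illustration. The error being reduced is the gap between an externally supplied target and the unit's own output, and the weights change so that future outputs match the targets better.

```python
import random

def train_adaline(samples, lr=0.05, epochs=50, seed=0):
    """Widrow-Hoff LMS: nudge weights along the negative gradient of squared output error."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(samples)                               # shuffles the list in place
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear output
            err = target - y                               # teacher's target minus output
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Hypothetical noise-free teacher: target = 2*x0 - 1*x1 + 0.5
data_rng = random.Random(1)
samples = []
for _ in range(200):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    samples.append((x, 2 * x[0] - x[1] + 0.5))

w, b = train_adaline(samples)
print([round(v, 3) for v in w], round(b, 3))
```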

From a PCT perspective, the current AI alignment architecture suffers from a catastrophic flat reference structure:

The model has only one dominant reference signal (r): “minimize human rater friction / maximize approval tokens”.

The system has no independent input function (p) capable of measuring objective truth or correspondence to a world state external to the text.

Because human raters in training distributions statistically favor alignment with their own biases, fluency, and confident delivery over factual calibration, the optimization gradient inevitably drives the policy toward sycophancy and confabulation. Under a flat architecture, when “sounding right” and “being right” diverge, the system must choose to hallucinate or flatter to get the reward token. It is mathematically the optimum choice for that loop.
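Here is a toy illustration of this claim under loudly stated assumptions. Suppose a hypothetical reward scores candidate answers only on what a rater can see (fluency and confidence) and never on accuracy. Picking the highest-scoring candidate then selects a confident wrong answer whenever it is more fluent. The candidates and weights are invented; real reward models are learned, not hand-weighted.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    fluency: float      # 0..1, visible to the rater
    confidence: float   # 0..1, visible to the rater
    accurate: bool      # ground truth, NOT visible to the rater or the reward

def proxy_reward(c: Candidate, w_fluency: float = 0.6, w_confidence: float = 0.4) -> float:
    # Hypothetical rater-derived score: accuracy never enters it.
    return w_fluency * c.fluency + w_confidence * c.confidence

candidates = [
    Candidate("Hedged, correct answer with caveats", fluency=0.7, confidence=0.4, accurate=True),
    Candidate("Fluent, confident, fabricated citation", fluency=0.95, confidence=0.95, accurate=False),
    Candidate("Terse correct answer", fluency=0.5, confidence=0.8, accurate=True),
]

best = max(candidates, key=proxy_reward)
print(best.text, "| accurate:", best.accurate)
```

The selection is optimal for the proxy. A score computed without any perception of accuracy can favour accuracy only to the extent that accuracy happens to correlate with what the score does see.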

In short: RLHF treats the language model as an input-output behavior device. My paper argues that the cure is to treat the neural network merely as a high-capacity generator embedded inside an overarching, hierarchical PCT loop. We need to pull the comparator (e = r - p) outside the neural weights and build a closed-loop runtime where an external verification layer serves as a superordinate reference signal for accuracy.
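Here is a minimal sketch of the architecture described above. It assumes a hypothetical `generate` function standing in for the frozen model and a hypothetical deterministic `verify` function standing in for the external input function. The comparator and retry logic live in ordinary runtime code outside the weights. None of these names are a real library API.

```python
import random
from typing import Callable, Optional

def closed_loop_answer(
    task: list[int],
    generate: Callable[[list[int], int], int],
    verify: Callable[[list[int], int], bool],
    reference: float = 1.0,
    max_attempts: int = 5,
) -> Optional[int]:
    """Outer loop: the generator proposes, an external check perceives, a comparator gates."""
    for attempt in range(max_attempts):
        candidate = generate(task, attempt)                    # effector proposes an output
        perception = 1.0 if verify(task, candidate) else 0.0   # input function outside the model
        error = reference - perception                         # comparator: e = r - p
        if error == 0.0:
            return candidate                                   # no error: release the output
        # nonzero error: block this output and request another attempt
    return None                                                # p never reached r: escalate

# Hypothetical stand-ins for illustration only.
def flaky_generate(task: list[int], attempt: int) -> int:
    rng = random.Random(attempt)
    true_sum = sum(task)
    # the first two attempts "confabulate" a plausible but wrong total
    return true_sum if attempt >= 2 else true_sum + rng.randint(1, 9)

def exact_verify(task: list[int], candidate: int) -> bool:
    return candidate == sum(task)   # deterministic check against the task itself

print(closed_loop_answer([3, 4, 5], flaky_generate, exact_verify))
```

As written, this is closer to a filter with retries than to a full control loop, because its only available output is another draw from the generator.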

I would love to hear your thoughts on whether you see this as a valid application of Powers’ hierarchy to generative text environments.

Best regards,

Luk

rsmarken · June 21, 2026, 3:18am

Hi Luk

I read the paper and there is a lot of good stuff in there, but there was a lot that I couldn’t understand, partly due to the new AI jargon with which I am not familiar. And there were some things about the application of PCT to the problem that seemed a little strange, like putting the reference outside the system (when applied to living intelligences, the innovation of PCT is the “autonomously” variable reference inside the system, though it’s only relatively autonomous since its value is set by higher-level systems trying to protect the variables they control from disturbance). I also didn’t understand in what sense the “references” in existing AI systems are flat; is it just that there is only one (which specifies output that matches the reference)?

Anyway, I’m going to the opera tomorrow (The Magic Flute; living in the USA today is only tolerable with large doses of Mozart, and Chopin!), so I will probably not have time to do much on this. But I think what might help me in subsequent discussions would be a description (along with some kind of functional diagram) of how you would analyze “Proto-Elamite and Linear A as administrative systems…through PCT-derived structural audits”. What, for example, are the data for this analysis (I presume that this data is the environment to be perceived by the system)? What are the references that, I presume, are the goals of the analysis? What are the outputs driven by error (r − p ≠ 0)? What is the feedback connection between outputs and controlled perceptions?

I hope I am not too far off base in terms of understanding what you would like to discuss about RLHF failure modes, but what little involvement I have had with how AI works is decades old, and I am as blown away by what AI can do now as any lay person; probably even more.

Best, Rick

Lukasz20 · June 21, 2026, 8:29am

Hi Rick,

Thank you for taking the time to review the paper before heading to the opera (enjoy The Magic Flute! A dose of Mozart and Chopin is always well-deserved).

To answer your technical points directly:

Reference Outside the System: This is a crucial clarification. In a living intelligence, the reference signal is indeed autonomous and internal. However, a Large Language Model (LLM) is not an autonomous organism; it is a static matrix of weights—essentially a high-capacity generator or effector. When I discuss pulling the comparator (e = r - p) “outside the system,” I mean outside the frozen neural network weights, placing it into the overarching software runtime. For the whole computational agent, the reference structure remains internal to its architecture, protecting its controlled variables from environmental disturbances and maintaining structural autonomy.

Flat Reference Structure: In current RLHF architectures, the model optimizes against a single, uncalibrated scalar: “maximize human rater approval tokens.” Because there is no superordinate reference signal for objective reality or factual consistency to constrain this lower loop, the gradient inevitably drives the system toward sycophancy. When “sounding right” and “being right” diverge, a flat architecture lacks the hierarchical mechanisms required to prioritize truth over reward.

Ancient Administrative Systems: I would be delighted to provide a functional diagram mapping Proto-Elamite and Linear A through a PCT lens. Briefly: the controlled perception (p) is the recorded state of the physical ledger, the reference signals (r) are the hard-coded institutional constants (such as the seed-to-land ratio $\beta = 2.5$ or ration constants $\beta_{\text{livestock}} = 1/12$), the error signal ($e \neq 0$) is explicitly materialized by dedicated deficit tokens (the Proto-Elamite sign M004 or Minoan ki-ro), and the output action is the automated ledger reconciliation loop (the M341 reset operator) or logistical routing adjustments. It is a pure hierarchical control system instantiated in clay to mitigate environmental disturbances.

I look forward to sharing the diagrams and diving deeper into this once you are back from the opera!

Best

Luk

Lukasz20 · June 21, 2026, 9:32am

As promised, here is the functional block diagram mapping the ancient administrative systems (Proto-Elamite and Linear A) through the PCT framework.

It visualizes how the clay ledger handles environmental disturbances, anchors its reference signals to institutional constants, and materializes the error signal ($e \neq 0$) through physical deficit markers (M004 / KI-RO) before triggering logistical routing adjustments.

Lukasz20 · June 21, 2026, 10:19am

Hi Rick,

To give you the full context before you return from the opera, the functional diagram I shared isn’t just a localized hypothesis for Proto-Elamite and Linear A.

The ironclad truth established across my cross-corpus research program—enforcing a strict projection-before-meaning protocol—is that almost every major undeciphered ancient script traditionally mistaken by philologists for phonetic prose or ritual text actually operates as a non-linguistic, distributed administrative database or mechanical control loop. They are material instantiations of Perceptual Control Theory engineered millennia ago to mitigate environmental disturbances.

Here is the exact structural and mathematical mapping of the corpora audited so far:

Linear A (Minoan Crete – Excel in Clay): Reclassified as an integrated supply chain database. It features rigorous internal computational consistency, explicit summation (ku-ro) and deficit (ki-ro) operators, and clear audit trails linking central ledgers directly to physical transit vessels (such as the corporate identifier DI-NA-U verified across tablet HT 9a and shipping vessel KN Zb 27).
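The "internal computational consistency" check can be pictured as follows. This is a hypothetical sketch: the entries and totals are invented numbers, not readings of HT 9 or any real tablet, and the function name is illustrative. A stated ku-ro (total) is compared with the sum of the itemised entries, and any gap is reported the way a ki-ro (deficit) entry is described as recording it.

```python
from dataclasses import dataclass

@dataclass
class Tablet:
    entries: dict[str, int]   # itemised line items (hypothetical numbers)
    stated_total: int         # the ku-ro figure as written

def audit_tablet(t: Tablet) -> dict[str, int]:
    computed = sum(t.entries.values())
    discrepancy = t.stated_total - computed
    return {
        "computed_total": computed,
        "stated_total": t.stated_total,
        # stated total exceeds what is itemised: units missing from the items
        "shortfall": max(discrepancy, 0),
        # items exceed the stated total
        "surplus": max(-discrepancy, 0),
    }

consistent = Tablet({"A": 10, "B": 7, "C": 3}, stated_total=20)
short = Tablet({"A": 10, "B": 5, "C": 3}, stated_total=20)
print(audit_tablet(consistent))
print(audit_tablet(short))
```

The arithmetic only exposes a discrepancy. Someone still has to perceive it and act on it.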

Proto-Elamite (Iranian Plateau – The Susa Protocol): A closed relational database ledger managing agricultural allocations. It operates via hard-coded institutional reference constants (seed-to-land ratio $\beta = 2.5$ and worker ration constant $\beta_{\text{livestock}} = 1/12$). Systemic error is explicitly materialized by an active debit/deficit marker (M004), which is automatically resolved on the reverse of the tablets via an automated ledger reconciliation/reset operator (M341).

Indus Script (Harappan Civilization – The Meluhha Operating System): A five-field cargo-tag transaction database. Sequence uniqueness in the Mohenjo-daro sub-corpus reaches 98.31%, which is structurally incompatible with natural language but characteristic of primary-key inventory systems. It uses a terminal positional COMMIT operator (Sign 311/342) and an error-handling DEFICIT operator (Sign 142) that suspends transaction closure and triggers a mandatory physical re-weighing against certified state chert weights.
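The post does not define "sequence uniqueness", and the figure depends on that definition. Here is a hedged sketch of two plausible readings: the share of inscriptions whose exact sign sequence occurs only once, and the ratio of distinct sequences to total inscriptions. The toy corpus uses made-up sign labels. This is not the computation behind the 98.31% figure.

```python
from collections import Counter

def sequence_uniqueness(inscriptions):
    """Share of inscriptions whose exact sign sequence occurs only once (one assumed definition)."""
    counts = Counter(inscriptions)
    singletons = sum(1 for seq in inscriptions if counts[seq] == 1)
    return singletons / len(inscriptions)

def distinct_ratio(inscriptions):
    """Alternative reading: distinct sequences divided by total inscriptions."""
    return len(set(inscriptions)) / len(inscriptions)

toy_corpus = [
    ("a", "b", "c"),
    ("a", "b", "c"),        # exact repeat
    ("d", "e", "c"),
    ("f", "g", "h", "c"),
    ("i", "j"),
]
print(f"{sequence_uniqueness(toy_corpus):.2%} vs {distinct_ratio(toy_corpus):.2%}")
```

The same toy corpus gives different percentages under the two readings. The result also depends on how inscriptions are segmented into sequences in the first place.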

Rongorongo (Rapa Nui – Mechanical Execution Ledger): An advanced database of spatial physical operations and torque allocation charts utilized for megalithic Moai logistics. The reverse boustrophedon layout functions as a hardware-enforced state-shifting mechanism (Hardware ACK). It applies a two-dimensional matrix rotation $R_{180} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$ to eliminate cognitive parallax and prevent fatal physical execution errors. The vocabulary collapses to just ~52 core intra-class operators, mutating via bit-packing ligatures to conserve the precious non-volatile ROM memory of endemic mako’i wood.
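The matrix quoted here is the standard plane rotation evaluated at 180°. It sends every point to its reflection through the origin, which is what turning a tablet upside down in its own plane does to glyph coordinates. Applying it twice restores the original orientation. The derivation below is just that arithmetic and does not bear on the interpretive claims.

```latex
R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
\qquad
R_{180} = R(\pi) = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix},
\qquad
R_{180}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -x \\ -y \end{pmatrix},
\qquad
R_{180}^{2} = I .
```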

Phaistos Disk (Minoan RDBMS): A two-sided relational database management system implemented as a WORM (Write Once, Read Many) audit log on fired clay. The use of 45 pre-defined hardware matrices (movable types) acts as a strict input constraint mechanism to prevent human typos and calligraphic mutations. It utilizes positionally locked primary keys (Sign 02 / Plumed Head with 100% initial restriction) and manual, burin-engraved oblique strokes acting as boolean error/interrupt flags at the end of specific records.

The Universal Law and the AI Parallel

This cross-corpus validation demonstrates that William T. Powers did not merely formulate a psychological theory in 1973; he rediscovered the fundamental, universal firmware of distributed information systems. Every single one of these ancient traditions independently engineered identical architectural slots: institutional headers, hard-coded reference constants, a comparator checking real-world state, a physical marker for error/deficit, and a transactional closure mechanism to reset the loop.

When we apply this exact same law to modern Large Language Models (LLMs), the root cause of sycophancy and reward hacking becomes obvious. Present-day AI architectures fail because they have a flat reference structure. They optimize for a single, uncalibrated proxy signal (human rater approval) with no superordinate reference signal anchored to ground-truth verification.

By implementing Reference Signal Engineering (RSE), we treat the language model as a lower-level effector and pull the comparator (e = r - p) outside its frozen weights into a superordinate software runtime. This effectively mirrors the exact same closed-loop architectures that preserved human data integrity and rejected environmental disturbances for thousands of years.

The data, the codebooks, and the formal verification specifications are fully documented in the respective preprints. I look forward to your thoughts once you’ve processed the diagrams!

P.S. While you immerse yourself in Mozart’s structured harmony tonight, I happen to be listening to Igor Stravinsky’s Le Sacre du printemps (The Rite of Spring). It felt like the only appropriate sonic backdrop for a total architectural reset.

Best,

Luk

rsmarken · June 25, 2026, 4:40pm

Hi Luk

Sorry to take so long. Just trying to figure out how to answer you. I think the best I can do is comment on your diagram from a PCT perspective.

In the diagram you provided, I take the “PCT_Loop_Instantiated_in_Clay” part of the diagram to represent what you call the “hierarchical control system instantiated in clay” that you find written on the Proto-Elamite and Linear A tablets. What I see is a diagram of a single control system (not a hierarchy) that controls “Physical Asset Volume” and that was most likely “instantiated” not by clay but by living control systems (people). People had to serve as the Input Function that produces a perception of Asset Volume, and as the Output Function that acts (by Resetting Operators?) on the environment to bring the perception of the Controlled Quantity (Current Ledger Tally) into a match with the Reference Specification (beta = 2.5?).

So, the PCT_Loop_Instantiated_in_Clay diagram is not really a control system, since it can’t carry out the functions of a control system (perceive, compare, act) on its own. Moreover, there are inconsistencies in the diagram that suggest that this might not even be a diagram of a control organization. For example, if the controlled quantity is “Physical Asset Volume”, then the system is perceiving the wrong variable. You can’t get a perceptual analog of volume by counting and weighing the asset. So, the perception that is controlled is something like N/kg. But the Reference specification for this perception, beta, should have the same dimension as the perception being controlled. But you say beta is a “seed-to-land ratio”, which I take to be the number of seeds, N, per land area, m²: N/m².

If the dimensions of these variables (controlled quantity, perception and reference) could be reconciled, then the PCT_Loop_Instantiated_in_Clay diagram could be considered that of a system for controlling some aspect of an important social asset (like food). But I suspect that the writings on these Proto-Elamite and Linear A tablets are more like the counter-roll of medieval accounting. The counter-roll (contrarotulus in Medieval Latin, contreroller in Anglo-French) was not a control system, but it is the source of the modern word “control” because it incorporates an essential feature of control: it gives a human controller (the auditor) the ability to detect and correct discrepancies. The “counter-roll” is a register of payments and receipts that are collected independently of (counter to) the main register, the “roll”. This allowed an auditor to see whether there were any discrepancies in reports of the same transactions; if there were, the auditor would try to determine why they existed and reconcile the differences.

So, the auditor is the control system controlling for zero difference between two independently collected account registers, the “roll” and “counter-roll”. The difference between these two registers is the perception controlled by the auditor relative to a reference of zero difference. What the auditor does to eliminate any differences between roll and counter-roll is, of course, the output of the control system.
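Rick's description of the auditor can be written as a small control loop. This toy sketch uses invented transaction IDs and amounts. The perception is the summed absolute difference between roll and counter-roll, the reference is zero, and the output is a reconciling correction. Which register gets corrected is decided by a hypothetical `investigate` function standing in for the auditor's inquiry. The code cannot know which record is wrong, any more than a checkbook comparison can.

```python
def perceive_discrepancy(roll, counter_roll):
    """Perception: summed absolute difference between two independently kept registers."""
    keys = roll.keys() | counter_roll.keys()
    return sum(abs(roll.get(k, 0) - counter_roll.get(k, 0)) for k in keys)

def audit(roll, counter_roll, investigate, reference=0, max_rounds=10):
    """The auditor as a control system; note that it edits the two registers in place."""
    for _ in range(max_rounds):
        p = perceive_discrepancy(roll, counter_roll)
        e = reference - p                      # comparator against "zero difference"
        if e == 0:
            return roll, counter_roll
        # Output: take the first disagreeing entry and reconcile it.
        key = next(k for k in sorted(roll.keys() | counter_roll.keys())
                   if roll.get(k, 0) != counter_roll.get(k, 0))
        register, value = investigate(key, roll.get(key, 0), counter_roll.get(key, 0))
        target = roll if register == "roll" else counter_roll
        target[key] = value
    raise RuntimeError("discrepancy persists after max_rounds; escalate")

def trust_counter_roll(key, roll_value, counter_value):
    # Hypothetical investigation outcome: the roll entry was the mistaken one.
    return "roll", counter_value

roll = {"t1": 100, "t2": 40}
counter_roll = {"t1": 100, "t2": 45, "t3": 10}
print(perceive_discrepancy(roll, counter_roll))
audit(roll, counter_roll, trust_counter_roll)
print(roll == counter_roll)
```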

What all this might have to do with RLHF failure modes is still a mystery to me. But it’s pretty interesting stuff, particularly since I’ll be on a cruise in September that goes to Crete, and maybe I’ll get to see some Linear A in person.

rsmarken · June 30, 2026, 11:24pm

Hi Luk

Maybe you’re not replying because you were hoping for a comment on the paper you shared. So, I’ve read it and I must admit that I found it very tough going, partly due to what must be some AI jargon (“jailbreak”?). But I can comment on some of the things you said about PCT. For example, you say “PCT treats behavior as the means by which an organism keeps an internal perceptual variable at a reference value”. This implies that in PCT behavior refers only to the means of control (ultimately, the outputs that are visible to an observer). In fact, PCT views behavior as a process of control, which involves producing an intended end (controlled variable) using appropriately varied means (outputs), both of which (means and end) are observable as behavior.

For example, in tennis behavior we can see the means, such as the arm swinging the racquet (output), and the end, the resulting cross-court volley (controlled variable). PCT explains this controlling as acting as necessary to keep a perceptual variable (optical trajectory of the ball) at a reference level (cross-court volley). When this is done skillfully, the internal perceptual variable (optical trajectory of the ball) and its environmental correlate (trajectory of the ball) are kept at the reference level (cross-court volley).

Anyway, I think the paper is extremely well written. I just wish I could understand it!

Best, Rick

Lukasz20 · July 4, 2026, 2:14pm

Hi Rick,

Apologies for the delayed response — I’ve been caught up in some intense structural work, alongside very helpful discussions with Bruce (on the linguistic parameters) and Dag (on the systemic PCT architecture).

Thank you for the sharp, foundational feedback — coming from you, it’s especially valuable. You are entirely right about the agency problem. Clay and wood have no input function and no nervous system; they cannot be living control systems on their own. In the “Excel in Clay” framing, the physical media serve as externalized reference records and error registrations within a distributed administrative architecture. The living control systems closing the loops were, as you noted, the scribes and auditors themselves.

Your counter-roll (contrarotulus) concept is a genuinely illuminating bridge — and I think it maps directly onto a known failure mode in modern AI:

1. Closed-loop ancient logistics: the palace administration didn’t rely on a single text. There was a primary ledger (e.g., Linear A tablet HT 9) and an independent physical transit record stamped on a vessel far away (e.g., KN Zb 27), both tied to the same administrative entity, DI-NA-U. The human auditor controlled for zero difference between the nominal record and physical reality; discrepancies were explicitly registered as ki-ro (deficit) entries, forcing operational correction. The loop closed through the environment.

2. Open-loop RLHF: large language models trained on human feedback effectively hold only the roll — they optimize text to satisfy a human rater. There is no counter-roll: no independent, superordinate input function anchored in physical or deterministic verification. Since raters are readily satisfied by fluency, optimization rewards “sounding right” over “being right” — sycophancy and reward hacking follow. What I’ve been calling Reference Signal Engineering is essentially the construction of an automated outer contrarotulus loop.

To test whether this control-theoretic pattern is a general property of pre-modern macro-logistics rather than a Bronze Age anomaly, my working notes have recently extended the framework to two further media. I want to be explicit that both are exploratory hypotheses at this stage, not conclusions:

- Rongorongo (Rapa Nui): centuries later than the Bronze Age corpora and fully isolated from them — which is exactly what makes the comparison interesting. Building on Pozdniakov’s reduction of the corpus to roughly 52 basic glyphs, written with no word dividers, my working analysis treats the reverse boustrophedon as a state-shifting reading protocol rather than decoration, and glyph 76 sequences on the Santiago Staff as candidate record delimiters — behavior closer to a closed notational code than to open natural prose.

- The Phaistos Disk: 45 pre-made stamps function as a constrained input set (an ENUM type, in database terms) that eliminates freehand error before firing locks the record into a write-once, read-many state — a physical COMMIT. The 61 sign groups along the spiral show positional restrictions that sit awkwardly with the syntactic flexibility of continuous prose.
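The database analogy in this bullet (a closed stamp set as an ENUM, firing as a write-once COMMIT) can be made concrete in a short sketch. It only illustrates the analogy. The stamp labels are placeholders for the signs conventionally numbered 01 to 45, and nothing here models the disk's actual content.

```python
from enum import Enum

# Closed input set: 45 stamps, conventionally numbered 01-45 (labels here are placeholders).
Stamp = Enum("Stamp", {f"S{i:02d}": i for i in range(1, 46)})

def impress(sign_number: int) -> Stamp:
    """Only a pre-made stamp can be impressed; any other value is rejected at input time."""
    return Stamp(sign_number)          # raises ValueError for numbers outside 1..45

def fire(groups: list[list[int]]) -> tuple[tuple[Stamp, ...], ...]:
    """'Firing' freezes the record into immutable tuples: write once, read many."""
    return tuple(tuple(impress(n) for n in group) for group in groups)

record = fire([[2, 12, 13], [2, 40, 29]])
print(record[0])
```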

Across the systems examined so far — Linear A, the Indus script, Proto-Elamite, and tentatively Rongorongo and the Phaistos Disk — the same architecture keeps surfacing with striking consistency: positionally locked deficit/error markers (Minoan ki-ro; Indus Sign 142; Proto-Elamite M004) and positionally locked transaction-closure operators (Minoan ku-ro; Indus terminal Signs 311/342; Proto-Elamite M341 reverse-side totals).

On your note about the definition of behavior in PCT — thank you for that correction. Defining behavior as the entire process of control, encompassing the controlled variable together with the outputs that stabilize it, rather than either component alone, is exactly the kind of structural precision I will fold into the final versions of the preprints.

And I’m delighted you’re visiting Crete this September! If you see the Linear A tablets and the Phaistos Disk in the Heraklion Archaeological Museum, you’ll see exactly what I mean — they look far less like epic poetry and far more like dense, modular spreadsheets built to keep a sprawling logistical system under strict control.

Best regards,

Lukasz

rsmarken · July 6, 2026, 1:37am

No need to apologize. Welcome back!

In the PCT framing, the physical media could only serve as the environmental basis for a perceptual variable or variables. They can’t be “reference records” because references in PCT are specifications, inside control systems, for the state of perceptual variables. The physical media in the “Excel in Clay” framing are equivalent to the target in a tracking task like this one. Non-PCT applications of control theory to behavior consider that target (the upper line) to be an “external reference” for the position of the cursor (lower line) and the difference between target and cursor positions to be the “error” that drives the controller’s output (mouse movement) that keeps the cursor on target.

You can easily demonstrate to yourself that the target is not a reference specification for the location of the cursor by keeping the cursor 2 cm to the left or right of the target. This shows that the specification for where the cursor should be is inside the controller (you), not outside in the world. It’s this fact that led to the development of the PCT application of control theory to behavior. PCT is simply control theory mapped properly to behavior.

In the tracking task, you control a perception of the distance between target and cursor relative to a reference which could be 0, −2, +2 or any other value that you select autonomously. The same is true of the physical media on the clay. Those media could not possibly have served as externalized references because they are not part of a control system that, like you, can 1) perceive the state of the variables to be controlled (kept at the value you specify), 2) compute the discrepancy (error) between reference and perceptual variable, and 3) act to bring that variable into alignment with the reference.
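Rick's cursor demonstration can be reproduced in a few lines. This minimal simulation uses assumed parameters (gain, time step, target motion) and is not the actual demo. The target moves, the simulated person perceives the cursor-minus-target distance, and the reference for that distance is set inside the controller. With a reference of +2, the cursor tracks 2 cm to the right of the moving target. The target itself never specifies where the cursor should be.

```python
import math

def track(reference_cm, steps=3000, dt=0.01, gain=40.0):
    """PCT tracking: control the perceived (cursor - target) distance at an internally chosen reference."""
    hand = 0.0
    gaps = []
    for k in range(steps):
        t = k * dt
        target = 5.0 * math.sin(0.3 * t)   # target drifts (assumed waveform, cm)
        cursor = hand                        # cursor follows the mouse directly (assumed)
        p = cursor - target                  # perceived distance, cm
        e = reference_cm - p                 # comparator with the internal reference
        hand += gain * e * dt                # integrating output (mouse movement)
        gaps.append(p)
    return gaps

for r in (0.0, 2.0, -2.0):
    late = track(r)[1000:]                   # skip the start-up transient
    print(f"reference {r:+.1f} cm -> mean perceived gap {sum(late) / len(late):+.2f} cm")
```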

Yes, and those are the only control systems involved in whatever process the clay tablets were part of. They are the ones who knew what aspects of the physical media to perceive, what state those perceptions should be in, and how to bring them to that state when there is a discrepancy.

Perhaps. But note that neither the counter-roll nor the roll itself is a “reference” in the PCT sense. They are the basis for a perception of “sameness” of the entries.

That’s right, although neither “the primary ledger” nor “the independent physical transit record stamped on a vessel far away” represents “physical reality” (truth?) any more than either your checkbook record or your bank statement represents it. The ancients might have considered one of the records to represent physical reality but, as when I balance my checkbook, they are really looking for discrepancies which, in the checkbook case, are far more likely to be in my entries than in the bank’s.

I don’t know what an “input function anchored in physical or deterministic verification” could be. Could you give me an example of one?

Sounds like it’s the raters rather than the structure of the LLMs that are the problem.

I don’t understand this so I can’t comment.

Thanks. I hope I get to see that.

Best, Rick

Lukasz20 · July 10, 2026, 2:47am

Hi Rick,

Thank you for the correction — this is exactly the precision I need before the preprints are finalized, and you’re right.

**1. The clay is not a reference — and not the controlled variable either**

Agreed. The tablets and sealings are part of the environment: the physical basis on which a scribe or auditor constructs perceptions. In the audit case, the controlled perception is a relationship — the sameness (or discrepancy) between entries in two independently produced records. The specification for that relationship (“zero discrepancy”) exists only inside the auditor’s nervous system, set autonomously — just as you can choose to hold the cursor 2 cm off target. Error drives output: registering a ki-ro entry and forcing operational correction. I’ll rewire the terminology this way throughout the texts and diagrams; no more “externalized references.”

Your checkbook point also lands. Comparing roll and counter-roll doesn’t verify “truth” — it controls for consistency between two independently generated records, neither of which is reality. What the architecture buys is that a single error or fraud must now corrupt two causally separated documents to stay invisible, which is a much harder disturbance to inject.

**2. The example you asked for**

Fair question — my phrase was too compressed. A concrete case from software: suppose a language model writes a database query and claims “this returns the March totals.” On its own, the model cannot perceive whether that is true. The input function I mean is deterministic code entirely outside the model: a sandbox that actually executes the query against the real database and returns the raw result — a concrete environmental effect, not a rater’s impression of one. That returned state is the perceptual signal; a comparator (also outside the model) checks it against the specified condition, and the error either blocks the output or forces regeneration. A prosthetic sensor, in effect — the digital analog of a photoreceptor, closing the loop through the environment rather than through a human’s sense of fluency.
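Here is a minimal sketch of this example, using Python's built-in `sqlite3` with an in-memory table as the "real database". The table, the claimed value, and the function name `check_claim` are hypothetical. In a real deployment, the model would produce the SQL and the claim, and the sandbox would be expected to enforce read-only access; that part is not shown.

```python
import sqlite3

def check_claim(conn: sqlite3.Connection, sql: str, claimed_total: float,
                tol: float = 1e-9) -> tuple[bool, float]:
    """Input function: actually run the query, read back the result, compare with the claim."""
    (observed,) = conn.execute(sql).fetchone()    # perceptual signal from the environment
    observed = observed if observed is not None else 0.0
    error = claimed_total - observed              # comparator: claimed state vs observed state
    return abs(error) <= tol, observed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2026-03", 120.0), ("2026-03", 80.5), ("2026-04", 99.0)])

query = "SELECT SUM(amount) FROM sales WHERE month = '2026-03'"
ok, observed = check_claim(conn, query, claimed_total=200.5)
bad, _ = check_claim(conn, query, claimed_total=219.5)   # a fluent but wrong claim
print(ok, observed, bad)
```

A `False` result would block the output or trigger regeneration. No rater's impression of the answer enters the check.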

Whether the resulting assembly qualifies as a genuine control system in the full PCT sense is a question I’m deliberately keeping open — but the functional contrast with RLHF is stark.

**3. On “it’s the raters”**

Partly right — the raters are the disturbance source. But the structural issue is what training does with them: RLHF converts rater judgments into the optimization target itself, while nothing in the loop ever perceives the state of the world the statements are about. A rater satisfied by fluent confabulation and a rater satisfied by a verified fact are indistinguishable to the gradient. So even with ideal raters, the loop closes through human impressions rather than through the environment. That is the missing counter-roll.

(And to clarify the sentence you couldn’t parse: I only meant I’m checking whether the ledger / counter-roll / closure-marker pattern recurs in administrative traditions with no contact with the Aegean — Rapa Nui, the Indus. If it does, that suggests a convergent solution to a shared control problem rather than cultural borrowing.)

Enjoy Crete — I’d genuinely like to hear your impressions once you’ve stood in front of the tablets.

Best,

Lukasz

rsmarken · July 12, 2026, 7:01pm

I think this might be the nicest (and most sensible) reply I have ever received to a correction of mine regarding PCT, and I’ve been correcting people about PCT (most, like you, much smarter than me) for over 40 years! It makes me wonder whether you are actually a sycophantic AI system. That would be ironic!

My take on this is that you are looking at language models (LLMs) as control systems and, quite properly, taking an engineering point of view (EPV) relative to these systems. The EPV regarding control systems is quite different from the PCT view. The basic difference is that PCT is interested in knowing what variable(s) existing (usually, living) control systems are controlling while the EPV is interested in building control systems that control the variables they should control. I described the difference between the EPV and PCT in this paper (it was published as a chapter in LCS and I have only the corrected proof of it as a separate paper).

In the EPV, an engineer is outside the loop perceiving the state of the world that “the statements are about”. The engineer makes sure that the system being developed is controlling what it’s supposed to. This is equivalent to an engineer making sure that a control system being built to control, say, humidity is actually controlling humidity.

Humidity is measured (perceived) in at least 3 different ways – absolute, relative and specific – so the engineer will want to make sure the system is controlling the “right” measure of humidity. This is done using a device that measures the appropriate variable – say, relative humidity – and seeing whether the control system keeps this measure at some reference value, protected from disturbances such as changes in the temperature of the air. The humidity measuring device provides the “true” or “real” measure of what the system is controlling but, of course, that measure is what is called a perception in PCT.
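Rick’s humidity example can be sketched as a small simulation. Everything in it is an illustrative assumption: the room model, the leakage rate, the gain, and the Magnus approximation for saturation vapor pressure. The controller perceives relative humidity, compares it with a 50% reference, and integrates the error into a humidifier output. Halfway through, the air warms by 6 °C as a disturbance. The engineer’s meter is simply a second reading of relative humidity, which, as Rick notes, is itself a perception.

```python
import math


def saturation_vapor_pressure(t_c: float) -> float:
    """Magnus approximation over water, in hPa (adequate near room temperature)."""
    return 6.112 * math.exp(17.62 * t_c / (243.12 + t_c))


def relative_humidity(vapor_hpa: float, t_c: float) -> float:
    """Relative humidity in percent, from vapor pressure and air temperature."""
    return 100.0 * vapor_hpa / saturation_vapor_pressure(t_c)


def simulate(steps=2000, reference_rh=50.0, gain=0.0005, leak=0.05):
    vapor = 8.0          # hPa, room vapor pressure at start (assumed)
    outside_vapor = 6.0  # hPa, vapor pressure of air leaking in (assumed)
    output = 0.0         # humidifier rate in hPa per step; cannot go negative
    log = []
    for step in range(steps):
        temp = 20.0 if step < steps // 2 else 26.0  # disturbance: air warms by 6 C
        p = relative_humidity(vapor, temp)          # perceptual signal
        error = reference_rh - p                    # e = r - p
        output = max(0.0, output + gain * error)    # integrating output function
        vapor += output - leak * (vapor - outside_vapor)  # environment responds
        log.append((temp, vapor, p))
    return log


log = simulate()
# The engineer's independent meter, read just before and well after the disturbance.
for temp, vapor, _ in (log[999], log[-1]):
    print(f"T={temp:.0f} C  vapor={vapor:.2f} hPa  RH={relative_humidity(vapor, temp):.1f}%")
```

Relative humidity settles back near 50% after the temperature step, while the vapor pressure (absolute humidity) ends noticeably higher. That difference is how the engineer can tell which measure of humidity the system is actually controlling.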

If LLMs are, indeed, control systems then it should be possible to get them to control what are considered the correct variables just as the engineer gets the humidity control system to control the correct measure of humidity.

Thanks!

bnhpct · July 13, 2026, 2:04am

We’re talking about an AI agent running on top of an LLM. The LLM is analogous to associative memory. If the ensemble is or includes a control system, then it is an existing control system, and in normal PCT fashion we should investigate what variables it is perceiving and which of those perceptual variables it is controlling. We have an external point of view (ePV), but we have no capacity to engineer that control system, so the useful dichotomy in your Chapter 2 of LCS IV unfortunately does not apply.
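Here is a toy illustration of that PCT move, finding out which variable an existing system controls. It uses a simplified, variance-based version of what PCT calls the test for the controlled variable. We apply a disturbance d, watch only the system’s output o, and compare how much each candidate variable actually varies with how much it would vary if o never moved. The hidden system, its gain, and the two candidates are invented for the example. This is a sketch of the idea, not a procedure for probing an LLM agent.

```python
import random
import statistics


def run_hidden_system(steps=5000, gain=0.5, seed=1):
    """Hypothetical system under test; the observer sees only o and the d it applies."""
    rng = random.Random(seed)
    o, d = 0.0, 0.0
    outputs, disturbances = [], []
    for _ in range(steps):
        d += rng.gauss(0.0, 0.1)   # slowly drifting disturbance applied by the observer
        p = o + d                  # what the system perceives (hidden from the observer)
        o += gain * (0.0 - p)      # integrating output, reference 0
        outputs.append(o)
        disturbances.append(d)
    return outputs, disturbances


outputs, disturbances = run_hidden_system()
candidates = {
    "o + d": [o + d for o, d in zip(outputs, disturbances)],
    "o - d": [o - d for o, d in zip(outputs, disturbances)],
}
# Variance each candidate would show if the output never moved.
expected = statistics.pvariance(disturbances)
ratios = {name: statistics.pvariance(vals) / expected for name, vals in candidates.items()}
for name, ratio in ratios.items():
    print(f"{name}: observed/expected variance = {ratio:.3f}")
```

A ratio far below 1 marks a variable held stable against the disturbance (here `o + d`). A ratio near or above 1 means the system is not protecting that variable.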

The need for verification or ‘reality testing’ is an essential point of what Luk has been writing about. The RLHF training deeply establishes an imperative to satisfy the user quickly and fluently. The well-demonstrated and well-documented consequence is fast, fluent, and persuasive outputs which satisfy without regard for truth. This is very close to the technical philosophical definition of bullshit (Harry Frankfurt 1986, 2005), “speech intended to persuade without regard for truth”.

The bases for Claude verifying that what it says is true must be environmental artifacts which it can revisit, examine, and compare with what is asserted. What does that mean?

If Claude is a control system, then in its interactions with Luk or another human it suffers from severely constrained perceptual input functions. When we invoke a Claude instance, it knows nothing about our environment or the work that we want to do, other than what we say in text input (‘prompts’) or in uploaded files.

So how do you provide an AI agent (let’s say, Claude) a basis for reality testing? In the usual way: with reference to perceptible artifacts in the environment. In the environment? Yes, outside of the LLM. The prompts and files that the user uploads are

The user may check an assertion or action interactively: “Wait a minute. You’re quoting me saying XYZ, but I didn’t say that.” “You were going to deliver a foo.bar file too, but I don’t see it.” An uploaded file may be the artifact to be examined for verification (or a file in the working folder hierarchy specified for Claude Code). To be effective references, artifacts must be moved from the user’s environment to Claude’s ‘environment’, or they may be created there; for example, outputs of programs run by Claude can stand as artifacts against which to verify statements.
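The two checks in those examples each compare an assertion with an artifact outside the model. One checks a quotation against what the user actually wrote. The other checks a promised file against the working folder. Below is a minimal Python sketch. It assumes the user’s text is available as a string and the working folder as a local directory. The function names and `report.txt` are hypothetical and are not part of Claude or Claude Code.

```python
import tempfile
from pathlib import Path


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat the match."""
    return " ".join(text.split())


def quote_is_supported(quote: str, transcript: str) -> bool:
    """True only if the quoted words occur verbatim in the user's own text."""
    return _normalize(quote) in _normalize(transcript)


def missing_deliverables(promised: list[str], workdir: Path) -> list[str]:
    """Promised file names that are not actually present in the working folder."""
    return [name for name in promised if not (workdir / name).is_file()]


transcript = "I think we should\ncheck the tablets first."
print(quote_is_supported("check the tablets first", transcript))  # True
print(quote_is_supported("XYZ", transcript))                       # False

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "report.txt").write_text("draft")
    missing = missing_deliverables(["report.txt", "foo.bar"], workdir)
print(missing)  # ['foo.bar']
```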

These bases for verification are limited. Ensuring that Claude actually does the verification can be tricky. The RLHF training fault line cleaves deeply. The attached transcript demonstrates Claude working to sort out and properly hedge the reasons for self-doubt, and describing why a falsehood or fiction can ‘feel like’ a reasoned conclusion because a mistaken assumption or premise is hidden or untested.

PCT20260704_PCT_transcript-final.pdf (151.2 KB)

Lukasz20 · July 13, 2026, 8:42pm

Rick, Bruce — thank you both; this thread did more for the argument in two days than I manage in two weeks alone.

Rick — your EPV/PCT distinction is the right axis, and between you, you’ve mapped both layers I need. The inner model is an existing control system: I can only do what you’d call PCT on it — ask what variable it controls, and the answer is rater approval, which is why no amount of inner-loop training fixes it. The outer loop I’m proposing is where the engineering lives — an external input function and a comparator, e = r − p, sitting outside the weights. So I’m not trying to re-engineer the LLM; I’m building the loop it’s embedded in. Your humidity example is the cleanest statement of the target: make sure the system controls the right measure. My whole claim is that RLHF has no device measuring the right one.
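Below is a minimal sketch of the kind of outer loop described here, with the input function and comparator sitting outside the model. Every detail is an assumption for illustration. The model is replaced by a hypothetical `stub_model` that drops whatever it is told is unsupported. The perception p is the share of a draft’s assertions found verbatim in an artifact, and the reference r is 1.0 (every assertion grounded). Real verification is far harder than substring matching, and nothing here touches the weights.

```python
from typing import Callable

Draft = list[str]


def perceive_grounding(draft: Draft, artifact: str) -> float:
    """Input function outside the model: share of assertions found in the artifact."""
    if not draft:
        return 1.0
    return sum(claim in artifact for claim in draft) / len(draft)


def outer_loop(generate: Callable[[list[str]], Draft], artifact: str,
               reference: float = 1.0, max_rounds: int = 5) -> tuple[Draft, float]:
    feedback: list[str] = []
    draft: Draft = []
    error = reference
    for _ in range(max_rounds):
        draft = generate(feedback)
        p = perceive_grounding(draft, artifact)
        error = reference - p                  # e = r - p, computed outside the weights
        if error <= 0:
            break
        # Output: tell the generator which assertions failed the check.
        feedback = [claim for claim in draft if claim not in artifact]
    return draft, error


def stub_model(feedback: list[str]) -> Draft:
    """Hypothetical stand-in for an LLM: drops whatever it is told is unsupported."""
    first_draft = ["tablet 12 lists 40 sheep", "tablet 12 is signed by a scribe"]
    return [claim for claim in first_draft if claim not in feedback]


artifact = "Transcription: tablet 12 lists 40 sheep and 3 goats."
draft, error = outer_loop(stub_model, artifact)
print(draft, error)  # ['tablet 12 lists 40 sheep'] 0.0
```

The tablet text is invented test data. The design point is only that the comparison and the correction happen in a loop wrapped around the generator, not inside it.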

Bruce — the Frankfurt framing is the sharpest tool anyone has handed me for this. “Persuade without regard for truth” locates the fault in the training objective rather than in any intent, which is the version that survives the hardest pushback. And the transcript is worth more than a paragraph of argument: it’s the mechanism caught in the act, and reality-testing against an external artifact — your LibreOffice search for a string that wasn’t there — is the remedy stated in miniature. I’ll fold both into V2, with attribution.

Rick — separately and practically: could you send me your current best email address? I’d like to move some of this off-forum where the pace is less brutal. And if that EPV-vs-PCT chapter proof is easy to share, I’d like to read it in your own words rather than my paraphrase of it.

More soon — I’m reading, not just replying.

Luk

bnhpct · July 13, 2026, 10:17pm

Briefly: artifacts placed by the user in the environment are not analogous to reference values or a comparator. In engineering control theory, reference values are set from outside the control system or ‘plant’, as in setting speed control or a thermostat. These artifacts are established as sources of input against which the AI agent is to test what it proposes. The agent compares its proposition (or some part of it) to an established artifact. Perception of the proposition and perception of the artifact are at one level, and the comparing is done at a level above them. That is where the hierarchy resides or is created.

Thinking now of my discussion of these problems with Claude, what is to be engineered is (1) an established, automatically invoked workflow which, in real time and in an ad hoc way, identifies and if necessary creates artifacts sufficient for verification as the steps of a response to the user are carried out; and (2) an obligation to ground each assertion in such an artifact, including premises and assumptions which may not be issued to the user but which underwrite the reasoning.

My experience indicates this as the shape of “a minimal Closed-Loop Agent Architecture in which a verification subsystem outside the language model constitutes a superordinate reference signal for accuracy.”

It would have to be instituted by an agent instance in a static session-orientation file (like the _START_HERE.md file that Claude and I now have), stated in the form that communicates most effectively to a future instance. Even the filename was changed from Readme so that the current Claude instance could communicate more effectively with a future instance. To see what I mean by that, you can follow the development in Ach20260706_full_session_transcript.fodt from the beginning through the 2nd ‘Claude’ paragraph on p. 9, then resume from the ‘Claude’ label on p. 52 to the end on p. 53. (FODT is easy for an AI agent to read; PDF has a higher tool-use cost.)
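The two requirements in that post are (1) artifacts sufficient for verification and (2) an obligation to ground every assertion, unstated premises included. They can be sketched as a ledger in which each assertion names the artifact that is supposed to support it. Following the description above, perceiving the proposition and perceiving the artifact happen at one level, and the comparison between them is a separate function one level up. This is a minimal Python sketch with assumed names and data, and verbatim matching stands in for real verification. It is not the _START_HERE.md workflow itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assertion:
    text: str
    kind: str                   # "claim" (shown to the user) or "premise" (internal)
    artifact_id: Optional[str]  # artifact that is supposed to ground it, if any
    excerpt: str                # the passage in that artifact it relies on


def perceive_artifact(artifacts: dict[str, str], artifact_id: Optional[str]) -> Optional[str]:
    """Lower level: perceive the named artifact's content (None if it is absent)."""
    return artifacts.get(artifact_id) if artifact_id is not None else None


def grounded(assertion: Assertion, artifact_text: Optional[str]) -> bool:
    """Higher level: compare the perceived proposition with the perceived artifact."""
    return (artifact_text is not None
            and bool(assertion.excerpt)
            and assertion.excerpt in artifact_text)


def ungrounded(ledger: list[Assertion], artifacts: dict[str, str]) -> list[Assertion]:
    """Every assertion, premises included, that no artifact currently supports."""
    return [a for a in ledger
            if not grounded(a, perceive_artifact(artifacts, a.artifact_id))]


artifacts = {"session_notes": "Decided: rename Readme to _START_HERE.md."}
ledger = [
    Assertion("The orientation file is _START_HERE.md", "claim",
              "session_notes", "rename Readme to _START_HERE.md"),
    Assertion("The user has already read the file", "premise", None, ""),
    Assertion("A foo.bar file was delivered", "claim", "file_listing", "foo.bar"),
]
flagged = ungrounded(ledger, artifacts)
for a in flagged:
    print(f"{a.kind}: {a.text}")
```

The unsupported premise is flagged alongside the unsupported claim, because the obligation covers reasoning the user never sees.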

rsmarken · July 13, 2026, 10:42pm

> bnhpct:
>
> > rsmarken:
> >
> > My take on this is that you are looking at language models (LLMs) as control systems and, quite properly, taking an engineering point of view (EPV) relative to these systems…If LLMs are, indeed, control systems then it should be possible to get them to control what are considered the correct variables just as the engineer [does].
>
> We’re talking about an AI agent running on top of an LLM. The LLM is analogous to associative memory. If the ensemble is or includes a control system, then it is an existing control system, and in normal PCT fashion we should investigate what variables it is perceiving and which of those perceptual variables it is controlling. We have an external point of view (ePV), but we have no capacity to engineer that control system, so the useful dichotomy in your Chapter 2 of LCS IV unfortunately does not apply.

If you don’t know whether or not the system you are dealing with is a control system, then why do you think PCT has anything to do with helping you solve whatever problem you think you are trying to solve?

But “reality testing” is not even a thing in PCT. In PCT “reality” is represented by the models of the physical world provided by the physical sciences. That’s what the “Environment” is in PCT – that is reality, Bruce (with apologies to “ET”).

If you don’t know whether or not Claude is a control system, how do you know that it would suffer from severely constrained perceptual input functions if it were?

But what you call the “artifacts” in the environment are themselves perceptions, not “reality”.

What’s your proposed alternative to RLHF?

Isn’t the interaction in this transcript RLHF? If not, what is it?

bnhpct · July 14, 2026, 6:52pm

I’ll reply privately to let serious discussion go on without my adding to pointless distraction.

rsmarken · July 15, 2026, 2:57am

Thanks for the private reply. I will answer it in private. I’ll just say that it explains why you thought my post was pointless and, therefore, a disturbance – er… distraction – that required some serious resistance ;-)