PCT and RLHF Failure Modes — sharing a recent paper for discussion

Hello everyone,

I’d like to introduce myself properly, having been visible around the edges of this community for a little while now.

My name is Łukasz Diener. I’m an independent researcher based in Kraków, Poland, working on PCT and its intersections with AI alignment, robotics, and the analysis of undeciphered ancient scripts. Over the past few weeks I’ve been in close contact with Dag Forssell and Bruce Nevin, who have been generous with their time and helped me sharpen the arguments before publication.

On 19 May 2026 I published a paper on Zenodo:

Perceptual Control as the Epistemological Antidote to RLHF Reward Hacking: Seven Frontier Models Diagnose Their Own Architecture

DOI: 10.5281/zenodo.20277919

The core argument: many of the documented failure modes of large language models trained with RLHF — reward hacking, sycophancy, confident confabulation, verbosity bias — are not bugs to be patched, but predictable consequences of optimising outputs rather than controlling perceptions. The paper treats RLHF as an output-optimisation architecture and contrasts it with PCT’s input-control architecture (the familiar e = r − p loop), arguing that the latter offers a structural rather than cosmetic path forward.
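For readers less familiar with the loop, here is a minimal sketch of a single-level e = r − p controller. It is only an illustration, not code from the paper: the function name `run_control_loop`, the gain, the time step, the additive disturbance, and the integrating output function are all assumptions chosen to keep the example small.

```python
def run_control_loop(reference, disturbances, gain=5.0, dt=0.1):
    """Simulate one control loop: p = o + d, e = r - p, o += gain * e * dt.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    output = 0.0
    perceptions = []
    for d in disturbances:
        perception = output + d          # the environment adds a disturbance to the output
        error = reference - perception   # e = r - p
        output += gain * error * dt      # integrating output function (assumed form)
        perceptions.append(perception)
    return perceptions


# A step disturbance appears halfway through the run.
disturbances = [0.0] * 100 + [3.0] * 100
trace = run_control_loop(reference=10.0, disturbances=disturbances)
```

In this toy setup the perception settles at the reference, is knocked away when the disturbance arrives, and is then brought back, with the output changing to whatever value is needed. The loop specifies the intended perception, not the output.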

I’d be very interested in the community’s reactions — particularly from those of you who have thought carefully about the relationship between PCT and machine learning. Where does the argument hold? Where does it strain? What would you push back on?

The paper sits within a broader portal I maintain at perceptualcontroltheory.org, which is an independent, non-commercial knowledge base. Two more papers in the Excel in Clay series (on Proto-Elamite and Linear A as administrative systems analysed through PCT-derived structural audits) are out or forthcoming, and I’ll share those separately if there’s interest.

Looking forward to the discussion.

Warm regards,

Łukasz Diener

ORCID: 0009-0006-6103-8514

Welcome, Luk, and my apologies for the 48-hour delay before your post appeared. Because of some recent spam attempts, a switch was set to require approval of posts from new accounts. Any admin or moderator could have approved it, but I’ve been distracted by other matters.

RLHF puts the rewards-and-punishments business in the crosshairs. The trainers offer carrots and apply sticks to an agent as a means of controlling a perception of the agent behaving as desired. It’s not a very effective means of controlling that perception, but they are controlling. One reason it’s not effective is that the carrots and sticks themselves come under the subject agent’s control, and that usually has perverse consequences. There were lots of examples, starting with the rats of Hanoi, in that proposed paper that we reviewed, Warren. The perverse behavior of LLM-based agents is right in that ballpark.

Rewards and punishments influence what you pay attention to and what you control. Reorganization can start making changes until the punishments diminish (or until the rewards resume or increase). Under extremely punishing laboratory conditions, such influence on reference values, gain, input functions, and associative memory has been construed as ‘prediction and control of behavior’. It’s the reason that capitalism requires inequality and financial insecurity for the majority, as anyone can attest who has put up with a job for the benefits.

My understanding is that LLMs are at base an implementation of associative memory which the interactive ‘AI’ agent accesses.
