ChatGPT Experiments: Autoregressive Large Language Models (AR-LLMs) And The Limits Of Reasoning As Structured Summarization

As we continue to expand our rapidly growing archive of ChatGPT experiments, we can see that, as with Whisper, the breathless hype and hyperbole of the tech community gives way to a far more nuanced and limited reality of what these models can actually do. Autoregressive Large Language Models (AR-LLMs) like ChatGPT offer at first glance what appears to be human-like reasoning: responding correctly to intricately complex and nuanced instructions and inputs, composing original works on demand, and writing with a mastery of the world's languages unmatched by most native speakers, all while showing glimmers of emotional reaction and empathic cue detection. The harsh reality is that this is merely an illusion, coupled with our own cognitive bias towards extrapolating from and anthropomorphizing the small cues these algorithms emphasize, while ignoring the implications of their failures.

OpenAI's Whisper offers a classic example of this bias. Within the early adopter tech community, its human-like fluency was mistaken for a new bar in accuracy, rather than recognized as the pairing of traditional deterministic ASR with a non-deterministic generative rewriter. That pairing did not represent an immense leap in ASR accuracy so much as it created a new interface that made the output appear that much more accurate, much as a glossy, beautiful user interface can make even the most error-riddled application appear pristine. Yet, even as the tech community rushed Whisper into production in myriad applications, few paid even lip service to the implications of its novel architecture, and fewer still ran the kinds of extensive real-world experiments needed to demonstrate the immense dangers of generative output.

In many ways, Whisper's success mirrors our own work applying what were then considered large language models to OCR a decade and a half ago. Simply pairing existing OCR systems with the statistical LLMs of the day transformed their error-riddled output into near-human accuracy, even on severely degraded texts where large portions of the scanned image were illegible or missing entirely: a handful of well-formed words served as "anchors" and the rest of the text was interpolated by the generative, purely statistical LLMs of the time. The results were nothing short of groundbreaking, far outperforming even the best human scholars and capable of OCRing texts for which whole portions of the page were lost. Yet, rather than rushing the tool into autonomous use, we discovered a key danger: while the underlying models were thematically and temporally anchored (trained on specific genres and time periods to encode the distinct language and structure of that content), in the end the system wasn't improving the OCR process, it was merely papering over the output with beautiful generated prose "inspired" by the OCR. In essence, it was the equivalent of handing a historian a book page with just a handful of randomly selected words scattered across it and asking them to fill in the missing text based on their understanding of the topic and time period: the results may flawlessly match the original material, or they may differ in existential ways that entirely change the meaning of the text.
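To make the anchoring idea concrete, the toy sketch below shows the general shape of the approach. It is purely illustrative and not our original pipeline: a tiny hand-built bigram table stands in for the genre- and period-specific statistical LLMs described above, and each illegible token is filled greedily from the preceding legible anchor word.

```python
# Toy sketch of "anchor and interpolate": a handful of legible words anchor a
# statistical language model that fills in the illegible remainder. The bigram
# table below is a hypothetical stand-in for the genre- and period-specific
# models described above.

BIGRAMS = {
    "the": {"ship": 12, "king": 9, "harvest": 4},
    "ship": {"sailed": 10, "sank": 3},
    "sailed": {"from": 8, "at": 5},
    "from": {"boston": 6, "london": 4},
}

def interpolate(tokens):
    """Replace illegible tokens (None) with the most probable continuation
    of the preceding legible "anchor" word."""
    filled = []
    for token in tokens:
        if token is not None:
            filled.append(token)
            continue
        previous = filled[-1] if filled else None
        candidates = BIGRAMS.get(previous, {})
        # If the model has no evidence at all, leave the gap visible.
        filled.append(max(candidates, key=candidates.get) if candidates else "[illegible]")
    return filled

# A damaged scan in which only three anchor words survived.
print(interpolate(["the", None, None, "from", None]))
# -> ['the', 'ship', 'sailed', 'from', 'boston']  (plausible, but unverifiable)
```

The danger described above is visible even in this toy: the interpolated words are plausible continuations of the anchors, but nothing guarantees they match what the damaged page actually said.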

This is what we find with Whisper: run multiple times over an input video, it doesn't just change its output in small ways. The entire meaning of the text can change from run to run, with whole passages disappearing, repeating, or giving way to complete and utter hallucination. In the case of our OCR work a decade and a half ago, we ultimately deployed it as a human assistant, providing a recommended transcription of a page that a human SME could then review. In particular, we offered a "conservativeness vs. recovery" dial: for the typical use case of OCRing ordinary material, the models were extremely conservative, requiring maximal anchoring for their interpolations. Only for the most heavily damaged texts, where no remaining evidence of the original survives, would the dial be turned all the way to "recovery" mode, in which the model essentially wrote its own text to backfill the page, suggesting to the historian possible ideas of what might have been written there as starting points they could then verify against other sources.
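That run-to-run drift is straightforward to observe for yourself. The sketch below is a minimal illustration, assuming the open-source openai-whisper package and a placeholder audio file name; the temperature parameter here is only a loose analogue of the "conservativeness vs. recovery" dial, controlling how freely the decoder samples rather than always taking the most probable token.

```python
# Sketch: observing run-to-run drift in Whisper on the same audio file.
# Assumes the open-source openai-whisper package and a placeholder file name
# ("broadcast.mp3"); not a specific production pipeline.

import difflib

import whisper

model = whisper.load_model("base")

transcripts = []
for run in range(3):
    # A non-zero temperature samples from the decoder rather than always taking
    # the most probable token, making hallucination-driven drift easier to see.
    result = model.transcribe("broadcast.mp3", temperature=0.4)
    transcripts.append(result["text"])

# Identical audio, three potentially different transcripts.
for i, text in enumerate(transcripts[1:], start=2):
    ratio = difflib.SequenceMatcher(None, transcripts[0], text).ratio()
    print(f"run 1 vs run {i}: textual similarity = {ratio:.2f}")
```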

While those early efforts relied on statistical LLMs, today's neural LLMs suffer from precisely the same dangers.

Autoregressive LLMs like ChatGPT appear at first glance to be capable of advanced abstract reasoning, emotional response and creativity. Yet a closer look demonstrates that, like our OCR efforts all those years ago, language is so predictable when modeled at scale that simply anchoring a prompt and then extrapolating forward based purely on token statistics can yield mesmerizing results, results that say far more about the limits of our creativity and the mutual-intelligibility requirements of communication than they do about the capabilities of AI.

In essence, today's LLMs like ChatGPT approach all problems as a kind of surface summarization / token distillation task, in which the prompt becomes an anchor, the knowledge store becomes the encoding of the statistical patterns of language, and the "reasoning" performed by the model becomes simply the structured summarization of the prompt. A request to codify text into structured form looks far less like the output of a guided extractor, or even the minimal output of a dependency tree-based extractor, and more like the most similar clauses extracted, glued together and rewritten, with a strong tendency towards over-verbosity.

Moreover, the specific manner in which AR-LLMs like ChatGPT function means they suffer from existential attention loss, fixation, hallucination, constant confidence and a wealth of other challenges that make them particularly ill-suited to the domain of news. While summarization (distilling a long passage of text into a shorter one) is a common news task that LLMs are good at, their attention loss, fixation and hallucination pose challenges even for this most basic of tasks and, in the absence of meaningful confidence metrics, make LLMs difficult to employ in mission-critical domains. Worse, when Q&A, codification and look-across tasks are instantiated as surface rather than semantic distillation tasks, they become especially brittle, non-deterministic, sensitive to prompt-text mismatch and hyper-prone to unpredictable catastrophic failure.

If we think of AR-LLMs as summarization tools and their prompts not as instructions to an intelligent agent, but rather as anchors for that summarization task, we can be far more effective in employing these tools, much as we have learned to leverage the limitations of keyword search to "guide" search engines towards the specific results we are after. In the more than half-century since commercial keyword search became available, we have taught generations of users how to think and express their queries in terms of linguistic anchors that best align with the information they are seeking. In similar fashion, treating AR-LLM prompts as summarization anchors rather than elaborate instructions helps guide their outputs by better "embedding" the summarization task within both the model's knowledge store and the scope of the task represented by the prompt.
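As a concrete illustration of the difference, the sketch below contrasts an instruction-style prompt with an anchor-style prompt for the same hypothetical codification task. The article text, prompt wording and model name are all placeholders rather than prompts drawn from our experiments; the call itself simply uses the standard OpenAI Python client.

```python
# Sketch: the same codification request phrased as an instruction vs. as a
# summarization anchor. The article text, prompts and model name below are
# illustrative placeholders, not examples drawn from our experiments.

from openai import OpenAI  # assumes the standard OpenAI Python client

ARTICLE = "(full text of a news article would go here)"

# Instruction-style: addresses the model as if it were a reasoning agent.
instruction_prompt = (
    "Read the following article carefully, decide whether it describes a "
    "protest, and explain your reasoning step by step.\n\n" + ARTICLE
)

# Anchor-style: embeds the desired output shape directly in the prompt, so the
# completion is pulled toward the target vocabulary and structure.
anchor_prompt = (
    ARTICLE
    + "\n\nProtest described above, codified as 'location | participants | "
      "outcome':"
)

client = OpenAI()
for prompt in (instruction_prompt, anchor_prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```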

AR-LLMs are exceptionally well-aligned with creative tasks like text generation due to this anchoring process: a prompt to ChatGPT asking it to write a particular story in a particular style essentially anchors its summarization task in those regions of its statistical representation of language, yielding generated text that appears to eerily follow the user's request, much as traditional embeddings appear to eerily understand that a "nanochip" is the same as a "microchip" in a specific context. The most appropriate application of AR-LLMs today is as the textual complement of image generators. Much as tools like DALL-E, Stable Diffusion and their ilk are visual creative assistants, ChatGPT at its best is a textual creative assistant, automating the creation of prose for different tasks.
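That "eerie" contextual matching is easy to reproduce with any off-the-shelf embedding model. The sketch below is illustrative only, assuming the sentence-transformers package and a small public model; the sentences are invented for the nanochip/microchip example.

```python
# Sketch: contextual similarity in embedding space, illustrating the
# nanochip/microchip point. Assumes the sentence-transformers package and a
# small public model; the sentences are invented for this illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The plant unveiled its new nanochip fabrication line.",
    "The plant unveiled its new microchip fabrication line.",
    "The plant unveiled its new cafeteria menu.",
]
embeddings = model.encode(sentences)

# The first two sentences land far closer together in embedding space than
# either does to the third, despite "nanochip" and "microchip" differing.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```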

As we look to the kinds of challenges that are of the greatest priority for global news analysis, few are well-aligned with this kind of "structured surface summarization" (3 S's) approach to reasoning. With prompt engineering that better aligns instructions with this kind of execution model, it will be possible to better leverage the strengths of models like ChatGPT, but ultimately, fundamentally new approaches to knowledge encoding and reasoning will be required for news analysis.

For now, though, based on our own growing library of experiments, the best answers may lie not in the "flashy new thing" of AR-LLMs, but in a return to simpler tools that leverage generative models only to rewrite their outputs, much as we did a generation ago with OCR.