ChatGPT Experiments: The Limitations Of Structured Summarization Versus Codification & Realtime Versus "Known" Domains

As we've been exploring ways Large Language Models (LLMs) like ChatGPT can help us better make sense of the vast realtime landscape of news, our growing body of experiments suggests several major limitations of current-generation LLMs that complicate their use:

  • Small Input, Large Output. The first is the purely technical limitation that current LLMs are designed for small-in/large-out operation, in which a small prompt yields unbounded output: a single-sentence prompt can generate a 1,000-page manuscript without issue. In contrast, news tasks typically involve distillation and codification, in which the purpose of the LLM is to examine extremely large inputs and distill them down into very small outputs. For example, summarizing all of the news coverage about a breaking event, constructing a live catalog of protests, or answering questions over an entire industry: all of these tasks involve the LLM consuming potentially millions of words of content in realtime to generate small, concise outputs. While LLMs inherently distill vast volumes of training data, current-generation LLMs do not permit the kind of realtime knowledge-store updates these tasks require. Instead, most current production systems work around this limit by keeping the full knowledge store external to the model and using directed search (keywords, embeddings, etc.) to identify the relevant textual passages that fit within the model's input limits (a minimal sketch of this retrieval workaround appears after this list). In contrast, news distillation tasks often require lossless preservation of minute detail, which is typically lost in both textual selection and cascading-summarization workarounds. New LLM architectures that permit unbounded input will be required.
  • Non-Deterministic Output. The nondeterminism of LLMs is their most desired trait for many kinds of generative tasks, but in many news codification tasks this nondeterminism conflicts with the need to codify consistently at scale using a common template. For example, a workflow designed to mine global news media in realtime for protests and record them in a common database must extract every protest description in precisely the same way, with the same fields, the same definitions and the same structure. While prompts can be used to successively constrain LLM output, even the most restrictive prompts struggle to rigorously enforce the templated extraction needed for fully unsupervised codification (a schema-constrained extraction sketch appears after this list).
  • Regurgitation Versus Reasoning. The most impressive examples from ChatGPT thus far have been those for which there is considerable overlap between the task and ChatGPT's training set. For example, asking it questions that follow a structure commonly found on the web, about facts that are widely discussed on the web, yields human-like results. In contrast, asking ChatGPT to analyze novel content that is highly disjoint from the factual universe of its 2021 training snapshot, or to perform tasks whose outputs deviate substantially from what is found widely on the web, yields noticeably degraded results. In essence, LLMs are extremely good at codifying the collective patterns of language found across their training datasets and then layering prompts onto those patterns, but as the prompt and domain become more and more disjoint from the training dataset, the outputs become far less desirable. This is a very real limitation of LLMs: far from truly synthesizing and "understanding" content, they are essentially pattern matchers and generators. In many ways, the strengths and weaknesses visible in ChatGPT's output are nearly identical to those of the "large" (by the standards of the era) language models we used for distillation and reasoning tasks a quarter-century ago, in which we similarly crawled the open web and fed the resulting massive text archives into immense statistical models running on supercomputers that could perform similar kinds of summarization, generation, codification and Q&A tasks by aligning the input prompts and tasks to the language graphs they had generated. Asking those models to answer a question for which there were many examples on the web would yield near-human results, while asking them to reason over entirely novel domains would position the task within what would today be called an embedding space and draw from their induced knowledge stores of similar content, yielding poor results because the models relied only on their understanding of language, rather than concepts: a limitation ChatGPT still suffers from a quarter-century later. Our large language models of a decade and a half ago could likewise generate fluent prose when correcting OCR errors, by using confidently recognized words and letters as anchors and drawing from a thematically and temporally defined massive language model to guide reconstruction of fluent prose to fill in the holes. In essence, LLMs will achieve their best results on the kinds of common questions typically posed to "assistants" and simple web searches, while the kinds of novel reasoning tasks required of news analysis pose a much greater challenge.
  • Encoding Language Not Concepts. Current LLMs encode the statistical patterns of language, rather than a working model of the concepts that language represents. They may encode that "infection" appears the majority of the time near disease-related terms and is often associated with a numeric quantity. In this way, the model can appear to "understand" text when asked about the spread of a disease. Yet when it encounters "cases" instead of "infections", a term for which its internal model has a weaker statistical association, it can fail to connect the two (a toy co-occurrence sketch after this list makes this concrete). Importantly, reasoning tasks that can be achieved purely through linguistic manipulation can typically be answered reasonably well, while those that require more than mere associative closure will typically fail. This is why LLMs perform exceptionally well on the bounded task of summarization, while robust codification eludes them.
  • Structured Summarization Versus Codification. In practice, the outputs of current LLMs like ChatGPT resemble what might be called "guided structured summarization" rather than codification. In essence, the LLM effectively treats each task as a summarization problem, using the prompt to guide its summarization and allowing that summary to take arbitrary structure beyond simple text output; even code generation is, in this sense, a summarization task. Codification is inherently a distillation problem and thus should be well-suited to the summarization approach of LLMs, but robust codification requires semantically understanding the concepts conveyed by the text, rather than relying on mere statistical correlations among tokens.
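
To make the retrieval workaround described under "Small Input, Large Output" concrete, the sketch below shows the general shape of the approach: chunk an archive of articles, embed each chunk, and hand the model only the top-scoring passages that fit within its input limit. This is a minimal illustration under assumed parameters, not a production pipeline; the embed() function is a deterministic stand-in for a real embedding model, and constants like MAX_CONTEXT_TOKENS are illustrative assumptions.

```python
# Minimal sketch of the "directed search" workaround: rather than feeding an
# entire news archive to the model (impossible under current input limits),
# embed each passage and keep only the top-scoring passages for the prompt.
# NOTE: embed() is a hash-seeded stand-in for a real embedding model and all
# constants are illustrative assumptions, not measured values.

import hashlib
import numpy as np

MAX_CONTEXT_TOKENS = 4000   # assumed model input budget
TOKENS_PER_WORD = 1.3       # rough budgeting heuristic

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def chunk(article: str, words_per_chunk: int = 200) -> list[str]:
    """Split an article into fixed-size word windows."""
    words = article.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def select_passages(archive: list[str], query: str) -> list[str]:
    """Return the passages most similar to the query that fit the input budget."""
    passages = [p for article in archive for p in chunk(article)]
    q = embed(query)
    scored = sorted(passages, key=lambda p: float(np.dot(embed(p), q)), reverse=True)
    selected, budget = [], MAX_CONTEXT_TOKENS
    for p in scored:
        cost = int(len(p.split()) * TOKENS_PER_WORD)
        if cost > budget:
            break
        selected.append(p)
        budget -= cost
    return selected

# The selected passages (not the full archive) are then prepended to the prompt.
# Any detail falling outside the selected passages is invisible to the model,
# which is precisely the lossiness described in the bullet above.
```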
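
The "Non-Deterministic Output" bullet notes that prompts alone struggle to enforce templated extraction. One partial mitigation is to pin extraction to a fixed schema and reject any model output that does not conform, sketched below. The call_llm() function is a hypothetical stand-in for whatever model API is used, and the field names of the protest record are illustrative, not an actual production schema.

```python
# Minimal sketch of schema-constrained extraction: every protest mention must
# be codified into the same fields, so the prompt demands strict JSON and the
# response is validated before it is allowed into the database. call_llm() is
# a hypothetical placeholder; the field list is illustrative only.

import json

PROTEST_SCHEMA = {
    "location": str,        # city/region where the protest occurred
    "date": str,            # ISO 8601 date of the event
    "estimated_size": int,  # reported crowd size (0 if unreported)
    "cause": str,           # short phrase describing the grievance
}

PROMPT_TEMPLATE = (
    "Extract every protest described in the article below as a JSON array. "
    "Each element must contain exactly these keys: "
    + ", ".join(PROTEST_SCHEMA) +
    ". Output JSON only, with no commentary.\n\nARTICLE:\n{article}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def validate(record: dict) -> bool:
    """Accept only records that match the template exactly."""
    return (set(record) == set(PROTEST_SCHEMA)
            and all(isinstance(record[k], t) for k, t in PROTEST_SCHEMA.items()))

def extract_protests(article: str) -> list[dict]:
    """Return validated protest records, discarding malformed output."""
    raw = call_llm(PROMPT_TEMPLATE.format(article=article))
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model ignored the template entirely
    return [r for r in candidates if isinstance(r, dict) and validate(r)]
```

Even with such validation, the model may still describe the same protest differently across runs (different phrasing of the cause, different size estimates), which is why post-hoc checks alone do not deliver fully unsupervised codification.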
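
Finally, to make the "Encoding Language Not Concepts" point concrete, the toy sketch below builds a pure co-occurrence model from a tiny invented corpus: it "knows" that "infection" travels with outbreak vocabulary because the corpus says so, but has no concept of disease that would let it treat "cases" the same way. This is a deliberately simplified stand-in for the far richer statistics an LLM encodes, intended only to illustrate the gap between linguistic association and conceptual understanding.

```python
# Toy illustration of association-only "understanding": a co-occurrence model
# links "infection" to disease vocabulary only because the corpus supplies the
# co-occurrence, and has no conceptual bridge to the synonym-like "cases".

from collections import Counter
from itertools import combinations

corpus = [
    "the infection spread rapidly as the outbreak worsened",
    "officials reported the infection count rose during the outbreak",
    "the city logged new cases this week",
]

cooccur = Counter()
for sentence in corpus:
    for a, b in combinations(set(sentence.split()), 2):
        cooccur[frozenset((a, b))] += 1

def association(w1: str, w2: str) -> int:
    """How often two words appeared in the same sentence."""
    return cooccur[frozenset((w1, w2))]

print(association("infection", "outbreak"))  # 2: strong statistical link
print(association("cases", "outbreak"))      # 0: no link, despite the same concept
```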