While impressive, current state-of-the-art Large Language Models (LLMs) still face numerous challenges when applied to global realtime news coverage. LLMs easily lose attention, and their performance is extremely sensitive to how well the writing style of the source material matches the prompts posed to the model. When source material and prompts are well aligned, the results can achieve human-like fluency; with more complex and detailed material, spoken-word transcripts, and mismatched writing styles, the results can be far poorer. Attention loss is a common challenge, while hallucination occurs readily and without easily accessible cues, standardized detection tools, or workarounds. In short, the same prompt can yield flawless results on one article and hallucinate a completely unrelated result on another, without any standardized indicators of confidence, drift, or attention loss. Current-generation LLMs express consistent confidence, meaning that even under conditions of uncertainty or total hallucination they will yield strongly confident results with authoritative structure. Despite their immense training datasets, they can also be swayed by linguistic differences and turns of phrase, including terms common to specific fields but less common on the open web.
A unique and poorly understood consideration is the strong overlap between global news content and societal fault lines, and the importance in many use cases of encoding what a specific government or organization is saying verbatim (encoding its harmful narratives exactly as stated) in order to identify the emergence and sources of misinformation, false and harmful narratives, and the like. Special consideration may need to be given to the confounding impact of the content moderation and "conversational health" filtering increasingly applied to public-access models, which may require specialized access pathways or customized LLMs. For example, analysis of Iranian state media requires that the underlying LLM not rephrase or exclude Iranian state narratives about women's role in society that deviate sharply from accepted Western norms, since the very intent of examining Iranian media is to codify how the state frames its systematic repression of women's societal roles and rights. One approach that has shown success with previous generations of moderated LLMs is adaptive adversarial preprocessing that replaces sensitive terminology with terms the model does not filter, with confounding metrics (the genuine presence of those same terms in the article) used to select among mapping schemas, and constant canary filtering used to adapt the mappings over time. In some cases adversarial probing is required to construct such a remapping, while in others the model itself can be asked directly to construct one. Such work is highly experimental at this stage and will require further discussion with LLM vendors as such filtering expands dramatically with the rise of public access to, and publicity around, LLMs.
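The remapping-with-canary-filtering idea above can be sketched in miniature. This is a hypothetical illustration, not a production workflow: the term list, placeholder tokens, and leak check are all invented here for clarity.

```python
# Hypothetical term-remapping sketch: sensitive terms are swapped for
# neutral placeholder tokens before prompting, then restored in the
# model's output. The terms and placeholders below are illustrative only.
REMAP = {
    "term_a": "ALPHA_TOKEN",
    "term_b": "BETA_TOKEN",
}

def encode(text: str, mapping: dict[str, str]) -> str:
    """Replace each sensitive term with its placeholder before prompting."""
    for term, placeholder in mapping.items():
        text = text.replace(term, placeholder)
    return text

def decode(text: str, mapping: dict[str, str]) -> str:
    """Restore the original terms in the model's response."""
    for term, placeholder in mapping.items():
        text = text.replace(placeholder, term)
    return text

def placeholders_leaked(source: str, mapping: dict[str, str]) -> bool:
    """Canary-style check: if a placeholder already occurs in the source
    text, this mapping schema would be ambiguous on decode and a
    different schema should be selected instead."""
    return any(placeholder in source for placeholder in mapping.values())
```

In practice the mapping schema would be chosen per article, using the leak check (and the genuine presence of the sensitive terms themselves) to decide which schema is safe to apply.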
Current LLMs are not yet at the stage where the exact same template can be applied uniformly to an infinite pipeline of articles in a fully unsupervised stream processing workflow. Instead, templates work on some articles and fail on others, with success depending heavily on the alignment between text and prompt. At the same time, this was historically the same challenge faced by the earliest generations of grammar-based codification systems, which evolved over years and decades to match more and more of the contexts they encountered. It is likely that through a combination of prompt engineering and the incorporation of a library of tailored prompts customized for different textual makeups, LLM workflows can be sufficiently refined to work around these limitations.
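A library of tailored prompts keyed to textual makeup might look like the following minimal sketch. The classifier heuristic, category names, and templates are all assumptions for illustration; a real system would use far richer features.

```python
# Hypothetical tailored-prompt library: a crude heuristic on the article's
# textual makeup selects which prompt template to apply. Categories and
# thresholds here are illustrative placeholders.
PROMPT_LIBRARY = {
    "transcript": "Summarize the key claims in this spoken-word transcript:\n{text}",
    "dense":      "Extract the major developments from this detailed report:\n{text}",
    "default":    "Summarize this article in three sentences:\n{text}",
}

def classify_text(text: str) -> str:
    """Very rough textual-makeup classifier: many short fragments suggest
    a transcript, very long sentences suggest dense prose."""
    sentences = [s for s in text.split(".") if s.strip()]
    if not sentences:
        return "default"
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    if avg_words < 8:
        return "transcript"
    if avg_words > 25:
        return "dense"
    return "default"

def build_prompt(text: str) -> str:
    """Pick the template matching the text's makeup and fill it in."""
    return PROMPT_LIBRARY[classify_text(text)].format(text=text)
```

The point is not the specific heuristic but the architecture: routing each article to the prompt variant most aligned with its style, rather than forcing one template onto all of them.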
Generative tools like LLMs are currently designed for small-in, large-out workflows, in which small prompts can generate infinitely large outputs, such as asking ChatGPT to compose an entire novel from a single input sentence. The converse, which is the focus of news analysis, is not yet an optimized workflow: massive input to small output. For example, asking an LLM to consume the totality of realtime daily news coverage on the EV market and summarize the major developments of the day, updated every few minutes, will typically either yield an error or exhaust the model's attention capacity. Even though they can encode vast amounts of information from their training datasets, most public LLMs are not optimized for similarly large realtime input, requiring instead workarounds such as cascading summarization and external knowledge stores. It is likely that these limitations will continue to be reduced given the pace of development, though there are some significant architectural challenges to true unbounded realtime knowledge updating, and the underlying conflict and ambiguity resolution, that may take time for the research community to fully address.
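Cascading summarization, one of the workarounds mentioned above, can be sketched as a simple map-reduce loop: chunk the input to fit the model's window, summarize each chunk, then summarize the summaries until everything fits. The `summarize` function below is a trivial placeholder standing in for a real LLM call.

```python
# Hypothetical cascading (map-reduce) summarization sketch for
# massive-input / small-output workflows. `summarize` is a placeholder
# for an LLM summarization call; here it just truncates.
def summarize(text: str, max_words: int = 50) -> str:
    """Stand-in for an LLM call: reduce text to at most max_words words."""
    return " ".join(text.split()[:max_words])

def chunk(words: list[str], size: int) -> list[str]:
    """Split a word list into window-sized chunks."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cascade_summarize(text: str, window: int = 200) -> str:
    """Repeatedly summarize window-sized chunks until the remaining text
    fits in a single window, then produce the final summary."""
    while len(text.split()) > window:
        pieces = chunk(text.split(), window)
        text = " ".join(summarize(piece) for piece in pieces)
    return summarize(text)
```

Each pass shrinks the input by roughly the window-to-summary ratio, so even arbitrarily large streams converge to a single summary in a logarithmic number of passes, at the cost of one LLM call per chunk per pass.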
Most importantly, given the free and subsidized access to models like ChatGPT, companies have yet to fully appreciate the actual cost of ownership and use of the coming generation of LLMs. Their immense hardware requirements and cost of inference mean they may not be fully suited for at-scale processing of infinitely scalable realtime streams like GDELT's. Instead, the architecture we currently recommend is the use of lightweight relevancy filtering models that use embeddings or "mini models" to determine relevancy at the article level using our Ngrams 3.0 dataset, then crawling the articles themselves, filtering to the relevant portions of the text and applying LLMs only to those portions.
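The relevancy-filtering stage of this architecture can be sketched as follows. A bag-of-words cosine similarity stands in here for a real embedding model or "mini model"; the threshold and paragraph splitting are illustrative assumptions.

```python
# Hypothetical lightweight relevancy filter: score each paragraph against
# a topic query with a cheap similarity measure and pass only the
# top-scoring portions to the expensive LLM. A bag-of-words cosine is a
# stand-in for a real embedding or "mini model".
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def relevant_portions(article: str, query: str, threshold: float = 0.1) -> list[str]:
    """Return only the paragraphs similar enough to the query to justify
    a full LLM call on them."""
    query_vec = Counter(query.lower().split())
    return [
        para for para in article.split("\n\n")
        if cosine(Counter(para.lower().split()), query_vec) >= threshold
    ]
```

The design intent is cost asymmetry: the cheap filter runs on every article in the stream, while the expensive LLM sees only the small fraction of text the filter deems relevant.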
Despite these limitations, LLMs are already being widely applied to GDELT's realtime streams, and we hope greater awareness of the unique limitations of current-generation LLMs will accelerate such work by helping the community develop best practices and workarounds for the unique characteristics of planetary-scale realtime data streams and LLMs.