
The most powerful generally available production AI models today max out at around 1-2 million tokens of total context window, with output maximums of typically just a few tens of thousands of tokens. At a very rough conversion rate of around one token per four characters, that works out to roughly 8 million characters of total input when limited to text alone, and the effective rate for longtail languages is often far lower in practice, since their text tokenizes less efficiently. This input/output limitation means that even today's most advanced research and production SOTA models are designed around input filtering, whether RAG, managed keyword search and other retrieval methods, or cascaded summarization in which larger inputs are iteratively distilled down. These constraints severely limit how today's AI models can function: even the most advanced "deep research" models must distill the user's query into a set of searches that return a tiny number of results for the model to consider and summarize. Cascaded summarization typically induces severe hallucination, while filtering and distillation discard far too much of the fine detail that can fundamentally change interpretation. In short, today's AI systems are limited to domains and applications in which vast corpora like the web can be distilled down into tiny datasets for analysis: the entire web reduced to a handful of search results. There is simply no AI approach today that allows a model to reason at corpus scale over vast archives of material.
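As a rough illustration of the arithmetic above, the sketch below converts a context window into an approximate character capacity. The 2-million-token window and the longtail rate are illustrative assumptions, not measurements of any particular model or language:

```python
# Back-of-envelope: approximate text capacity of a frontier model's context window.
# The 4-chars-per-token figure is the rough English-centric rate cited above; the
# longtail rate is an illustrative assumption, since lower-resource languages
# typically yield fewer characters per token.

CONTEXT_WINDOW_TOKENS = 2_000_000  # assumed upper end of today's production models

CHARS_PER_TOKEN = {
    "high-resource estimate": 4.0,  # rough rate cited above
    "longtail estimate": 1.5,       # illustrative assumption only
}

for label, rate in CHARS_PER_TOKEN.items():
    capacity_chars = CONTEXT_WINDOW_TOKENS * rate
    print(f"{label}: ~{capacity_chars / 1e6:.1f}M characters per full context window")
```

Under these assumptions, even a full 2-million-token window holds only a few million characters of text, which is what forces the filtering and summarization workflows described above.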
Our interest lies in precisely the opposite: a corpus of nearly 200GB of text, 30 billion words totaling 160 billion characters in 150+ languages, spanning a quarter-century of global human society. Limited to text alone, this would require a context window of more than 40 billion tokens at current conversion rates; add in 16 billion seconds of imagery and the token requirements become truly breathtaking. Today such a corpus can be examined only through RAG, keyword-based and other distillation methods, and even then the kinds of questions we can ask of it are incredibly limited: while it is possible to ask how inflation was covered today on a few news channels, it is impossible even to ask how inflation was covered globally in a single day, let alone over a longer horizon of weeks or months. The ultimate end goal, a journalist or scholar asking "what happened on Planet Earth over the past quarter-century?", is entirely out of reach: such queries require understanding the entire corpus at its original scale and level of detail, rather than downsampling or filtering it.
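The corpus-side arithmetic can be sketched the same way. The figures below come from the text above; the four-characters-per-token conversion and the 2-million-token window are the same rough assumptions as before:

```python
# Back-of-envelope: tokens needed to hold the full text archive described above,
# and how many of today's context windows that represents (text only, no imagery).

CORPUS_CHARS = 160_000_000_000   # 160 billion characters of text (from the text above)
CHARS_PER_TOKEN = 4.0            # rough English-centric assumption
CONTEXT_WINDOW_TOKENS = 2_000_000  # assumed frontier model window

corpus_tokens = CORPUS_CHARS / CHARS_PER_TOKEN          # ~40 billion tokens
windows_needed = corpus_tokens / CONTEXT_WINDOW_TOKENS  # ~20,000 full context windows

print(f"Text alone: ~{corpus_tokens / 1e9:.0f}B tokens, "
      f"or ~{windows_needed:,.0f} full context windows, before any imagery.")
```

In other words, under these assumptions the text alone would fill on the order of 20,000 of today's largest context windows, which is why no amount of clever prompting closes the gap.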
Thus, to us, one of the most critical Frontier AI Grand Challenge Problems is "corpus-scale reasoning": the ability to consider entire vast archives of inputs WITHOUT filtration, distillation or downsampling. Imagine a journalist, scholar or policymaker being able to ask "How has inflation been contextualized around the world over the past quarter-century? What visual and textual themes have been used to illustrate it, and how have those evolved and differed across countries and time?" or "How has the Russian invasion of Ukraine been portrayed and contextualized across the world over the past three years?" Even questions at this scale require unimaginable context windows to answer. Yet the most important questions of all revolve around true corpus-scale understanding: "How has the world changed over the past quarter-century?"
We would love to hear from researchers working on extremely large or unbounded context window models and other approaches that allow models to reason at corpus scale.