Lessons Learned From Building Global Platforms For Diverse User Communities: From Processor To Platform

Earlier today we explored some of the lessons we've learned building global platforms for diverse user communities. In the same vein, it is worth revisiting the evolution of GDELT's approach to its datasets.

At the highest level, GDELT has evolved from performing all analytic processing itself to providing richer base metadata upon which others can build their own pipelines. GDELT's original datasets offered only compact extracted and computed metadata, such as events, entities and geography. This evolution reflects the following overarching progression:

GDELT As Processor → GDELT As Platform

As GDELT finds ever-greater application in massive, complex analytic workflows, especially the growing array of SOTA neural models running on large accelerator clusters, we have increasingly released datasets that enable more of the analysis to be performed by the end application rather than by GDELT itself.

The first dataset in this series is the 150-language Web NGrams 3.0 dataset, which provides a realtime multilingual word frequency and KWIC (keyword in context) stream that can be used to power any kind of bespoke downstream analytics on top of GDELT.
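
As a rough sketch of what such bespoke downstream analytics can look like, the following Python example polls one minute of the stream and tallies per-language word frequencies. The per-minute file URL pattern and the record field names used here are assumptions for illustration only; the dataset documentation is the authoritative reference.

import gzip
import json
import urllib.request
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed URL pattern for the per-minute Web NGrams 3.0 files; consult the
# dataset documentation for the authoritative path and record schema.
BASE = "http://data.gdeltproject.org/gdeltv3/webngrams/"

def fetch_minute(ts):
    """Download and parse one minute of the stream (newline-delimited JSON)."""
    url = BASE + ts.strftime("%Y%m%d%H%M%S") + ".webngrams.json.gz"
    with urllib.request.urlopen(url) as resp:
        raw = gzip.decompress(resp.read())
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def language_frequencies(records):
    """Tally per-language word frequencies, one example downstream analytic."""
    counts = Counter()
    for rec in records:
        # "lang" and "ngram" are assumed field names; adjust to the live schema.
        counts[(rec.get("lang"), rec.get("ngram"))] += 1
    return counts

if __name__ == "__main__":
    # Poll a recent minute (the stream trails realtime slightly).
    minute = datetime.now(timezone.utc).replace(second=0, microsecond=0) - timedelta(minutes=5)
    for (lang, ngram), n in language_frequencies(fetch_minute(minute)).most_common(10):
        print(lang, ngram, n, sep="\t")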

For example, SOTA multilingual neural entity disambiguation models can be so large that they require entire accelerator pods, and their inference times can be unpredictable, making them difficult to run at GDELT's global scale in realtime. Instead, such applications can simply ingest the ngrams dataset in realtime and process it directly in their own hardware environment, such as streaming the ngrams into a large model running on a GCP TPU pod.
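
A hedged sketch of that handoff pattern in Python: the KWIC fields are reassembled into context windows and batched for whatever disambiguation model the application hosts on its own accelerators. The field names and the model.predict() call are illustrative assumptions rather than a documented GDELT or GCP API.

from typing import Iterable, Iterator, List

def kwic_contexts(records: Iterable[dict]) -> Iterator[str]:
    """Reassemble each KWIC record into a context window for the model.
    "pre", "ngram" and "post" are assumed field names for the text before,
    of and after the matched word."""
    for rec in records:
        yield " ".join(filter(None, (rec.get("pre"), rec.get("ngram"), rec.get("post"))))

def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group contexts into fixed-size batches."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def stream_entities(records: Iterable[dict], model, batch_size: int = 256):
    """Feed KWIC context windows through a caller-supplied model object.
    model.predict() stands in for whatever inference entry point the
    application's own serving stack exposes (e.g. a TPU-backed endpoint)."""
    for batch in batched(kwic_contexts(records), batch_size):
        yield from model.predict(batch)  # hypothetical call, for illustration

The batching step is the key design choice here: grouping contexts into fixed-size batches keeps large accelerators saturated even when per-record inference time is unpredictable.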

Increasingly, GDELT will blend these two worlds, providing both precomputed metadata and base datasets like ngrams that can serve as the foundation of bespoke analytic workflows.