Lessons Learned From Building Global Platforms For Diverse User Communities: From Processor To Platform

Earlier today we explored some of the lessons we've learned building global platforms for diverse user communities. In the same vein, it is worth revisiting the evolution of GDELT's approach to its datasets.

At the highest level, GDELT has evolved from performing all analytic processing itself to providing richer base metadata upon which others can build their own pipelines. GDELT's original datasets provided only minimal extracted and computed metadata such as events, entities and geography. That evolution has followed the overarching progression below.

GDELT As Processor

  • Daily Event Data. GDELT's first dataset was an event dataset updated daily, which was already too fast an update cycle for most academic users of the time.
  • Realtime Event Data. As GDELT rapidly moved beyond academic research to real-world production applications, it shifted from daily updates to realtime processing, inaugurating its 15-minute update cycle.
  • Local News Around The World. As part of its transition to realtime news, GDELT invested heavily in cataloging small local and regional news outlets across the world. The academic community was initially extremely resistant to moving beyond its historic legacy of monitoring just a handful of US-based newspapers. In fact, for several years the academic community openly touted research that relied exclusively on the New York Times to catalog global events as vastly superior to work that used local sources in those countries, with the nearly ubiquitous refrain that if an event wasn't reported in the New York Times, it simply didn't matter. Legacy hand-curated datasets advertised their strict reliance on a handful of US newspapers as making them vastly superior to datasets that relied on local news coverage, citing their vastly reduced coverage of events outside the US and Europe as proof of that superiority. Yet as GDELT almost overnight became the de facto dataset for real-world event cataloging, and the existential importance of local sources became ever more apparent and ever harder for the academic community to dismiss, GDELT's investments in local news monitoring made the use of local sources to understand local events standard practice.
  • Knowledge Graph. While the academic and governmental social science communities had historically focused primarily on physical events, an early GDELT insight was the criticality of latent narrative and emotional assessment and the degree to which these could help forecast future conflict and instability. As with local sources, such approaches met significant pushback and dismissal from the academic community when GDELT first debuted, with the argument that narratives play no role in conflict, much as the academic community largely dismissed the role of protest movements in macro-scale instability. Yet, as with local news, GDELT's investments in these areas have made them standard practice today for production conflict forecasting.
  • Moving Beyond English. In the aftermath of the 2014 Ebola outbreak and the discovery that GDELT had monitored the earliest glimmers of the outbreak to come, GDELT launched GDELT Translingual, which live-translates all coverage GDELT monitors across 65 languages. As with local news and latent narratives, there was intense pushback from the academic community against looking beyond English-language news coverage and, strangely, significant pushback from many sectors of government as well. As before, research that relied exclusively on English-language coverage was touted as vastly superior to research that relied on local sources in local languages. Yet, as with local news and knowledge graphs, GDELT's pioneering use of massively multilingual analysis has standardized the use of content beyond English in today's workflows.
  • Expanded Graph Universe. Over time, the base Knowledge Graph was joined by an array of specialty graphs cataloging geography, frontpages, quotations, entities and more, representing the daily universe of news coverage as a set of specialized graphs.

GDELT As Platform

As GDELT finds ever-greater application in massive, complex analytic workflows, especially the growing array of SOTA neural models running on large accelerator clusters, we have increasingly released datasets that enable more of the analysis to be performed by the end application rather than by GDELT itself.

The first in this series is the 150-language Web NGrams 3.0 dataset, which provides a realtime multilingual word frequency and keyword-in-context (KWIC) stream that can be used to perform any kind of bespoke downstream analytics on top of GDELT.
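To make this concrete, here is a minimal sketch of consuming the stream in Python. It assumes the minute-level gzipped newline-delimited JSON files and the record fields (ngram, lang, pre, post) described in the dataset's documentation; the exact URL pattern shown here is illustrative and should be confirmed against that documentation before use.

```python
import gzip
import json
import urllib.request
from collections import Counter
from datetime import datetime, timedelta, timezone

# Illustrative URL pattern for the minute-level Web NGrams 3.0 files
# (one gzipped newline-delimited JSON file per minute); confirm the exact
# pattern and field names against the dataset's documentation.
URL = "http://data.gdeltproject.org/gdeltv3/webngrams/{stamp}.webngrams.json.gz"

def fetch_minute(stamp: str):
    """Download and parse one minute of the ngram stream."""
    with urllib.request.urlopen(URL.format(stamp=stamp)) as resp:
        for line in gzip.decompress(resp.read()).splitlines():
            yield json.loads(line)

# Tally (language, ngram) frequencies for a recent minute, lagging a few
# minutes behind realtime to allow the file to be published.
stamp = (datetime.now(timezone.utc) - timedelta(minutes=5)).strftime("%Y%m%d%H%M00")
counts = Counter((rec.get("lang"), rec.get("ngram")) for rec in fetch_minute(stamp))
print(counts.most_common(10))
```

A realtime pipeline would simply repeat this fetch each minute; the same loop can feed arbitrary downstream analytics rather than a simple frequency tally.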

For example, SOTA multilingual neural entity disambiguation models can be so large that they require entire accelerator pods and exhibit unpredictable inference times, making them difficult to run at GDELT's global scale in realtime. Instead, such applications can simply ingest the ngrams dataset in realtime and process it directly in their own hardware environment, such as streaming the ngrams into a large model running on a GCP TPU pod.
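A rough sketch of that pattern appears below: each KWIC record is reassembled into a context window and batched for model inference. Here `entity_model` is a hypothetical stand-in for whatever disambiguation model the application hosts on its own hardware, and `fetch_minute` and `stamp` come from the sketch above.

```python
from itertools import islice

def kwic_snippets(records):
    """Reassemble each record's pre-context, ngram, and post-context into
    a short text window suitable as model input."""
    for rec in records:
        yield f'{rec.get("pre", "")} {rec.get("ngram", "")} {rec.get("post", "")}'.strip()

def batched(iterable, size=128):
    """Group snippets into fixed-size batches for accelerator-friendly inference."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# `entity_model` stands in for the application's own disambiguation model
# (e.g. served from a TPU pod); it is not part of GDELT or this dataset.
for batch in batched(kwic_snippets(fetch_minute(stamp))):
    entities = entity_model.predict(batch)  # hypothetical user-supplied model
```

The key design point is that GDELT only supplies the realtime text stream; all model hosting, batching and scheduling decisions stay entirely within the end application's own environment.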

Increasingly, GDELT will blend these two worlds, providing both precomputed metadata and foundational datasets like ngrams upon which bespoke analytic workflows can be built.