EWA2020: Reimagining Web Archiving As A Realtime Global Open Research Platform: The GDELT Project

Kalev will be speaking at the inaugural Engaging With Web Archives (EWA) 2020 conference, held at the National University of Ireland Maynooth, presenting "Reimagining Web Archiving As A Realtime Global Open Research Platform: The GDELT Project":

The GDELT Project (https://www.gdeltproject.org/) is a realization of the vision I laid out at the opening of the 2012 IIPC General Assembly for the transformation of web archives into open research platforms. Today GDELT is one of the world’s largest global open research datasets for understanding human society, spanning 200 years in 152 languages across almost every country on earth. Its datasets span text, imagery, spoken word and video, enabling fundamentally new kinds of multimodal analyses and reach deeply into local sources to reflect the richly diverse global landscape of events, narratives and emotions.

At its core, GDELT in the web era is essentially a realtime, research-centered production web archive focused on global news (defined as sources used to inform societies, both professional and citizen-generated). It continually maps the global digital news landscape in realtime across countries, languages and narrative communities, acting as both an archival facilitator (providing a live stream of every URL it discovers to organizations, including the Internet Archive, for permanent preservation) and a research platform.
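
As a small illustration of what consuming this live stream looks like from the outside, the sketch below polls GDELT's public 15-minute update manifest (lastupdate.txt) so a downstream consumer, such as an archiving partner, can pick up each new batch of files as it is published. This is a minimal example, not GDELT's own tooling:

```python
# Minimal sketch: following GDELT's realtime output by polling the public
# 15-minute update manifest.
import time
import urllib.request

MANIFEST = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

def latest_files():
    """Return the URLs of the most recently published update files.

    Each manifest line has the form "<size> <md5> <url>", pointing at the
    latest Events, Mentions and Global Knowledge Graph archives.
    """
    with urllib.request.urlopen(MANIFEST) as resp:
        lines = resp.read().decode("utf-8").strip().splitlines()
    return [line.split()[2] for line in lines]

seen = set()
while True:
    for url in latest_files():
        if url not in seen:
            seen.add(url)
            print("new update file:", url)  # hand off to a downloader here
    time.sleep(15 * 60)  # a new batch is published every 15 minutes
```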

In contrast to the traditional post-analytic workflow most commonly associated with web archival research, in which archives are queried, sampled and analyzed after creation, GDELT focuses on realtime analysis, processing every piece of content it encounters through an ever-growing array of standing datasets and APIs spanning rules-based, statistical and neural methodologies. It natively analyzes 152 languages, while machine translation live-translates everything it monitors in 65 languages, enabling language-independent search and analysis.
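
One way to see this language-independent search in action is through the public GDELT DOC 2.0 API, which searches the machine-translated stream alongside native English coverage. The sketch below assumes the documented query, mode and format parameters and the sourcelang: operator; check the API documentation for exact behavior:

```python
# Minimal sketch: language-independent search via the GDELT DOC 2.0 API.
# Parameter names follow the public API docs; verify before relying on them.
import json
import urllib.parse
import urllib.request

API = "https://api.gdeltproject.org/api/v2/doc/doc"

def search(query, max_records=10):
    """Run an article-list query and return the decoded JSON response."""
    params = urllib.parse.urlencode({
        "query": query,
        "mode": "artlist",
        "maxrecords": max_records,
        "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        return json.loads(resp.read().decode("utf-8"))

# Find Spanish-language coverage of a topic without knowing Spanish:
for article in search('"climate change" sourcelang:spanish')["articles"]:
    print(article["seendate"], article["domain"], article["title"])
```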

Twin global crawler and computational fleets are distributed across 24 data centers in 17 countries, leveraging Google Cloud’s Compute Engine and Cloud Storage infrastructure together with its ever-growing array of AI services and APIs. These fleets underpin regional Elasticsearch and bespoke database and analytic clusters, all feeding into petascale analytic platforms like BigQuery and the Inference API for at-scale analyses. This massive global-scale system must operate entirely autonomously, scale to support enormous sudden loads (such as during breaking disasters) and function within an environment in which both the structure (rendering and transport technologies) and semantics (evolving language use) of the web are in a state of perpetual and rapid change.
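
From the researcher's side, "feeding into BigQuery" means GDELT's tables are directly queryable as a public dataset. A minimal sketch, assuming the google-cloud-bigquery client library and an authenticated Google Cloud project:

```python
# Minimal sketch: querying GDELT's public BigQuery mirror. Assumes
# `pip install google-cloud-bigquery` and authenticated GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()

# Count 2020 event records by country using the GDELT 2.0 Events table.
sql = """
SELECT ActionGeo_CountryCode AS country, COUNT(*) AS events
FROM `gdelt-bq.gdeltv2.events`
WHERE SQLDATE >= 20200101
GROUP BY country
ORDER BY events DESC
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.country, row.events)
```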

Traditional web archives are not always well aligned with the research questions of news analysis, which often require fixed time guarantees and a greater emphasis on areas like change detection and agenda setting. GDELT therefore includes numerous specialized news-centric structural datasets. The Global Frontpage Graph catalogs more than 50,000 major news homepages every hour on the hour, totaling nearly a quarter trillion links over the last two years, to support agenda-setting research. The Global Difference Graph recrawls every article 24 hours and one week after first capture, with fixed time guarantees, yielding a 152-language realtime news editing dataset that catalogs stealth edits and silent deletions. Structural markup is examined and embedded social media posts are cataloged as part of its Global Knowledge Graph. A vast distributed processing pipeline performs everything from entity extraction and emotional coding to state-of-the-art language modeling and claims and relationship mapping. Images are extracted from each article and analyzed by Cloud Vision, enabling analysis of the visual landscape of the web. Datasets from quotations to geography to relationships to emotions to entailment and dependency extracts are all computed and output in realtime, operating on either native or translated content.
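
To make the Global Knowledge Graph output concrete, the sketch below reads a single 15-minute GKG file and prints its document-level emotional coding per source. The field positions follow the GKG 2.1 codebook (SourceCommonName in the fourth column, V2Tone in the sixteenth) and should be verified against the current documentation; the file timestamp is only an example:

```python
# Minimal sketch: parsing one 15-minute Global Knowledge Graph (GKG) file and
# printing the document-level tone per source. Field positions follow the
# GKG 2.1 codebook and should be treated as assumptions to verify.
import io
import urllib.request
import zipfile

# Example timestamp only; real URLs come from the lastupdate.txt manifest.
URL = "http://data.gdeltproject.org/gdeltv2/20200101000000.gkg.csv.zip"

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

with archive.open(archive.namelist()[0]) as f:
    for line in io.TextIOWrapper(f, encoding="utf-8", errors="replace"):
        record = line.rstrip("\n").split("\t")
        if len(record) <= 15:
            continue  # skip malformed rows
        source = record[3]            # SourceCommonName
        tone = record[15].split(",")  # V2Tone: tone, positive, negative, ...
        print(source, "tone:", tone[0])
```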

In essence, GDELT doesn’t just crawl the open web; it processes everything it sees in realtime to create a vast archive of rich realtime research datasets. This firehose of data feeds into downloadable datasets and APIs that enable realtime interactive analyses, while BigQuery enables at-scale explorations of limitless complexity, including one-line terascale graph construction, geographic analysis and full integration with the latest neural modeling approaches.
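
As one concrete reading of "one-line terascale graph construction", a person co-occurrence network can be assembled entirely inside BigQuery from the GKG's person annotations. The table name, the _PARTITIONTIME partitioning and the "Name,Offset;Name,Offset" layout of V2Persons are assumptions based on the public gdelt-bq dataset documentation:

```python
# Minimal sketch: graph construction inside BigQuery, building a person
# co-occurrence network from one day of GKG records. Table and field layout
# are assumptions based on the public gdelt-bq dataset documentation.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  SPLIT(p1, ',')[OFFSET(0)] AS source,
  SPLIT(p2, ',')[OFFSET(0)] AS target,
  COUNT(*) AS weight
FROM `gdelt-bq.gdeltv2.gkg_partitioned`,
  UNNEST(SPLIT(V2Persons, ';')) AS p1,
  UNNEST(SPLIT(V2Persons, ';')) AS p2
WHERE DATE(_PARTITIONTIME) = '2020-01-01'
  AND p1 < p2
GROUP BY source, target
HAVING source != target
ORDER BY weight DESC
LIMIT 20
"""

for edge in client.query(sql).result():
    print(f"{edge.source} --{edge.weight}-- {edge.target}")
```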

Full integration with GCE, GCS and BigQuery couples realtime analysis of GDELT’s rich standing annotations with the ability to interactively apply new analyses, including arbitrarily complex neural modeling, at scale. This means GDELT can both provide a standing set of realtime annotations over everything it encounters and support traditional post-facto analysis at the effectively infinite scale of the public cloud.

From mapping global conflict and modeling global narratives to providing the data behind one of the earliest alerts of the COVID-19 pandemic, GDELT showcases what a research-first web archive is capable of and how to leverage the full power of the modern cloud in transforming web archives from cold storage into realtime open research platforms.