Web Archives As Digital History: Methodologies, Workflows And Technological Needs

Within GDELT's vast archives lie decades of global human history. A library of open datasets spanning more than 8 trillion datapoints is available to researchers as downloadable files and through various APIs, query services like BigQuery and non-technical interfaces like the TV and TV AI Explorers. Together these offer rich annotations over billions of online news articles spanning 20 years and more than 150 languages, half a billion worldwide news images, a decade of television news covering portions of 150 channels and half a decade of radio news covering portions of 550 stations. In short, GDELT is a massive archive spanning the leadup to, earliest glimmers of and longitudinal course of global events from conflict to pandemics.
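As a minimal sketch of what programmatic access can look like, the example below uses the google-cloud-bigquery client to tally events by day for a single country. The specific table and column names (gdelt-bq.gdeltv2.events, SQLDATE, ActionGeo_CountryCode, AvgTone) are assumptions drawn from GDELT's public BigQuery dataset and should be checked against the current schema documentation.

```python
# Minimal sketch of programmatic access to GDELT through BigQuery.
# Assumptions: the public gdelt-bq.gdeltv2.events table and its
# SQLDATE / ActionGeo_CountryCode / AvgTone columns; verify these
# against the current GDELT schema before relying on them.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project and credentials

sql = """
SELECT
  SQLDATE AS day,
  COUNT(*) AS num_events,
  AVG(AvgTone) AS avg_tone
FROM `gdelt-bq.gdeltv2.events`
WHERE ActionGeo_CountryCode = 'UP'   -- FIPS country code for Ukraine
  AND SQLDATE >= 20220101
GROUP BY day
ORDER BY day
"""

# Stream the daily counts and average tone back to the researcher.
for row in client.query(sql).result():
    print(row.day, row.num_events, row.avg_tone)
```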

Such massively multilingual and multimodal digital archives require profoundly new methodologies, workflows and technologies to fully leverage their insights.

Automated and machine-assisted analyses require fundamentally new methodologies and approaches through which to understand their results. Replacing the "close reading" of traditional historical research with the "distant reading" of at-scale automated assessment offers unprecedented opportunities, but also raises exceptional challenges in reconciling such results with traditional understandings of historical research.

Profoundly new analytic workflows are required even for human assessment of such archives, simply due to their scale. New user interfaces like the Television News Visual Explorer showcase the potential for innovative machine-assisted human analysis and historical research.

Finally, few scholars have the technical infrastructure to construct maps from tens of billions of coordinates, examine graphs totaling trillions of edges or sift realtime patterns from petascale archives. What new kinds of technological workflows and infrastructure are required to enable at-scale analyses of archives that lack historical precedent in their size and scope and are orders of magnitude, sometimes many orders of magnitude, larger than the largest datasets that have historically existed in many disciplines?
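One hedged illustration of what such a workflow might look like is sketched below: the heavy aggregation is pushed into BigQuery itself, binning event coordinates into a coarse one-degree grid server-side so that only tens of thousands of summary cells, rather than the raw coordinates, ever reach the researcher's machine. As before, the table and column names are assumptions about GDELT's public BigQuery dataset, not a prescribed pipeline.

```python
# Illustrative sketch: aggregate geographic coordinates into a coarse
# 1-degree grid inside BigQuery so only the summarized grid cells,
# not the raw coordinates, are transferred to the researcher's machine.
# Table and column names (gdelt-bq.gdeltv2.events, ActionGeo_Lat,
# ActionGeo_Long) are assumptions and should be verified.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  FLOOR(ActionGeo_Lat)  AS lat_bin,
  FLOOR(ActionGeo_Long) AS lon_bin,
  COUNT(*)              AS num_events
FROM `gdelt-bq.gdeltv2.events`
WHERE ActionGeo_Lat IS NOT NULL
  AND ActionGeo_Long IS NOT NULL
GROUP BY lat_bin, lon_bin
"""

# Collect the grid: at most roughly 180 x 360 = 64,800 cells, small
# enough to hand to any conventional mapping or plotting library.
grid = {(row.lat_bin, row.lon_bin): row.num_events
        for row in client.query(sql).result()}

print(f"{len(grid)} populated grid cells")
```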

We are embarking this fall upon a series of new initiatives exploring how to make such massive archives more accessible to researchers, from new technical pipelines to new analytic methodologies that help scholars bridge the gap between the human-centered small-data traditions of their disciplines and the unprecedented new possibilities offered by GDELT. We'd love to hear from you!