Cultural Computing at Literature Scale: Encoding the Cultural Knowledge of Tens of Billions of Words of Academic Literature

Representing the first large-scale content analysis of JSTOR, DTIC, or the Internet Archive ever performed, this latest GDELT collaboration, with Timothy Perkins and Chris Rewerts of the US Army Corps of Engineers, codified the entire holdings of JSTOR (academic literature), declassified/unclassified DTIC (US Government publications), and the Internet Archive (the open web) relating to humanities and social sciences academic literature about Africa and the Middle East. The result is a massive knowledgebase that encodes all of the underlying socio-cultural knowledge from more than 21 billion words of material. It includes the majority of ethnic, religious, and social groups, 275 themes, geography, and the entire citation graph, making it possible to literally map what has been studied in the academic literature about a particular area dating back half a century. Through a new "find an expert" service, it is also possible to get back a list of the top scholars whose works covering a given group, location, topic, or combination thereof are the most heavily cited.
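To make the "find an expert" idea concrete, here is a minimal sketch of one way such a lookup could work: filter publications by group, location, and/or theme, then rank authors by the total citations their matching works have received. The record layout, the `find_experts` function, and the sample data are all hypothetical illustrations, not the actual GDELT schema or service.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Publication:
    # Hypothetical record layout; the real knowledgebase schema is not described here.
    authors: list
    groups: set = field(default_factory=set)
    locations: set = field(default_factory=set)
    themes: set = field(default_factory=set)
    times_cited: int = 0


def find_experts(pubs, group=None, location=None, theme=None, top_n=10):
    """Rank authors by total citations across works matching every given filter."""
    totals = Counter()
    for pub in pubs:
        if group is not None and group not in pub.groups:
            continue
        if location is not None and location not in pub.locations:
            continue
        if theme is not None and theme not in pub.themes:
            continue
        # Credit each author with the citations received by the matching work.
        for author in pub.authors:
            totals[author] += pub.times_cited
    return totals.most_common(top_n)


# Made-up sample records with fictional authors, for illustration only.
pubs = [
    Publication(["A. Scholar"], {"Nuer"}, {"South Sudan"}, {"pastoralism"}, 120),
    Publication(["B. Writer", "A. Scholar"], {"Nuer"}, {"Ethiopia"}, {"conflict"}, 40),
    Publication(["C. Author"], {"Dinka"}, {"South Sudan"}, {"conflict"}, 300),
]

print(find_experts(pubs, group="Nuer"))
print(find_experts(pubs, location="South Sudan", theme="conflict"))
```

Summing raw citation counts is only one plausible ranking; a real service might instead weight by recency, split credit among coauthors, or use the citation graph more directly.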

This is just the first in a series of forthcoming announcements expanding the datasets monitored by GDELT. In this case, the goal is to take the enormous wealth of knowledge captured in the humanities and social sciences (ranging from historians to ethnographers, linguists to political scientists) and encode it in such a way that it becomes far more accessible to realtime analysis and longitudinal trend analysis. In the next few months we will also be demonstrating how this data can be used to contextualize contemporary realtime news and social media alerts, instantly providing a rich background report on the deeper story of modern events, as well as identifying experts who may be able to provide further contextualization of those events.

 

ABSTRACT

The vast array of academic literature published by the humanities and social sciences disciplines codifies our collective scholarly understanding of how societies function and the beliefs, ideals, and ethnic, religious, and tribal contexts that undergird global societal behavior, yet this material has been largely absent from the recent computational revolution in the study of culture. Applying temporal, geographic, thematic, and citation algorithms to an archive of more than 21 billion words spanning 1.5 million publications from 7 collections, including the entire contents of JSTOR, DTIC, CORE, CiteSeerX, and the Internet Archive's 1.6 billion PDFs, academic literature is seen to offer a powerful new lens onto global culture. Four case studies demonstrate using this archive to map the Nuer ethnic group and identify its top experts, map the literature on food and water security, explore the thematic underpinnings of the Rwandan genocide, and construct a network over the ethnic groups of the world as seen through the combined academic literature of the past half century.

 

Read the Full Paper.