NY-NLP: NLP At Planetary Scale: GDELT's Language Understanding Datasets

Kalev is the May 11th speaker for the Natural Language Processing – New York (NY-NLP) meetup.

GDELT today encompasses more than 3 trillion datapoints spanning more than 200 years across 152 languages. Its datasets span text, imagery and video, enabling fundamentally new kinds of multimodal analyses. They cover nearly 200 billion frontpage links alone, 100 billion neural-annotated words, a billion textual articles, half a billion images totaling a quarter-trillion pixels and ten years of television news broadcast annotations, among myriad others. They also pose novel methodological challenges, from blending structured visual and unstructured textual understandings to translating from nanosecond frame-level machine precision to the coarse human airtime metrics of traditional content analysis.

Many of GDELT’s datasets revolve around text. At the macro level, ngram datasets cover a decade of television news in the US and Europe and a year and a half of online news covering 152 languages. More than 100 billion tokens of online news in 11 languages spanning half a decade have been annotated through Google’s Natural Language API, with their part-of-speech and dependency labels, along with example snippets of each use case, cataloged in 15-minute intervals, allowing an in-depth look at how language is evolving. SOTA neural entity extraction can be seen alongside classical HMM grammar-based extractors, while neural sentiment can be aligned with several thousand classical BOW sentiment dictionaries. A 152-language realtime quotation dataset offers a live look at public statements, while a frontpage link dataset offers insights into editorialized language. A new initiative is exploring NLP beyond traditional born-textual domains, from image and video OCR to the unique challenges of applying NLP to ASR output and its news-domain idiosyncrasies, both technical (uncased infinite word streams and transcription error) and methodological (stream-of-consciousness speech).

This talk will walk through GDELT’s full range of textual datasets and APIs suitable for NLU, touching on example workflows and some of the fascinating insights we’ve gained into the consciousness of global society and into performing language analysis at planetary scale.
