A Look At GDELT's Open AI Datasets: Video, Imagery, Speech And Natural Language Understanding

Kalev Leetaru

6 years ago

GDELT has been an early alpha user of many of Google's Cloud AI offerings, exploring how the rapidly advancing world of deep learning video, still image, speech and text understanding can be harnessed to fundamentally reshape how machines make sense of the global deluge each day. Applications range from cataloging wildlife crime to helping biosurveillance organizations flag emerging disease outbreaks to forecasting the risk of conflict. Looking to the future, we are exploring everything from systems that can perform open-ended relationship identification to new ways of reconciling the divergent landscapes of visual and textual representation.

To help you get started, we've compiled a list below of our largest AI-annotated datasets.

We're tremendously excited to see what all of you can do with these massive and incredibly rich open datasets!

Cloud Video API

Working with the Internet Archive's Television News Archive, GDELT has non-consumptively analyzed a growing number of television news broadcasts through Google's Cloud Video API.

Visual Global Entity Graph (VGEG) 2.0. Nearly the entirety of the ABC, CBS and NBC evening news broadcasts of the past decade have been analyzed, totaling 9,700 distinct broadcasts and 18 million seconds of airtime. Two versions of this dataset are available: the raw original JSON output files as-is from the Cloud Video API and a summarized version that aggregates by second of airtime. You can also interactively search the dataset using the AI TV Explorer. Through a collaboration with the Media-Data Research Consortium (M-DRC), their Google Cloud COVID-19 Research Grant "Quantifying the COVID-19 Public Health Media Narrative Through TV & Radio News Analysis" has allowed for the processing of BBC, CNN, MSNBC and Fox News from Jan. 1, 2020 through the present, and of CNN, MSNBC and Fox News during a number of major disease outbreaks of the past decade.
- Raw JSON Dataset
- Summary Airtime Dataset
One Week Pilot Television Analysis. In May 2019 we conducted an early pilot television analysis, analyzing a week of CNN, MSNBC and Fox News and the morning and evening broadcasts of ABC, CBS, NBC and PBS. The raw API JSON output is available, along with a number of summary datasets.
- Complete Dataset.
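
As a rough illustration of what "aggregates by second of airtime" can mean for the VGEG summary version, the sketch below collapses time-ranged annotations into a per-second listing. The `(label, start_seconds, end_seconds)` tuple shape and the function name `labels_by_second` are assumptions made for this example; they are not the Cloud Video API's JSON schema or the layout of the released summary files.

```python
import math
from collections import defaultdict


def labels_by_second(segments):
    """Map each whole second of airtime to the labels visible during it.

    `segments` is an iterable of (label, start_seconds, end_seconds) tuples.
    This shape is a hypothetical simplification for illustration only.
    """
    by_second = defaultdict(set)
    for label, start, end in segments:
        # Second s covers the interval [s, s + 1); a label counts toward
        # every whole second that its own interval overlaps.
        first = math.floor(start)
        last = max(first, math.ceil(end) - 1)
        for second in range(first, last + 1):
            by_second[second].add(label)
    # Sort for stable, readable output.
    return {s: sorted(labels) for s, labels in sorted(by_second.items())}


example = [("anchor desk", 0.0, 2.5), ("map", 1.2, 1.8)]
summary = labels_by_second(example)
```

Here `summary` lists "anchor desk" for seconds 0 through 2 and adds "map" only for second 1. Keying on whole seconds keeps the summary far smaller than the raw annotations while still allowing airtime totals to be computed by counting keys.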

Cloud Vision API

Since December 2015 we have used Google's Cloud Vision API to non-consumptively analyze more than half a billion worldwide online news images totaling more than a quarter trillion pixels, yielding more than 300 billion datapoints.

Visual Global Knowledge Graph (VGKG). The raw Cloud Vision API JSON annotation output for all half billion images, containing the date the image was seen, its URL, the URL of the first article it was seen in, and additional JSON enrichments containing fields like its complete EXIF metadata, perceptual hashes and several other data points.
- Complete Dataset. (Also available in 1 minute updates).
Deep Learning Meets Campaign 2016. In early 2016 we ran all 267 ads from the Internet Archive's Political TV Ad Archive through Cloud Vision, sampling them at 1fps, to perform a visual video analysis through still imagery.
- Complete Dataset.
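
Sampling a video at 1fps, as in the campaign ad analysis above, amounts to choosing one frame per second of footage and treating each as a still image. The sketch below only computes which frame indices would be kept; the function name and parameters are hypothetical, and it does not reflect the actual tooling used to extract the frames.

```python
import math


def sample_frame_indices(total_frames, native_fps, sample_fps=1.0):
    """Return the indices of frames kept when sampling at `sample_fps`.

    Hypothetical helper for illustration: `total_frames` and `native_fps`
    describe the source video, `sample_fps` the desired sampling rate.
    """
    if native_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    if total_frames <= 0:
        return []
    duration_seconds = total_frames / native_fps
    sample_count = math.ceil(duration_seconds * sample_fps)
    step = native_fps / sample_fps  # source frames between kept frames
    # Clamp to the last valid index in case rounding overshoots.
    return [min(total_frames - 1, round(k * step)) for k in range(sample_count)]


# A 3-second clip recorded at 30 frames per second, sampled at 1fps.
indices = sample_frame_indices(total_frames=90, native_fps=30)
```

For the 3-second example, `indices` holds the first frame of each second. Each selected frame could then be submitted to an image annotation service as an ordinary still.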

Cloud Natural Language API

Since July 2016 we have annotated a small daily sample of worldwide English language online news coverage through Google's Cloud Natural Language API, expanding in 2020 to all 11 languages it supports and to other forms of textual content such as television news transcripts.

Global Entity Graph (GEG) GCNLAPI. More than 11.3 billion entities identified from more than 103 million online news articles 2016-present (English only through 2020, then 11 languages since), along with document-level sentiment scores.
- Complete Dataset.
Web PartOfSpeech Dataset. More than 101 billion tokens of English (expanded to 11 languages 2020-present) from around 100 million online news articles 2016-present have been annotated with part of speech information (tag, aspect, case, form, gender, mood, number, person, proper, reciprocity, tense and voice) and dependency parse label, including sample snippets of each unique usage.
- Complete Dataset.
Global Entity Graph (GEG) G1 Baseline. Intended as a comparison dataset to the neural-constructed entity graph above, this dataset annotates most daily English language articles through a classical HMM + grammar-based extractor with Wikipedia normalization.
- Complete Dataset.
The Television News Global Entity Graph 2.0. The textual spoken word transcripts of more than 300,000 television news broadcasts from BBC News, ABC, CBS, CNN, MSNBC, NBC, and Fox News spanning more than a decade and totaling a quarter million hours of television news were analyzed by Google's Cloud Natural Language API, cataloging more than 470 million entity mentions.
- Complete Dataset.
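
For readers getting started with the entity datasets in this section, a first step is often a simple tally of how often each entity appears. The sketch below assumes, purely for illustration, that records arrive as newline-delimited JSON with an `entities` list whose items carry `name` and `type` fields; those field names and the `count_entities` helper are hypothetical, so check each dataset's documentation for its actual schema.

```python
import json
from collections import Counter


def count_entities(lines):
    """Tally (name, type) pairs across newline-delimited JSON records.

    The record layout ("entities" -> [{"name": ..., "type": ...}]) is an
    assumed shape for this example, not a documented schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for entity in record.get("entities", []):
            counts[(entity["name"], entity["type"])] += 1
    return counts


sample_lines = [
    '{"entities": [{"name": "Paris", "type": "LOCATION"}]}',
    "",
    '{"entities": [{"name": "Paris", "type": "LOCATION"},'
    ' {"name": "WHO", "type": "ORGANIZATION"}]}',
]
totals = count_entities(sample_lines)
```

Because the function accepts any iterable of lines, it can be fed an open file handle and will stream through a large file without loading it all into memory.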

Cloud Speech To Text API & Video ASR

As part of our video datasets above, we analyzed all of the broadcasts through either Google's Speech to Text API or the Video API's own "video" speech recognition model. At this time we have not released the raw transcript files, but we have released several entity datasets extracted from the transcripts.