The GDELT Project

A Look At GDELT's Open AI Datasets: Video, Imagery, Speech And Natural Language Understanding

GDELT has been an early alpha user of many of Google's Cloud AI offerings, exploring how rapidly advancing deep learning for video, still image, speech and text understanding can fundamentally reshape how machines make sense of the global deluge of information each day. Applications range from cataloging wildlife crime to helping biosurveillance organizations flag emerging disease outbreaks to forecasting the risk of conflict. Looking to the future, we are exploring everything from systems that can perform open-ended relationship identification to new ways of reconciling the divergent landscapes of visual and textual representation.

To help you get started, we've compiled a list below of our largest AI-annotated datasets.

We're tremendously excited to see what all of you can do with these massive and incredibly rich open datasets!

Cloud Video API

Working with the Internet Archive's Television News Archive, GDELT has non-consumptively analyzed a growing number of television news broadcasts through Google's Cloud Video API.

Cloud Vision API

Since December 2015 we have used Google's Cloud Vision API to non-consumptively analyze more than half a billion worldwide online news images totaling more than a quarter trillion pixels, yielding more than 300 billion datapoints.

Cloud Natural Language API

Since July 2016 we have annotated a small daily sample of worldwide English language online news coverage through Google's Cloud Natural Language API. In 2020 we expanded to all 11 languages the API supports and to other forms of textual content such as television news transcripts.

Cloud Speech To Text API & Video ASR

As part of our video datasets above, we analyzed all of the broadcasts through either Google's Speech to Text API or the Video API's own "video" speech recognition model. At this time we have not released the raw transcript files, but we have released several entity datasets extracted from the transcripts.