The GDELT Project

What Google's Cloud Video AI Sees Watching Decade Of Television News: The Visual Global Entity Graph 2.0

UPDATE (8/7/2020): Thanks to a Google Cloud COVID-19 Research Grant to the Media-Data Research Consortium, this dataset has been vastly expanded to cover all of 2020 and major disease outbreaks of the past decade.

What would it look like to have Google's state-of-the-art video understanding system, Cloud Video AI, watch a decade of ABC, CBS and NBC evening television news broadcasts (2010 to present) and CNN (Jan. 25, 2020 to present) and describe the visual objects and activities it sees second by second, constructing a non-consumptive chronology of the visual themes and narratives that have defined global events of the past decade? How do machines "see" the news, and what does it look like to understand television news through their algorithmic eyes? What fundamentally new kinds of research and journalistic questions become possible when, for the first time, we have a rich non-consumptive visual index of television news?

Today we are incredibly excited to unveil the results: a powerful new non-consumptive visual chronology of almost 9,700 broadcasts totaling more than 18 million seconds of airtime, built by using Google’s Cloud Video API to watch ten years of ABC, CBS and NBC evening news broadcasts from the Internet Archive’s Television News Archive, and updating every 30 minutes with a rolling 24-hour delay.

Despite its rich visual-first nature, television news today is primarily explored through the modality of text. The Internet Archive’s Television News Archive has been a leader in this space, helping to popularize timecoded keyword search of closed captioning transcripts in the library context and exploring new research interfaces to television. While these keyword search systems provide incredible opportunities for exploring coverage trends and can be read by textual deep learning systems to catalog the things and themes they mention, at the end of the day, the very visual world that separates television from radio and the online world is absent from such analyses.

At the same time, the last few years have brought profound advances in machine vision, with algorithmic visual understanding moving from the research lab to production everyday use. What would it look like to have a state-of-the-art machine learning system watch a decade of television news broadcasts across “the big three” networks and catalog their visual themes?

Over the past several years we've explored how AI can help us better understand the visual world of television. During the 2016 election we converted all 267 campaign ads monitored by the Archive into still image sequences and had Google's Cloud Vision API catalog their visual narratives. Last year we used Google's Cloud Video, Vision, Speech-to-Text and Natural Language APIs to watch a week of television news from CNN, MSNBC, Fox News and the morning and evening ABC, CBS, NBC and PBS broadcasts, releasing more than 600GB of annotations that were used to explore a wide range of questions about just what it is we see when we turn on our televisions. We've even explored having Google's Cloud Natural Language API "watch" 270,000 hours of television news by reading their closed captioning spoken word transcripts and compiling the entities mentioned within. In December we released an initial prototype of the Visual Global Entity Graph 1.0, and the lessons we've learned from that dataset have been used to create this massive new 2.0 release, which will now also update daily!

To explore what it would look like to understand a decade of television news through AI, the half-hour evening news broadcasts of ABC, CBS and NBC from July 2010 through present and CNN from January 25, 2020 through present from the Internet Archive’s Television News Archive were analyzed within a restricted-access, non-consumptive computational digital reference library using Google’s Cloud Video API. In particular, we used its labeling feature, in which it visually analyzes each second of footage and assigns a list of predefined labels describing the objects and activities depicted within.

The end result is a list of the visual themes (primarily objects and activities) found in each broadcast by second, charting its visual narratives. Such a chronology offers a powerful and unique counterpart to the textual chronology of its closed captioning, allowing researchers for the first time to consider the visual dimension of television news.

What are the visual themes associated with the key narratives and events of the past decade? By combining this dataset with the captioning chronology, it becomes possible to examine what kinds of imagery are typically depicted onscreen when a given topic is discussed and how that imagery has changed over the years. Longitudinal questions, such as the amount of coverage by month over the past decade devoted to civil mobilizations like protests, can also be explored for the first time. This enables an entirely new approach to understanding visual narratives by using machines to sift through vast archives of video.

Perhaps most importantly, this new dataset will allow researchers to explore for the first time how a better understanding of the visual dimensions of news might help combat the spread of falsehoods and better assess the diffusion of contested narratives and inorganic campaigns. What might we learn from the rich visual processes of broadcast journalism that could be applied to increase trust in digital journalism and contextualize the news in ways that combat misunderstanding and falsehood?

The dataset consists of one file per broadcast, running from July 15, 2010 through present (with a 24-72 hour rolling delay). Each file is in newline delimited JSON format, with each row representing one second of broadcast airtime and containing an array of all of the visual entities identified by the Video API from that second.

This raw non-consumptive dataset is intended for those with data science backgrounds comfortable with processing raw JSON data. We are in the process of constructing a user-friendly interface to this data to support researchers and journalists in their explorations of it, so stay tuned!
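To give a concrete sense of what working with these files involves, below is a minimal Python sketch that streams one broadcast file and tallies how often each visual label appears. The filename and the "entities"/"name" field names are illustrative assumptions, not the dataset's documented schema.

```python
import json
from collections import Counter

# Minimal sketch: stream one newline-delimited JSON broadcast file and
# tally how often each visual label appears across the broadcast's seconds.
# NOTE: the filename and the "entities"/"name" keys below are illustrative
# assumptions, not the dataset's documented schema.
label_counts = Counter()

with open("broadcast.json", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        second = json.loads(line)  # one record = one second of airtime
        for entity in second.get("entities", []):
            label_counts[entity.get("name", "")] += 1

# Show the most frequently detected visual labels for this broadcast.
for label, count in label_counts.most_common(20):
    print(f"{count:6d}  {label}")
```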

DETECTION MODEL CHANGES

The original dataset released on February 3, 2020 contained annotations produced by two different models. Videos annotated from November 2019 through January 29, 2020 had been annotated using the "v1p3beta1" endpoint with the "builtin/stable" model and "stationarycamera" set to true, while videos after that were annotated with the "v1" endpoint with the "builtin/latest" model and "stationarycamera" set to false. The stationary camera setting was used for the earlier broadcasts on the assumption that it would provide better results, since a considerable amount of television news footage is filmed either in a fixed studio setting or using a stationary tripod-mounted camera in the field. Experimentally, we determined that television news footage actually contains far more moving-camera footage than expected and that the setting did not provide meaningful improvements in detection accuracy for the specific characteristics of television news. One unintended consequence of the stationary camera setting for this earlier content is that detection labels could be propagated across brief scene changes. For example, if a scene features elephants in a landscape, briefly cuts to a commentator for a few seconds, and then returns to the elephants, the results may still list "elephant" as a detected entity during that brief commentator cutaway for videos annotated with this setting. This created a discontinuity, with some labels showing profound changes across the model boundary.

To address this, as of February 28, 2020 we have replaced the original dataset, including the raw and processed JSON files and the BigQuery table, with a new reprocessed dataset in which the entire set of videos was annotated using the same "v1" endpoint with the "builtin/latest" model and "stationarycamera" set to false. All videos have now been annotated using the exact same model, so results will be continuous over the complete decade-long collection of videos.
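For readers curious what that configuration looks like in practice, here is a hedged sketch of invoking the "v1" label detection endpoint with the "builtin/latest" model and a non-stationary camera using the google-cloud-videointelligence Python client. The bucket URI is a placeholder and the detection mode shown is our assumption, not a documented detail of this dataset's pipeline; the exact request shape also varies by client library version.

```python
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# Label detection settings matching the reprocessed dataset's description:
# "builtin/latest" model with the stationary camera heuristic disabled.
config = videointelligence.LabelDetectionConfig(
    label_detection_mode=videointelligence.LabelDetectionMode.SHOT_AND_FRAME_MODE,  # assumed mode
    stationary_camera=False,
    model="builtin/latest",
)
context = videointelligence.VideoContext(label_detection_config=config)

operation = client.annotate_video(
    request={
        "input_uri": "gs://your-bucket/broadcast.mp4",  # placeholder path
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "video_context": context,
    }
)
result = operation.result(timeout=900)

# Print the entities detected at the frame level for the first result.
for annotation in result.annotation_results[0].frame_label_annotations:
    print(annotation.entity.description)
```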

RAW VISUAL API OUTPUT

For those interested in the raw visual output produced by the Cloud Video API for each video (which includes all visual-related annotations, but not speech recognition), those files are also now available for download. They offer annotations at the frame level, with time offsets recorded in nanoseconds, enabling fundamentally new kinds of analyses.
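As a rough sketch of how one might bucket these frame-level annotations into whole seconds, the snippet below assumes the raw JSON serializes time offsets as protobuf-style objects with "seconds" and "nanos" fields and uses the Video API's snake_case field names; the actual files may use a different serialization (for example, camelCase keys or "12.5s"-style duration strings), so treat the field names as assumptions.

```python
import json
from collections import defaultdict

def offset_to_seconds(offset):
    """Convert a protobuf-style time offset ({"seconds": ..., "nanos": ...})
    into a float number of seconds. This field layout is an assumption about
    how the raw files serialize the API's Duration values."""
    return offset.get("seconds", 0) + offset.get("nanos", 0) / 1e9

labels_by_second = defaultdict(set)

with open("broadcast_raw.json", "r", encoding="utf-8") as f:
    raw = json.load(f)

# Walk the frame-level label annotations and record each detection under
# the whole second of airtime it falls within.
for annotation in raw["annotation_results"][0].get("frame_label_annotations", []):
    description = annotation["entity"]["description"]
    for frame in annotation["frames"]:
        sec = int(offset_to_seconds(frame["time_offset"]))
        labels_by_second[sec].add(description)

print(labels_by_second.get(60, set()))  # labels visible during the 61st second
```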

TECHNICAL REFERENCE

Each second of airtime appears as its own row with the following fields:

DOWNLOAD

To determine which shows are available, a daily inventory file, named by date in YYYYMMDD format, is available for each day from July 1, 2009 through present and lists all of the files for that day. For example, for June 1, 2020 the file would be:

The inventory files for the most recent 72 hours are updated throughout the day as shows complete processing.
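A minimal sketch of walking the daily inventory files might look like the following; the base URL and the ".txt" extension are placeholders, not the dataset's actual download location.

```python
from datetime import date, timedelta
import urllib.request

# Placeholder base URL and extension -- substitute the dataset's actual
# inventory location as documented above.
BASE_URL = "https://example.com/vgeg/inventory/"

def inventory_url(day: date) -> str:
    """Build the inventory filename for a given day (YYYYMMDD naming)."""
    return f"{BASE_URL}{day.strftime('%Y%m%d')}.txt"

# Fetch the last seven days of inventory files. Remember that the most
# recent 72 hours may still be updating as shows finish processing.
for offset in range(1, 8):
    day = date.today() - timedelta(days=offset)
    try:
        with urllib.request.urlopen(inventory_url(day)) as resp:
            files = resp.read().decode("utf-8").splitlines()
        print(day.isoformat(), len(files), "files")
    except Exception as err:
        print(day.isoformat(), "unavailable:", err)
```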

The complete dataset is also available in BigQuery:
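As an illustration of the kind of longitudinal question mentioned earlier (such as monthly protest coverage), here is a hedged sketch using the BigQuery Python client; the table name and the "date"/"entities"/"name" columns are placeholders rather than the table's documented schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and column names -- consult the table's actual schema.
sql = """
SELECT
  FORMAT_TIMESTAMP('%Y-%m', date) AS month,
  COUNT(*) AS protest_seconds
FROM `your-project.your_dataset.vgeg_v2`, UNNEST(entities) AS entity
WHERE LOWER(entity.name) = 'protest'
GROUP BY month
ORDER BY month
"""

# Count the seconds of airtime per month in which a protest-related label
# was detected, sketching the longitudinal protest question above.
for row in client.query(sql).result():
    print(row.month, row.protest_seconds)
```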

The complete raw visual API output is also available, with the files listed in the daily inventory files above.

Note that this is a preliminary dataset that may be missing some shows or have null or incomplete values for some fields. We are excited to explore this dataset alongside you to learn how we can use machines to peer for the first time into the visual world of the news. Remember that this dataset was constructed entirely by machine, so you will undoubtedly encounter errors, and all labels are the result of algorithmic decisions, not human editorialization.

Stay tuned for a series of analyses and a human-friendly research interface to this data coming shortly!

We’re tremendously excited to see the kinds of advanced multimodal visual analyses you’re able to do with this powerful new non-consumptive dataset!