The GDELT Project

What Google's Cloud Video AI Sees Watching Decade Of Television News: The Visual Global Entity Graph 2.0

UPDATE (8/7/2020): Thanks to a Google Cloud COVID-19 Research Grant to the Media-Data Research Consortium, this dataset has been vastly expanded to cover all of 2020 and major disease outbreaks of the past decade.

What would it look like to have Google's state-of-the-art video understanding system, Cloud Video AI, watch a decade of ABC, CBS and NBC evening television news broadcasts (2010 to present) and CNN (Jan. 25, 2020 to present) and describe the visual objects and activities it sees second by second, constructing a non-consumptive chronology of the visual themes and narratives that have defined global events of the past decade? How do machines "see" the news, and what does it look like to understand television news through their algorithmic eyes? What fundamentally new kinds of research and journalistic questions become possible when, for the first time, we have a rich non-consumptive visual index of television news?

Today we are incredibly excited to unveil the results: a powerful new non-consumptive visual chronology of almost 9,700 broadcasts totaling more than 18 million seconds of airtime, built by using Google’s Cloud Video API to watch ten years of ABC, CBS and NBC evening news broadcasts from the Internet Archive’s Television News Archive, and updating every 30 minutes with a rolling 24-hour delay.

Despite its rich visual-first nature, television news today is primarily explored through the modality of text. The Internet Archive’s Television News Archive has been a leader in this space, helping to popularize timecoded keyword search of closed captioning transcripts in the library context and exploring new research interfaces to television. While these keyword search systems provide incredible opportunities for exploring coverage trends and can be read by textual deep learning systems to catalog the things and themes they mention, at the end of the day, the very visual world that separates television from radio and the online world is absent from such analyses.

At the same time, the last few years have brought profound advances in machine vision, with algorithmic visual understanding moving from the research lab to production everyday use. What would it look like to have a state-of-the-art machine learning system watch a decade of television news broadcasts across “the big three” networks and catalog their visual themes?

Over the past several years we've explored how AI can help us better understand the visual world of television. During the 2016 election we converted all 267 campaign ads monitored by the Archive into still image sequences and had Google's Cloud Vision API catalog their visual narratives. Last year we used Google's Cloud Video, Vision, Speech-to-Text and Natural Language APIs to watch a week of television news from CNN, MSNBC, Fox News and the morning and evening ABC, CBS, NBC and PBS broadcasts, releasing more than 600GB of annotations that were used to explore a wide range of questions about just what it is we see when we turn on our televisions. We've even explored having Google's Cloud Natural Language API "watch" 270,000 hours of television news by reading their closed captioning spoken word transcripts and compiling the entities mentioned within. In December we released an initial prototype of the Visual Global Entity Graph 1.0, and the lessons we've learned from that dataset have been used to create this massive new 2.0 release, which will now also update daily!

To explore what it would look like to understand a decade of television news through AI, the half-hour evening news broadcasts of ABC, CBS and NBC from July 2010 through present and CNN from January 25, 2020 through present from the Internet Archive’s Television News Archive were analyzed within a restricted-access, non-consumptive computational digital reference library using Google’s Cloud Video API. In particular, we used its labeling feature, in which it visually analyzes each second of footage and assigns a list of predefined labels describing the objects and activities depicted within.

The end result is a list of the visual themes (primarily objects and activities) found in each broadcast by second, charting its visual narratives. Such a chronology offers a powerful and unique counterpart to the textual chronology of its closed captioning, allowing researchers for the first time to consider the visual dimension of television news.

What are the visual themes associated with the key narratives and events of the past decade? By combining this dataset with the captioning chronology, it becomes possible to examine what kinds of imagery are typically depicted onscreen when a given topic is discussed and how that imagery has changed over the years. Longitudinal questions, such as the amount of coverage by month over the past decade devoted to civil mobilizations like protests, can also be explored for the first time. This enables an entirely new approach to understanding visual narratives by using machines to sift through vast archives of video.

Perhaps most importantly, this new dataset will allow researchers to explore for the first time how a better understanding of the visual dimensions of news might help combat the spread of falsehoods and better assess the diffusion of contested narratives and inorganic campaigns. What might we learn from the rich visual processes of broadcast journalism that could be applied to increase trust in digital journalism and contextualize the news in ways that combat misunderstanding and falsehood?

The dataset consists of one file per broadcast, running from July 15, 2010 through present (with a 24-72 hour rolling delay). Each file is in newline delimited JSON format, with each row representing one second of broadcast airtime and containing an array of all of the visual entities identified by the Video API from that second.

This raw non-consumptive dataset is intended for those with data science backgrounds comfortable with processing raw JSON data. We are in the process of constructing a user-friendly interface to this data to support researchers and journalists in their explorations of it, so stay tuned!
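To give a concrete sense of what working with these files involves, below is a minimal Python sketch that streams one broadcast file and tallies how often each visual label appears. The filename and the "entities"/"name" field names are illustrative assumptions, not the dataset's documented schema.

```python
import json
from collections import Counter

# Minimal sketch: stream one newline-delimited JSON broadcast file and
# tally how often each visual label appears across the broadcast's seconds.
# NOTE: the filename and the "entities"/"name" keys below are illustrative
# assumptions, not the dataset's documented schema.
label_counts = Counter()

with open("broadcast.json", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        second = json.loads(line)  # one record = one second of airtime
        for entity in second.get("entities", []):
            label_counts[entity.get("name", "")] += 1

# Show the most frequently detected visual labels for this broadcast.
for label, count in label_counts.most_common(20):
    print(f"{count:6d}  {label}")
```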

DETECTION MODEL CHANGES

The original dataset released on February 3, 2020 contained annotations produced by two different models. Videos annotated from November 2019 through January 29, 2020 had been annotated using the "v1p3beta1" endpoint with the "builtin/stable" model and "stationarycamera" set to true, while videos after that were annotated with the "v1" endpoint with the "builtin/latest" model and "stationarycamera" set to false. The stationary camera setting was used for the earlier broadcasts on the assumption that it would provide better results, since a considerable amount of television news footage is filmed either in a fixed studio setting or using a stationary tripod-mounted camera in the field. Experimentally, we determined that television news footage actually contains far more moving-camera footage than expected and that the setting did not provide meaningful improvements in detection accuracy for the specific characteristics of television news. One unintended consequence of the stationary camera setting for this earlier content is that detection labels could be propagated across brief scene changes. For example, if a scene features elephants in a landscape, briefly cuts to a commentator for a few seconds, and then returns to the elephants, the results may still list "elephant" as a detected entity during that brief commentator cutaway for videos annotated with this setting. This created a discontinuity, with some labels showing profound changes across the model boundary.

To address this, as of February 28, 2020 we have replaced the original dataset, including the raw and processed JSON files and the BigQuery table, with a new reprocessed dataset in which the entire set of videos was annotated using the same "v1" endpoint with the "builtin/latest" model and "stationarycamera" set to false. All videos have now been annotated using the exact same model, so results will be continuous over the complete decade-long collection of videos.
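For readers curious what that configuration looks like in practice, here is a hedged sketch of invoking the "v1" label detection endpoint with the "builtin/latest" model and a non-stationary camera using the google-cloud-videointelligence Python client. The bucket URI is a placeholder and the detection mode shown is our assumption, not a documented detail of this dataset's pipeline; the exact request shape also varies by client library version.

```python
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# Label detection settings matching the reprocessed dataset's description:
# "builtin/latest" model with the stationary camera heuristic disabled.
config = videointelligence.LabelDetectionConfig(
    label_detection_mode=videointelligence.LabelDetectionMode.SHOT_AND_FRAME_MODE,  # assumed mode
    stationary_camera=False,
    model="builtin/latest",
)
context = videointelligence.VideoContext(label_detection_config=config)

operation = client.annotate_video(
    request={
        "input_uri": "gs://your-bucket/broadcast.mp4",  # placeholder path
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "video_context": context,
    }
)
result = operation.result(timeout=900)

# Print the entities detected at the frame level for the first result.
for annotation in result.annotation_results[0].frame_label_annotations:
    print(annotation.entity.description)
```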

RAW VISUAL API OUTPUT

For those interested in the raw visual output produced by the Cloud Video API for each video (which includes all visual-related annotations, but not speech recognition), those files are also now available for download. They offer annotations at the frame level, with time offsets recorded in nanoseconds, enabling fundamentally new kinds of analyses.
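As a rough sketch of how one might bucket these frame-level annotations into whole seconds, the snippet below assumes the raw JSON serializes time offsets as protobuf-style objects with "seconds" and "nanos" fields and uses the Video API's snake_case field names; the actual files may use a different serialization (for example, camelCase keys or "12.5s"-style duration strings), so treat the field names as assumptions.

```python
import json
from collections import defaultdict

def offset_to_seconds(offset):
    """Convert a protobuf-style time offset ({"seconds": ..., "nanos": ...})
    into a float number of seconds. This field layout is an assumption about
    how the raw files serialize the API's Duration values."""
    return offset.get("seconds", 0) + offset.get("nanos", 0) / 1e9

labels_by_second = defaultdict(set)

with open("broadcast_raw.json", "r", encoding="utf-8") as f:
    raw = json.load(f)

# Walk the frame-level label annotations and record each detection under
# the whole second of airtime it falls within.
for annotation in raw["annotation_results"][0].get("frame_label_annotations", []):
    description = annotation["entity"]["description"]
    for frame in annotation["frames"]:
        sec = int(offset_to_seconds(frame["time_offset"]))
        labels_by_second[sec].add(description)

print(labels_by_second.get(60, set()))  # labels visible during the 61st second
```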

TECHNICAL REFERENCE

Each second of airtime appears as its own row with the following fields:

DOWNLOAD

To determine which shows are available, a daily inventory file, named by date in YYYYMMDD format, is available for each day from July 1, 2009 through present and lists all of the files for that day. For example, for June 1, 2020 the file would be:

The inventory files for the most recent 72 hours are updated throughout the day as shows complete processing.
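A minimal sketch of walking the daily inventory files might look like the following; the base URL and the ".txt" extension are placeholders, not the dataset's actual download location.

```python
from datetime import date, timedelta
import urllib.request

# Placeholder base URL and extension -- substitute the dataset's actual
# inventory location as documented above.
BASE_URL = "https://example.com/vgeg/inventory/"

def inventory_url(day: date) -> str:
    """Build the inventory filename for a given day (YYYYMMDD naming)."""
    return f"{BASE_URL}{day.strftime('%Y%m%d')}.txt"

# Fetch the last seven days of inventory files. Remember that the most
# recent 72 hours may still be updating as shows finish processing.
for offset in range(1, 8):
    day = date.today() - timedelta(days=offset)
    try:
        with urllib.request.urlopen(inventory_url(day)) as resp:
            files = resp.read().decode("utf-8").splitlines()
        print(day.isoformat(), len(files), "files")
    except Exception as err:
        print(day.isoformat(), "unavailable:", err)
```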

The complete dataset is also available in BigQuery:
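As an illustration of the kind of longitudinal question mentioned earlier (such as monthly protest coverage), here is a hedged sketch using the BigQuery Python client; the table name and the "date"/"entities"/"name" columns are placeholders rather than the table's documented schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and column names -- consult the table's actual schema.
sql = """
SELECT
  FORMAT_TIMESTAMP('%Y-%m', date) AS month,
  COUNT(*) AS protest_seconds
FROM `your-project.your_dataset.vgeg_v2`, UNNEST(entities) AS entity
WHERE LOWER(entity.name) = 'protest'
GROUP BY month
ORDER BY month
"""

# Count the seconds of airtime per month in which a protest-related label
# was detected, sketching the longitudinal protest question above.
for row in client.query(sql).result():
    print(row.month, row.protest_seconds)
```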

The complete raw visual API output is also available, with the files listed in the daily inventory files above.

Note that this is a preliminary dataset that may be missing some shows or have null or incomplete values for some fields. We are excited to explore this dataset alongside you to learn how we can use machines to peer for the first time into the visual world of the news. Remember that this dataset was constructed entirely by machine, so you will undoubtedly encounter errors, and all labels are the result of algorithmic decisions, not human editorialization.

Stay tuned for a series of analyses and a human-friendly research interface to this data coming shortly!

We’re tremendously excited to see the kinds of advanced multimodal visual analyses you’re able to do with this powerful new non-consumptive dataset!