What would it look like to have a computer watch a week of television news from CNN, MSNBC and Fox News, along with the morning and evening broadcasts of ABC, CBS, NBC and PBS, using off-the-shelf video, image, speech recognition and natural language deep learning algorithms, and describe everything it saw? How might the non-consumptive analysis of deep learning help us understand the visual world of television news in an entirely new light, and, especially, how might those new insights help us combat the global rise of misinformation, disinformation and foreign influence?
Three years ago, we used Google’s Vision AI system to watch all 267 distinct presidential campaign ads cataloged by the Internet Archive at that point and describe what it saw. The resulting analysis powerfully illustrated the incredible new insights AI could offer in helping to understand the visual world of television at scale.
Three years later, what might it look like to take this a step further, analyzing a full week of television news?
Given that deep learning algorithms require no human intervention, could AI offer a powerful new non-consumptive workflow for understanding video content? How might transformations like reducing videos to a series of 1fps preview thumbnails analyzed as isolated still images provide new modalities of non-consumptive analysis that create even further distance from the original source material without substantially altering the results?
Thinking more broadly, could these thematic annotations, coupled with reverse image search, offer an insightful new tool for addressing the poisoning of our information commons?
To explore these questions further, we worked with the Internet Archive’s Television News Archive to analyze one week of television news coverage, covering CNN, MSNBC and Fox News and the morning and evening broadcasts of San Francisco affiliates KGO (ABC), KPIX (CBS), KNTV (NBC) and KQED (PBS) from April 15 to April 22, 2019, totaling 812 hours of television news.
This week was selected because it included two major stories, one national (the release of the Mueller report on April 18th) and one international (the Notre Dame fire on April 15th).
Each video was processed using Google’s Video AI API with the following features enabled: Label Detection (describing the objects and activities depicted in each frame), Shot Change Detection (flagging when there is a scene change), Explicit Content Detection (unlikely to trigger in broadcast television news, but potentially identifying depictions of violence), OCR Text Detection (transcribing chyrons and other overlays) and Object Tracking (drawing bounding boxes around the major objects in a scene and tracking how they move).
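For readers who want to experiment with this step themselves, the sketch below shows roughly what such a request might look like using the google-cloud-videointelligence Python client; the Cloud Storage path is illustrative and the exact call signature may vary across library versions, so treat this as a minimal sketch rather than the precise pipeline used here.

```python
# Minimal sketch (not the exact pipeline used here): submitting one broadcast to the
# Video Intelligence API with the five features described above. The bucket path is
# illustrative and credentials are assumed to be configured in the environment.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

features = [
    videointelligence.Feature.LABEL_DETECTION,
    videointelligence.Feature.SHOT_CHANGE_DETECTION,
    videointelligence.Feature.EXPLICIT_CONTENT_DETECTION,
    videointelligence.Feature.TEXT_DETECTION,   # OCR of chyrons and other overlays
    videointelligence.Feature.OBJECT_TRACKING,
]

operation = client.annotate_video(
    request={
        "input_uri": "gs://your-bucket/CNNW_20190415_000000.mp4",  # illustrative path
        "features": features,
    }
)

# Annotation is a long-running operation; a full broadcast can take a while to process.
result = operation.result(timeout=3600)
print(result.annotation_results[0].segment_label_annotations[:5])
```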
Each video was also processed using “ffmpeg” to convert it into a sequence of 1fps preview images, which were then processed using Google’s Vision AI API with the following features enabled: Face Detection (counting the number of human faces present in an image; it does NOT perform any kind of facial recognition), Logo Detection (recognizing major logos), Label Detection (describing the objects and activities depicted), Landmark Detection (attempting to identify the geographic location of the image’s background), OCR Text Detection (transcribing chyrons and other overlays), Image Properties (computing general image properties like color distribution), Safe Search (unlikely to trigger in broadcast television news, but potentially identifying depictions of violence), Object Localization (drawing bounding boxes around the major objects in a scene) and Web Entities (performing the equivalent of a reverse Google Images search, identifying both images on the web that are similar to the given video frame and the major topics those similar images are captioned with across the web).
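A similar sketch for this still-image pathway, assuming ffmpeg on the PATH, the google-cloud-vision Python client and illustrative file names (it annotates a single extracted frame rather than the full 1fps sequence):

```python
# Minimal sketch: reducing a broadcast to 1fps still frames with ffmpeg, then
# requesting the full set of Vision AI features for one of those frames.
# File names are illustrative; credentials are assumed to be configured.
import subprocess
from google.cloud import vision

# 1. Convert the video into one JPEG per second of airtime.
subprocess.run(
    ["ffmpeg", "-i", "broadcast.mp4", "-vf", "fps=1", "frames/%06d.jpg"],
    check=True,
)

# 2. Annotate a single extracted frame with every feature listed above.
client = vision.ImageAnnotatorClient()
with open("frames/000130.jpg", "rb") as f:
    image = vision.Image(content=f.read())

feature_types = [
    vision.Feature.Type.FACE_DETECTION,
    vision.Feature.Type.LOGO_DETECTION,
    vision.Feature.Type.LABEL_DETECTION,
    vision.Feature.Type.LANDMARK_DETECTION,
    vision.Feature.Type.TEXT_DETECTION,
    vision.Feature.Type.IMAGE_PROPERTIES,
    vision.Feature.Type.SAFE_SEARCH_DETECTION,
    vision.Feature.Type.OBJECT_LOCALIZATION,
    vision.Feature.Type.WEB_DETECTION,
]
response = client.annotate_image(
    {"image": image, "features": [vision.Feature(type_=t) for t in feature_types]}
)
print(len(response.face_annotations), "faces;", len(response.label_annotations), "labels")
```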
Google’s Video AI API has built-in speech transcription available, but in this case we decided to process each video through Google’s Cloud Speech-to-Text API since, looking to the future, it supports a more expansive set of languages (120 and counting). Each video was processed using “ffmpeg” to generate a 16-bit PCM WAV file of its audio, which was then processed with the Speech-to-Text API using the “video” model, with the Word-Level Confidence (reporting how confident the API was in its identification of each word), Word Timestamps (the precise start and end timecodes of each word) and Automatic Punctuation (inserting punctuation in accordance with standard English grammar) features enabled.
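A rough sketch of this audio pathway, assuming ffmpeg, the google-cloud-speech Python client and an illustrative Cloud Storage bucket (audio of this length must be referenced from Cloud Storage for long-running recognition):

```python
# Minimal sketch: extracting 16-bit PCM audio with ffmpeg and transcribing it with the
# Cloud Speech-to-Text "video" model. File and bucket names are illustrative, and the
# WAV file must be uploaded to Cloud Storage before long-running recognition.
import subprocess
from google.cloud import speech

subprocess.run(
    ["ffmpeg", "-i", "broadcast.mp4", "-vn", "-acodec", "pcm_s16le",
     "-ar", "16000", "-ac", "1", "broadcast.wav"],
    check=True,
)

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",
    enable_word_confidence=True,        # per-word confidence scores
    enable_word_time_offsets=True,      # per-word start/end timecodes
    enable_automatic_punctuation=True,  # standard English punctuation
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/broadcast.wav")  # illustrative

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)
for result in response.results:
    print(result.alternatives[0].transcript)
```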
Finally, both the original station-generated closed captioning and the machine-generated Cloud Speech-to-Text transcripts were processed with Google’s Natural Language API in its Analyze Entities mode to identify entities like persons, locations and major topics.
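For illustration, entity analysis on a single snippet of transcript text might look like the sketch below, using the google-cloud-language Python client (the sentence is illustrative):

```python
# Minimal sketch: entity analysis of a short snippet of transcript text with the
# Natural Language API. The sentence below is illustrative.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The Mueller report was released as Notre Dame burned in Paris.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(request={"document": document})
for entity in response.entities:
    # Each entity carries a type (PERSON, LOCATION, EVENT, ...) and a salience score.
    print(entity.name, entity.type_.name, round(entity.salience, 3))
```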
Thus, the Video AI API analyzes the visual contents of the video in their native form. The Vision AI API captures the results of converting videos to a sequence of still images. The Speech-to-Text API makes it possible to generate transcripts of videos in 120 languages, allowing their spoken contents to be analyzed. Finally, the Natural Language API allows those spoken word transcripts to be analyzed to understand the topics discussed in the course of each broadcast. In total, 1.35TB of 1fps still image frames, 283GB of video, 239GB of audio, 1.6GB of machine-generated transcripts and 61MB of captioning were processed.
Combined, this represents the four modalities of current deep learning content understanding: video, imagery, speech and text.
Why analyze television news using deep learning instead of having humans watch the broadcasts and catalog what they see?
Beyond the obvious greater scalability of automated analysis, a far more important benefit is that deep learning brings non-consumptive analysis to the visual world.
Books are now routinely analyzed using machine learning algorithms that can read millions of books and offer narrative, topical, linguistic and other statistics about each book without a human ever being able to see any of the underlying text. Visual materials have long resisted such approaches due to the lack of algorithms that could reliably understand them. As deep learning algorithms have matured, we now have tools that can bring non-consumptive analysis to images and video, watching hundreds/thousands/millions/billions of hours of television and summarizing the key visual and spoken narratives without a human ever being able to watch any of the source material.
This takes on even greater importance in this era of rampant misinformation, disinformation and foreign influence.
Once those deep learning algorithms have generated a catalog of the narratives and topics discussed in a given television news broadcast, that broadcast can be connected to other coverage of each event it covers from around the world. A tweet claiming that Notre Dame is burning could be instantly connected to live verified and trustworthy footage on the major television networks confirming the story and providing additional context. Similarly, a television broadcast about the fire could be connected to relevant online references and local coverage from the scene to help a viewer understand the story from multiple angles and explore more of the detailed history of the cathedral.
Television is often first on the scene of major breaking stories and offers critical live coverage, making it an ideal medium through which to combat the flood of falsehoods that propagate during major events. Being able to connect television coverage to online news and social media would offer a powerful tool to combat this spread.
Visual similarity could also be used to combat “deep fakes” purporting to be real footage from major stations. Imagine a viral tweet purporting to be a clip from a television station announcing a major breaking story or making a harmful on-air statement. Image analysis could determine that the clip never actually aired on the station, confirming it as a hoax.
What might it look like to visually compare television coverage of a story to online coverage?
Recall that we applied Google Vision AI’s “Web Entities” analysis to the 1fps preview images of each video. Web Entities performs the equivalent of a reverse Google Images search, scanning all of the imagery Google has seen across the open web to find images that look most similar to the given image. In this case, the Vision AI API took each 1fps frame and identified any web images that looked similar to it, connecting television to the web.
If the Vision AI finds a similar image, it returns the URL of the matching image and whether it was merely somewhat similar looking, whether part of it was a near-duplicate or whether the entire web image was almost identical to the search image. It also summarizes the top topics in the captions of those similar web images and returns them as descriptive labels.
This means that a frame from a clip of Robert Mueller will return a list of entities including his name, Donald Trump, the presidency, the White House, Russia and other related topics. The API does not perform facial recognition of any kind. Instead, it is merely searching the open web for similar-looking images and returning the top topics they are captioned with.
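As a concrete illustration, the sketch below requests only the Web Detection results for a single 1fps frame using the Vision client’s single-feature helper; the file path is illustrative, and the three matching tiers correspond to the “almost identical,” “near-duplicate of part of the image” and “merely similar looking” categories described above.

```python
# Minimal sketch: pulling only the Web Detection results for a single 1fps frame.
# The frame path is illustrative.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("frames/000130.jpg", "rb") as f:
    response = client.web_detection(image=vision.Image(content=f.read()))

web = response.web_detection
for entity in web.web_entities:             # consensus topics from similar web images
    print("entity:", entity.description, round(entity.score, 2))
for match in web.full_matching_images:      # near-identical copies found on the web
    print("full match:", match.url)
for match in web.partial_matching_images:   # web images sharing a near-duplicate region
    print("partial match:", match.url)
for match in web.visually_similar_images:   # merely similar-looking images
    print("similar:", match.url)
```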
In short, without actually understanding what is in the image, we can essentially “crowdsource” the open web to generate a consensus description of each frame.
Imagine, ultimately, a global consortium of diverse public stakeholders working together to explore how deep learning and other technologies can be brought to bear in creative and innovative ways to combat the spread of digital falsehoods. Using a kind of “virtual reading room,” researchers could apply non-consumptive analytics to find new ways of identifying, understanding and combatting the spread of false information, especially multimodal approaches that look across video, imagery, the spoken word and traditional text. From identifying “deep fakes” and falsified facts to lending context to complex debates, such an innovative collaboratory, centered around non-consumptive analysis, could offer a powerful testbed for examining the next generation of approaches to combating online misinformation, disinformation and foreign influence.
To jumpstart this conversation, we are releasing today an initial dataset of selected results from the analysis of the week of television news through Google’s Video AI and Vision AI APIs and the results of the Cloud Natural Language API’s analysis of the closed captioning and Cloud Speech-to-Text transcript. The actual video files themselves and the full captioning and Speech-to-Text transcripts are not available, but these machine annotations offer a first glimpse at what this new form of non-consumptive deep learning analysis of television news might look like.
Once again, the videos and the full captioning and transcripts are not available, only these machine-annotated results.
Deep learning is still very much experimental and an area of active research, and the results here were 100% machine generated. This means you will almost certainly encounter a certain level of error in the results; these are algorithmic errors, not editorial statements.
ANNOTATION DATASETS
With the caveats above in mind, the following annotation datasets are available for download. The videos, captioning and transcripts themselves are not available, only these machine annotations.
The Internet Archive’s Television News Archive makes it possible to view a brief preview clip of any portion of each of these broadcasts, from a minimum of 10 seconds to a maximum of 60 seconds. To view what was being said 2 minutes and 10 seconds (130 seconds) into CNN’s April 14th 5PM PDT broadcast, you would use the URL https://archive.org/details/CNNW_20190415_000000_Andrew_Yang_CNN_Presidential_Town_Hall/start/125/end/135, where 125 is the start time in seconds and 135 is the end time in seconds; the desired moment (130 seconds into the broadcast) sits in the middle of this minimum 10 second clip.
Any timecode can be converted into this preview URL format by simply using the format “https://archive.org/details/SHOWNAME/start/START_IN_SECONDS/end/END_IN_SECONDS”.
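For example, a tiny helper to build these preview URLs from a show identifier and a second offset might look like the following (the function name is ours, not part of the Archive’s API):

```python
# Minimal sketch: building a preview URL centered on a given second offset, padded to
# the Archive's minimum 10-second clip length. The helper function name is ours.
def preview_url(show_id: str, offset_seconds: int, clip_length: int = 10) -> str:
    start = max(0, offset_seconds - clip_length // 2)
    end = start + clip_length
    return f"https://archive.org/details/{show_id}/start/{start}/end/{end}"

# 130 seconds into the CNN broadcast from the example above:
print(preview_url("CNNW_20190415_000000_Andrew_Yang_CNN_Presidential_Town_Hall", 130))
# -> https://archive.org/details/CNNW_20190415_000000_Andrew_Yang_CNN_Presidential_Town_Hall/start/125/end/135
```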
The following annotation files are available. The video and vision annotations can be downloaded individually, but combined, all three datasets total around 29GB compressed and around 615GB uncompressed.
- Video AI Annotations. These are available as a set of gzip’d JSON files, one per show (format documentation). The full list of files is listed in FILELIST-VIDEOAIANNOTATIONS.TXT. Combined, the files are 3.3GB compressed and 73GB uncompressed.
- Vision AI Annotations. These are available as a set of gzip’d tar files, one per show, containing individual JSON results for each 1fps frame (format documentation). The full list of files is listed in FILELIST-VISIONAIANNOTATIONS.TXT. Combined, the files are 25GB compressed and 540GB uncompressed.
- Natural Language Annotations. These are available as a set of gzip’d JSON files, one per show (format documentation). One major change is that the character offsets returned by the NLP API have been replaced with timecode offsets: the “beginOffset” character offset field has been replaced with a “timeOffset” field specifying when the entity appears, in seconds from the beginning of the video. In addition, all entities of type “OTHER” have been removed, since otherwise the results would contain every single noun phrase. There are two files per video: “.caption.nlp.annotations.json” contains the results of the NLP API applied to the station-provided captioning, while “.transcript.nlp.annotations.json” contains the results of the NLP API applied to the machine-generated transcript from the Speech-to-Text API. Given their small size, the full set of annotations has been boxed up into a single gzip’d tar file (a small parsing sketch follows below). The tar archive is 33MB compressed and 462MB uncompressed.
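As a hedged illustration of working with these files, the sketch below reads one per-show annotation file and prints each entity mention with its time offset. It assumes the files retain the Natural Language API’s entity/mention JSON layout with “timeOffset” substituted for “beginOffset” and uses an illustrative file name; consult the format documentation for the authoritative schema.

```python
# Hedged sketch: reading one per-show NLP annotation file and printing each entity
# mention with its time offset. Assumes the files keep the Natural Language API's
# entity/mention JSON layout with "timeOffset" in place of "beginOffset"; the file
# name is illustrative (see the format documentation for the actual schema).
import gzip
import json

with gzip.open("SHOWNAME.caption.nlp.annotations.json.gz", "rt", encoding="utf-8") as f:
    annotations = json.load(f)

for entity in annotations.get("entities", []):
    for mention in entity.get("mentions", []):
        offset = mention.get("text", {}).get("timeOffset")
        print(entity.get("type"), entity.get("name"), "at", offset, "seconds")
```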
SUMMARY EXTRACTS
Given the extremely large size of the full annotation datasets, we've also created several summary datasets that allow you to explore slices of the data in a much smaller extracted dataset.
Vision AI Summaries
The following summary files were generated from the Vision AI results. Each is a tar archive with one summary file per video.
- Face Count. Each row represents a 1fps frame in that video, with the first column being the frame number and the second column being the total number of human faces in that frame. Note that the API only counts faces that are clearly visible, meaning it does not count the number of people in crowd shots. Remember that this does not perform any form of facial recognition; it only counts the number of faces present in each frame. [DOWNLOAD]
- Labels. Each row represents a 1fps frame in that video, with the first column being the frame number and the second column being a semicolon-delimited list of all of the Labels the Vision AI associated with that frame. These represent how the AI algorithm itself sees each frame. [DOWNLOAD]
- Matching Images. Each row represents an image that Web Detection returned in the "fullMatchingImages" section. These are images from the open web that closely resemble the given 1fps frame. Note that the API performs fuzzy, not exact, matches, so these may be unrelated images that merely look similar. Each row has four columns: the frame number, the URL of the web page containing the matching image, the title of that web page and the URL of the matching image itself. [DOWNLOAD]
- OCR. Each row represents a 1fps frame in that video, with the first column being the frame number and the second column containing the complete OCR results of all recognizable text in that frame. Tabs and newlines are escaped. [DOWNLOAD]
- OCR Ngrams. This is a simple spreadsheet that parsed the OCR summary extract above and compiled a histogram of all words found onscreen for each station over the week. Counts report the total number of times each word appeared on that station, meaning if it appeared multiple times in a given frame it will be counted multiple times. [DOWNLOAD]
- OCR Ngrams Second Counts. This is a simple spreadsheet that parsed the OCR summary extract above and compiled a histogram of all words found onscreen for each station over the week. Counts report the total number of SECONDS that contained the given word, meaning if a word like "the" appeared multiple times in a single 1fps frame, it will be counted only once. [DOWNLOAD]
- OCR Twitter Mentions: President Trump Tweets. This is a simple spreadsheet that parsed the OCR summary extract above and identified all frames that referenced a tweet from President Trump. The first column is the show, second column is the second offset within the show where the mention occurred and the third is the full OCR'd text of that frame. [DOWNLOAD]
- OCR Twitter Mentions: All Tweets. This is a simple spreadsheet that parsed the OCR summary extract above and identified all frames that appeared to reference a social media account, under the assumption that most appearances of "@" followed by a username are references to a Twitter handle (see the parsing sketch after this list). The first column is the show, the second column is the second offset within the show where the mention occurred, the third is the extracted username and the fourth is the full OCR'd text of that frame. [DOWNLOAD]
- OCR Twitter Mentions: All Tweets Histogram. This is a histogram of the "All Tweets" dataset above that lists all of the extracted Twitter usernames in order by the number of total seconds they were seen across the seven stations. Note that many of these are merely OCR errors of real usernames. [DOWNLOAD]
- Web Entities. Each row represents a 1fps frame in that video, with the first column being the frame number and the second column containing a semicolon-delimited list of all of the Web Entities returned for that frame. The Vision AI API performs the equivalent of a reverse Google Images search for the frame and then tallies up the top topics most commonly found in the captions of the most similar-looking images across the web, offering an estimate of the topics that might be found in the image. [DOWNLOAD]
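To make the formats above concrete, here is a small sketch that tallies Twitter-style handles from one show's OCR summary file, counting each handle at most once per 1fps frame (i.e. per second of airtime), in the spirit of the "All Tweets" extracts. It assumes a simple tab-delimited layout of frame number and escaped OCR text and an illustrative file name, so adjust it to the actual file format as needed.

```python
# Minimal sketch: tallying Twitter-style handles from one show's OCR summary file,
# counting each handle at most once per 1fps frame (i.e. per second of airtime).
# Assumes a tab-delimited layout of (frame number, escaped OCR text) and an
# illustrative file name; adjust to the actual file format as needed.
import collections
import re

HANDLE = re.compile(r"@[A-Za-z0-9_]{1,15}")  # rough Twitter username pattern

seconds_per_handle = collections.Counter()
with open("CNNW_20190415_000000.ocr.txt", encoding="utf-8") as f:
    for line in f:
        if "\t" not in line:
            continue
        frame, text = line.rstrip("\n").split("\t", 1)
        # Unescape newlines, then count each distinct handle once for this frame.
        for handle in set(HANDLE.findall(text.replace("\\n", "\n"))):
            seconds_per_handle[handle.lower()] += 1

for handle, seconds in seconds_per_handle.most_common(10):
    print(handle, seconds)
```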
We are tremendously excited to see how researchers use this prototype annotations dataset. Please contact kalev.leetaru5@gmail.com with questions.