Computers Watching Ads: Deep Learning Meets Campaign 2016

What would it look like to have a computer watch all of the 2016 political ads using "deep learning" neural network algorithms and describe them second by second? Using the Internet Archive's new Political TV Ad Archive, launched two weeks ago, we took all 267 distinct ads totaling 196 minutes (which have collectively aired 72,807 times as of this writing) and fed them through Google's Cloud Vision API. Since the Vision API currently operates only on static imagery, not video, each campaign ad was downloaded as an MPEG4 file and split into a sequence of frames, one per second, using the Linux "ffmpeg" utility ("ffmpeg -i ./VIDEO.MP4 -r 1 frames%d.jpg").
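For readers who want to repeat the frame extraction across many ads, the ffmpeg command above can be wrapped in a short script. This is only a sketch: the function names and the output directory layout are hypothetical, and the only part taken from the workflow described here is the ffmpeg invocation itself ("-r 1" to emit one frame per second).

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, fps=1):
    """Return the ffmpeg argument list that writes `fps` frames per second as JPEGs."""
    # Same pattern as the command in the text: frames1.jpg, frames2.jpg, ...
    out_pattern = str(Path(out_dir) / "frames%d.jpg")
    return ["ffmpeg", "-i", str(video_path), "-r", str(fps), out_pattern]


def extract_frames(video_path, out_dir, fps=1):
    """Run ffmpeg (must be installed) and return the frame paths in numeric order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, out_dir, fps), check=True)
    # Sort by the frame number so that frames10.jpg comes after frames9.jpg.
    return sorted(Path(out_dir).glob("frames*.jpg"),
                  key=lambda p: int(p.stem[len("frames"):]))
```

Calling extract_frames("VIDEO.MP4", "frames_out") would then yield roughly one JPEG per second of video, numbered from 1.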

All 11,757 resulting frames were then passed to the Cloud Vision API one by one. The output of the Vision API's analysis of each image is available in JSON format, recording the list of descriptive tags, human faces and their corresponding emotions, OCR'd text, recognized geographic landmarks, Google SafeSearch detection, and logos identified by the Vision API's algorithms.
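This post does not describe how each request was issued, so the following is only a sketch of what a per-frame request could look like using the Vision API's REST interface. It builds the JSON body for one image, asking for the six kinds of output listed above. The function name and the maxResults value are assumptions made for illustration; the body would be sent as an HTTP POST to the API's images:annotate endpoint with your own credentials.

```python
import base64
import json

# One Vision API feature type per kind of output mentioned in the text.
FEATURE_TYPES = [
    "LABEL_DETECTION",
    "FACE_DETECTION",
    "TEXT_DETECTION",
    "LANDMARK_DETECTION",
    "LOGO_DETECTION",
    "SAFE_SEARCH_DETECTION",
]


def build_annotate_request(image_bytes, max_results=10):
    """Build the JSON-serializable body for a single-image annotate request.

    max_results=10 is an arbitrary illustrative value, not the setting
    used to produce the dataset.
    """
    content = base64.b64encode(image_bytes).decode("ascii")
    features = [{"type": t, "maxResults": max_results} for t in FEATURE_TYPES]
    return {"requests": [{"image": {"content": content}, "features": features}]}


def request_body_for_file(path):
    """Read a frame from disk and return the request body as a JSON string."""
    with open(path, "rb") as handle:
        return json.dumps(build_annotate_request(handle.read()))
```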

By examining each video second by second, it is possible to treat campaign ads as content sequences, preserving their temporal structure and making it possible to identify patterns in where within an advertisement certain kinds of content tend to appear.
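As one illustration of this kind of sequence analysis, the sketch below counts how often each label appears in the opening, middle, and closing portions of an ad. It is not the analysis performed for this post. The function name and the input format (tuples of ad identifier, second, and list of labels) are hypothetical, and it assumes seconds are numbered from 1 and that an ad's length equals its highest observed second.

```python
from collections import Counter, defaultdict


def label_position_histogram(frames, bins=4):
    """Count label occurrences by relative position within each ad.

    frames: iterable of (ad_id, second, labels) tuples.
    Returns {label: Counter({bin_index: count})}; bin 0 is the start of an ad.
    """
    frames = list(frames)

    # Approximate each ad's length by its highest observed second.
    last_second = defaultdict(int)
    for ad_id, second, _ in frames:
        last_second[ad_id] = max(last_second[ad_id], second)

    histogram = defaultdict(Counter)
    for ad_id, second, labels in frames:
        # Relative position in [0, 1), guarding against zero-length ads.
        position = max(second - 1, 0) / max(last_second[ad_id], 1)
        bin_index = min(int(position * bins), bins - 1)
        for label in labels:
            histogram[label][bin_index] += 1
    return histogram
```

Normalizing by ad length lets ads of different durations be compared, so a label concentrated in the last bin points to content that tends to close an ad regardless of how long the ad runs.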

To our knowledge, this is one of the first applications of deep learning at this scale to the content analysis of television advertising. In short, computers watching television.


Unlike the other text-based GDELT datasets, please note that this pilot dataset is extremely experimental and is being released as an alpha release. The use of deep learning algorithms for image recognition is still a highly experimental area of active research and the application use case presented here is an extremely difficult one. In particular, television ads tend to be relatively low resolution, often below the recommended resolution of the Vision API. Furthermore, the Vision API was designed for static still imagery, not still frame captures from video, and the resulting motion blur, motion compression artifacts and other video issues can negatively impact the results.

What this means is that you will almost certainly encounter a certain level of error in the categorizations and other information computed about each image. Remember that all tags are applied 100% automatically with NO HUMAN INTERVENTION and mistaken categorizations or other tags represent computer algorithm errors, NOT editorial statements. Remember that computer image recognition at these scales is still in its relative infancy and the underlying algorithms are encountering large amounts of imagery utterly unlike anything they’ve ever seen before, so this data stream is really pushing the boundaries of current deep learning recognition and will make mistakes. Please email with the image URL and mistaken metadata fields if you find any particularly significant errors so that the system can be constantly improved and refined.


UPDATE: 2/7/2016 the file now contains the URLs to the image frames analyzed

With the caveats above in mind, you can download the final dataset below. Each row represents one second of one of the 267 campaign ads. There are seven columns: the Internet Archive Unique Identifier of the ad; the Archive's Details URL, which offers more detail about the ad (along with the video itself); the specific second of the video examined in that row; the ZIP file containing the image frame (images are split across two ZIP files totaling 350MB); the filename of the image frame within the ZIP file; the online preview of the image (this URL allows you to download the image frame, but it can be very slow and should be used only for spot checking a small selection of images; download the ZIP files to work with large subsets of the imagery); and the full JSON output of the Vision API's analysis of the image.

The JSON should be relatively self-explanatory, but the primary fields include "labelAnnotations" (the specific list of descriptive tags assigned to the image), "faceAnnotations" (the list of human faces and their bounding boxes and other attributes), "landmarkAnnotations" (recognized prominent geographic landmarks found in the image), "logoAnnotations" (well known logos identified in the image), "textAnnotations" (OCR text recognition), and "safeSearchAnnotation" (Google SafeSearch results).
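A minimal sketch of reading the file in Python follows. Several details are assumptions not stated in this post: that the file is tab-delimited with no header row, that the seven columns appear in the order described above, and that each label entry carries a "description" field as in the Vision API's standard output. The column names are hypothetical labels chosen for readability.

```python
import json

# Hypothetical names for the seven columns, in the order described above.
COLUMNS = ["identifier", "details_url", "second", "zip_file",
           "frame_filename", "preview_url", "vision_json"]


def read_rows(lines):
    """Yield one dict per frame from an iterable of tab-delimited lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["second"] = int(row["second"])
        vision = json.loads(row["vision_json"])
        # Unwrap a {"responses": [...]} envelope if one is present (assumption).
        if isinstance(vision, dict) and vision.get("responses"):
            vision = vision["responses"][0]
        row["vision"] = vision
        yield row


def labels(row):
    """Return the descriptive tags assigned to the frame, if any."""
    return [a["description"] for a in row["vision"].get("labelAnnotations", [])]
```

With an open file handle, something like "for row in read_rows(handle): print(row['identifier'], row['second'], labels(row))" would list the tags for each second of each ad.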