Since its launch in December 2015, the Visual Global Knowledge Graph (VGKG) has cataloged more than half a billion worldwide news images, computing a rich metadata index using the full power of Google's Cloud Vision API. Within this vast archive are more than 5.5 million annotated images of protests, demonstrations and other mass-scale civil mobilizations that have been published by mainstream news outlets across the world. These images offer rich groundtruthed insights into global protest trends, but exploring them up until now has required searching through the more than 545 million records of the full VGKG totaling almost 13TB of JSON.
Given the rapidly growing interest in protest-related datasets, we've decided to create a special extract of these 5.5 million image annotations.
To make it as easy as possible to work with the data, we've created two extracts.
The first is a simple CSV file with three columns: the date we first saw the image in YYYYMMDDHHMMSS format, the URL of the image itself, and the URL of the first article we saw the image in. If we saw the image in multiple articles, only the first instance is reported in the VGKG, with the rest noted in the RelatedImages URL list field in the GKG. This file is intended for researchers who are primarily interested in a list of human-readable URLs of articles they can distribute to human analysts for further research or hand-coding.
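As a rough illustration of working with the CSV extract, the Python sketch below reads the three columns and converts the YYYYMMDDHHMMSS date into a datetime. The ImageRow type, the read_rows helper and the sample line are hypothetical names and data introduced here for illustration; it assumes the file has no header row.

```python
import csv
import io
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class ImageRow(NamedTuple):
    first_seen: datetime
    image_url: str
    article_url: str


def read_rows(lines: Iterable[str]) -> Iterator[ImageRow]:
    """Yield one ImageRow per CSV line (date, image URL, article URL)."""
    for record in csv.reader(lines):
        if len(record) != 3:
            continue  # skip blank or malformed lines rather than failing
        date_str, image_url, article_url = record
        first_seen = datetime.strptime(date_str, "%Y%m%d%H%M%S")
        yield ImageRow(first_seen, image_url, article_url)


# Made-up sample line, for illustration only.
sample = io.StringIO(
    "20190614123000,https://example.com/a.jpg,https://example.com/story.html\n"
)
rows = list(read_rows(sample))
print(rows[0].first_seen.isoformat())  # 2019-06-14T12:30:00
```

For the real file, pass an open file handle (opened with newline="") to read_rows instead of the in-memory sample.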
The second is a massive newline-delimited JSON file containing the same fields as above, along with the full JSON output of the Cloud Vision API's analysis of each image. Each row represents an image and is a JSON object, with the "RawJSON" field containing the escaped JSON output of the Cloud Vision API. To use this dataset, load each JSON record and then extract the contents of the "RawJSON" field and parse as its own JSON record. This dataset is intended for advanced researchers and contains a vast wealth of computed descriptive metadata about each image, representing the state-of-the-art in current machine visual understanding. The rich array of fields, including WebEntities, can be used for extremely sophisticated filtering and analysis. See the documentation for the Cloud Vision API to understand the meaning of the "responses" block and the VGKG documentation for the meaning of the "ImageProperties" block, which includes all embedded EXIF metadata, perceptual hashes, histogram-based entropy metrics and an array of other fields.
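The two-step parse described above can be sketched in Python roughly as follows. The field names DATE, ImageURL, DocumentIdentifier and RawJSON are taken from the export query; the helper names and the sample record are hypothetical, and the sample contains only a single made-up label rather than real Cloud Vision output.

```python
import json
from typing import Iterable, Iterator


def iter_annotations(lines: Iterable[str]) -> Iterator[dict]:
    """Parse each line as JSON, then parse its escaped RawJSON field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)                          # outer record
        record["RawJSON"] = json.loads(record["RawJSON"])  # embedded Vision API JSON
        yield record


def label_descriptions(record: dict) -> list:
    """Return the label descriptions from the first Vision API response."""
    responses = record["RawJSON"].get("responses") or [{}]
    labels = responses[0].get("labelAnnotations", [])
    return [label.get("description", "") for label in labels]


# Build a made-up sample line the same way the file is laid out:
# the Vision API output is serialized to a string inside the outer record.
inner = {"responses": [{"labelAnnotations": [{"description": "Protest", "score": 0.93}]}]}
sample_line = json.dumps({
    "DATE": "20190614123000",
    "ImageURL": "https://example.com/a.jpg",
    "DocumentIdentifier": "https://example.com/story.html",
    "RawJSON": json.dumps(inner),
})

records = list(iter_annotations([sample_line]))
print(label_descriptions(records[0]))  # ['Protest']
```

Because the file is processed one line at a time, the same loop can be run over the uncompressed 200GB file without loading it into memory.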
These datasets consist only of image and article URLs and machine-generated descriptive annotations. No image data of any kind is included. To access the images, use the image and article URLs. Images and articles that no longer exist can frequently be found in the Internet Archive's Wayback Machine through GDELT's collaboration with the Archive to preserve global online news.
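One way to look for an archived copy is to build a Wayback Machine URL from the date and URL columns, since the Wayback Machine addresses captures with the same 14-digit YYYYMMDDHHMMSS timestamp format and redirects to the capture closest to the requested time. The wayback_url helper below is a hypothetical name, and the sketch makes no promise that a capture actually exists for any given URL.

```python
def wayback_url(url: str, timestamp: str) -> str:
    """Build a Wayback Machine lookup URL for `url` near `timestamp`.

    `timestamp` is the YYYYMMDDHHMMSS string from the dataset's date column.
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"


# Made-up example values, for illustration only.
print(wayback_url("https://example.com/a.jpg", "20190614123000"))
# https://web.archive.org/web/20190614123000/https://example.com/a.jpg
```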
Note that all annotations are entirely machine-generated. Labels are tags applied by Google's visual recognition algorithms based on the visual content of the image. Web Entities amount to web-scale crowdsourced tags: the Vision API performs a reverse image search across the open web for the given image and lists the most common objects and activities mentioned in the textual captions of the matching images.
Download the datasets below:
- CSV Dataset. (500MB compressed / 1.2GB uncompressed)
- JSON Dataset. (34GB compressed / 200GB uncompressed)
We're tremendously excited to see what you're able to do with this incredible new dataset!
TECHNICAL DETAILS
Both extracts were constructed from the master VGKG in BigQuery. To build the CSV export, the following query was used:
SELECT DATE, ImageURL, DocumentIdentifier
FROM (
  SELECT DATE, ImageURL, DocumentIdentifier,
         ROW_NUMBER() OVER (PARTITION BY ImageURL) rowid
  FROM `gdelt-bq.gdeltv2.cloudvision_partitioned`
  WHERE (
    LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%protest%'
    OR LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%demonstration%'
    OR LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%rebellion%'
  )
)
WHERE rowid = 1
The VGKG uses a 24-hour rolling URL cache to prevent images that appear in multiple articles from being recrawled. However, sometimes images are reused by an outlet after a longer period of time, such as weeks, months or even years later, meaning a small number of images can appear multiple times in the VGKG. The ROW_NUMBER() filter above eliminates these duplicates.
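For readers working outside BigQuery, the effect of the ROW_NUMBER() filter can be approximated in Python by keeping one record per ImageURL. This is only a sketch under stated assumptions: the dedupe_by_image_url helper and the sample rows are hypothetical, and it keeps whichever record is encountered first, whereas the query specifies no ordering within each partition and so leaves the choice of surviving row to BigQuery.

```python
from typing import Iterable, Iterator


def dedupe_by_image_url(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first record encountered for each ImageURL."""
    seen = set()
    for row in rows:
        key = row["ImageURL"]
        if key in seen:
            continue  # a later reuse of the same image
        seen.add(key)
        yield row


# Made-up rows: the same image reused by an outlet months later.
rows = [
    {"DATE": "20190614123000", "ImageURL": "https://example.com/a.jpg"},
    {"DATE": "20190614124500", "ImageURL": "https://example.com/b.jpg"},
    {"DATE": "20191102080000", "ImageURL": "https://example.com/a.jpg"},
]
unique = list(dedupe_by_image_url(rows))
print(len(unique))  # 2
```

Holding every URL in a set is fine for a few million rows; a much larger collection would need an on-disk or database-backed approach instead.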
When exporting large tables to GCS, BigQuery shards the output into multiple files, each with its own header row, so "tail +2" ("tail -n +2" in current syntax) is applied to each output file prior to concatenation to remove the header rows.
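The header-stripping step can also be sketched in Python for anyone reproducing the export without shell tools. This is an illustrative equivalent and not the procedure we ran; the concat_without_headers helper, the shard file names and the sample contents are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def concat_without_headers(shard_paths, out_path):
    """Append every shard to out_path, skipping the first line of each."""
    with open(out_path, "wb") as out:
        for path in shard_paths:
            with open(path, "rb") as shard:
                shard.readline()                # discard this shard's header row
                shutil.copyfileobj(shard, out)  # stream the remaining bytes


# Demo on two tiny made-up shards written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shards = []
    for i, body in enumerate(["a,1\nb,2\n", "c,3\n"]):
        path = Path(tmp) / f"shard-{i:03d}.csv"
        path.write_text("col1,col2\n" + body)
        shards.append(path)
    merged_path = Path(tmp) / "merged.csv"
    concat_without_headers(sorted(shards), merged_path)
    merged = merged_path.read_text()

print(merged)
```

The sketch assumes every shard ends with a newline; otherwise the last row of one shard would run into the first row of the next.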
Similarly, for the JSON export, the following query was used. It is identical to the CSV query except that it also selects the RawJSON field:
SELECT DATE, ImageURL, DocumentIdentifier, RawJSON
FROM (
  SELECT DATE, ImageURL, DocumentIdentifier, RawJSON,
         ROW_NUMBER() OVER (PARTITION BY ImageURL) rowid
  FROM `gdelt-bq.gdeltv2.cloudvision_partitioned`
  WHERE (
    LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%protest%'
    OR LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%demonstration%'
    OR LOWER(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) LIKE '%rebellion%'
  )
)
WHERE rowid = 1
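The selection rule in both queries is a case-insensitive substring match against the serialized labelAnnotations block of the first response. For researchers who want to re-apply or narrow that rule on the JSON extract, a rough Python approximation might look like the following. The matches_protest_filter helper and the TERMS tuple name are hypothetical, and the function expects the already-parsed contents of the RawJSON field.

```python
import json

# The three substrings used in the export queries.
TERMS = ("protest", "demonstration", "rebellion")


def matches_protest_filter(raw_json: dict, terms=TERMS) -> bool:
    """Approximate the queries' LIKE filter on responses[0].labelAnnotations."""
    responses = raw_json.get("responses") or [{}]
    labels = responses[0].get("labelAnnotations")
    if labels is None:
        return False  # no label block, so nothing to match against
    text = json.dumps(labels).lower()  # serialize, then lowercase as LOWER() does
    return any(term in text for term in terms)


# Made-up examples, for illustration only.
protest = {"responses": [{"labelAnnotations": [{"description": "Protest"}]}]}
crowd = {"responses": [{"labelAnnotations": [{"description": "Crowd"}]}]}
print(matches_protest_filter(protest), matches_protest_filter(crowd))  # True False
```

Passing a shorter tuple, such as terms=("rebellion",), narrows the extract to a single label family.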