Visual Global Knowledge Graph (VGKG) February 2016 Snapshot Dataset

Given the enormous interest we've received in the new GDELT Visual Global Knowledge Graph (VGKG), which processes an ever-growing fraction of all global news imagery monitored by GDELT through Google's "deep learning" Cloud Vision API, we've boxed up the 14.6 million images we've processed to date and made them available as a downloadable CSV file for research. Of those images, the most recent 3.9 million include the complete raw JSON output of the Cloud Vision API, which carries a wealth of additional characteristics about each image, such as facial landmarks and the image's color profile.

This is a one-time snapshot of the VGKG archive as of noon EST on February 21, 2016, designed to make it easy for researchers to experiment with the data using traditional statistical tools. Remember that the live VGKG is updated every 15 minutes and is available both as CSV files and through Google BigQuery.

NOTE that the documentation PDF below refers to VGKG 1.0 Alpha, while the current release is VGKG 1.0. The format is nearly identical, with the addition of three new columns at the end: LangHints (the language code hints provided to the Cloud Vision API OCR engine), WidthHeight (pixel dimensions of the image, downscaled to <1500×1500 pixels), and RawJSON (the raw JSON output of the Cloud Vision API).
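
For readers who want to get at the three new columns quickly, here is a minimal Python sketch. It relies only on what is stated above: LangHints, WidthHeight, and RawJSON are the last three columns of each record, and RawJSON is present only for the most recent images. Everything else is an assumption made for illustration: the tab delimiter, the absence of quoting, and the helper names split_trailing_columns and iter_records are not taken from the VGKG documentation, so please check them against the actual file before relying on them.

```python
import io
import json


def split_trailing_columns(line, delimiter="\t"):
    """Return the last three fields of one record as a dict.

    ASSUMPTION: fields are separated by `delimiter` (tab by default) and are
    not quoted. Adjust this if the real file is laid out differently.
    """
    fields = line.rstrip("\r\n").split(delimiter)
    if len(fields) < 3:
        raise ValueError("expected at least three columns")
    lang_hints, width_height, raw_json = fields[-3:]
    # RawJSON is empty for older images, so only decode it when present.
    parsed = json.loads(raw_json) if raw_json.strip() else None
    return {
        "LangHints": lang_hints,
        "WidthHeight": width_height,  # left as a string; its format is not assumed
        "RawJSON": parsed,
    }


def iter_records(handle, delimiter="\t"):
    """Stream records from an open text file, skipping blank lines."""
    for line in handle:
        if line.strip():
            yield split_trailing_columns(line, delimiter)


# Usage with a made-up two-record sample (not real VGKG data).
sample = io.StringIO(
    'col1\tcol2\ten\t640,480\t{"example": [1, 2]}\n'
    "col1\tcol2\t\t\t\n"
)
for record in iter_records(sample):
    print(record["LangHints"], record["WidthHeight"], record["RawJSON"] is not None)
```

Reading line by line keeps memory use flat, which matters for a file covering 14.6 million images. To run it against the real download, pass an opened file handle in place of the in-memory sample.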


Remember that the VGKG is considered an "Early Release" dataset. The use of deep learning algorithms to catalog arbitrary images from almost every corner of the planet on almost every topic imaginable is still highly experimental, so you will almost certainly encounter some level of error in the categorizations and other information computed about each image. All tags are applied 100% automatically with NO HUMAN INTERVENTION, and mistaken categorizations or other tags represent computer algorithm errors, NOT editorial statements. A small additional degree of error may also be introduced by the image caching and fingerprinting that GDELT uses to recognize duplicate images and reduce the total volume of material it must submit to the Cloud Vision API.