The GDELT Project

A Compilation Of All EXIF Tags In The Visual Global Knowledge Graph In 2020

Since its launch in December 2015, the Visual Global Knowledge Graph (VGKG) has analyzed more than 600 million images from online news media around the world in 65 languages through Google's Cloud Vision API. Since August 2016, every image has also been scanned using Phil Harvey's Image::ExifTool implementation of his ExifTool library to extract all available embedded metadata; since November 2016 this output has included the full tag specifier with group family. Over time, most major CMS platforms used by news outlets, including WordPress, have evolved to preserve embedded metadata through the entire image pipeline, including resizing, overlays and other adjustments, meaning that an ever-growing fraction of news imagery includes recoverable metadata of some kind.

How might we use this massive dataset to understand the kinds of embedded metadata most commonly found in news images around the world and the hidden world of visual information available?

From January 1, 2020 through November 8, 2020, the VGKG analyzed 43,333,096 images, totaling 2.2TB of JSON annotations (this includes the Cloud Vision API's output, the EXIF output and all other annotation blocks). All VGKG records include an EXIF section produced by ExifTool, even if it contains only basic image properties such as height and width. Thus, not all of the tags below originated inside the image itself: some are "computed" tags created by ExifTool. Using a single SQL query in BigQuery, we produced a master histogram of every tag extracted by ExifTool from all 43.3 million images seen by GDELT this year, along with the number of images in which each tag was seen. The final histogram can be downloaded below in CSV format, totaling 194,393 unique tags:

We're excited to see what you might be able to do with this incredibly rich dataset of in-the-wild global tag distribution!

TECHNICAL DETAILS

Computing the histogram above was trivial using BigQuery. Each VGKG record includes a number of parsed fields, with the full raw JSON output of the Cloud Vision API stored in the last field. To the API's output we append several additional JSON blocks, including an "EXIF" block containing the full output of ExifTool. Thus, using BigQuery's JSON_EXTRACT function we can trivially extract this block. We then use a TEMP FUNCTION to apply a custom bit of JavaScript that parses that block and converts the results into an array of objects (STRUCTs in BQ parlance), which are passed back to our SQL where we can unnest and display them. Here's an example that lists each image and the corresponding EXIF metadata found within, flattened to return each metadata tag as its own row.

WARNING: expanding the date selectors in the queries below to all of 2020 will consume 2.2TB of your query quota.

CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRUCT<name STRING,value STRING>>
LANGUAGE js AS """
// Parse the EXIF block once, then emit one {name, value} pair per tag.
var obj = JSON.parse(json);
var result = [];
for (var i in obj) result.push({name: i, value: obj[i]});
return result;
""";
select ImageURL, exifrec.name, exifrec.value from (
SELECT ImageURL, json2array(JSON_EXTRACT(RawJSON, "$.EXIF")) exifrecs FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE DATE(_PARTITIONTIME) = "2020-11-08" and RawJSON like '%EXIF%' LIMIT 10
), unnest (exifrecs) exifrec
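
To see what the JavaScript inside the TEMP FUNCTION does on its own, here is a minimal sketch that runs the same logic outside BigQuery (for example in Node.js). The sample EXIF block, including its tag names and values, is hypothetical and only meant to illustrate the shape of the data, not actual VGKG output.

```javascript
// Standalone copy of the UDF body so it can be run outside BigQuery.
function json2array(json) {
  const obj = JSON.parse(json);
  const result = [];
  // One {name, value} object per top-level tag in the EXIF block.
  for (const name in obj) {
    result.push({ name: name, value: obj[name] });
  }
  return result;
}

// Hypothetical EXIF block: tag names and values are illustrative assumptions.
const sampleExif = JSON.stringify({
  "File:ImageWidth": 1200,
  "File:ImageHeight": 800,
  "EXIF:Make": "ExampleCam"
});

const rows = json2array(sampleExif);
console.log(rows);
```

Each element of the returned array corresponds to one row after the UNNEST in the query above.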

Of course, we can trivially modify the query above to instead compute a simple histogram, which is what we used to generate the CSV file above.

CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRUCT<name STRING,value STRING>>
LANGUAGE js AS """
// Parse the EXIF block once, then emit one {name, value} pair per tag.
var obj = JSON.parse(json);
var result = [];
for (var i in obj) result.push({name: i, value: obj[i]});
return result;
""";
select exifrec.name, count(1) count from (
SELECT json2array(JSON_EXTRACT(RawJSON, "$.EXIF")) exifrecs FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE DATE(_PARTITIONTIME) = "2020-11-08" and RawJSON like '%EXIF%'
), unnest (exifrecs) exifrec group by exifrec.name order by count desc
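
For small samples, the same histogram can be approximated locally. The following is a rough sketch, assuming each input string is the EXIF block of one image; the function name tagHistogram and the sample blocks are hypothetical and not part of the VGKG pipeline.

```javascript
// Count how many EXIF blocks contain each tag name, mirroring
// "group by exifrec.name" in the SQL query.
function tagHistogram(exifBlocks) {
  const counts = new Map();
  for (const json of exifBlocks) {
    const obj = JSON.parse(json);
    for (const name in obj) {
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  // Sort by count, highest first, like "order by count desc".
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical EXIF blocks for three images (illustrative values only).
const sampleBlocks = [
  '{"File:ImageWidth": 1200, "File:ImageHeight": 800, "EXIF:Make": "ExampleCam"}',
  '{"File:ImageWidth": 640, "File:ImageHeight": 480}',
  '{"File:ImageWidth": 300}'
];

const histogram = tagHistogram(sampleBlocks);
console.log(histogram);
```

Because a tag name appears at most once per parsed block, each count is the number of images carrying that tag, which matches what the count(1) in the query measures.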

That's all there is to it! Once again, BigQuery makes it trivial to rapidly analyze massive datasets!