VGKG: Master Ranked List Of 16,500 Visual Labels From Half A Billion Global News Images 2017-2019

The more than 19 million Web Entities identified by the Google Cloud Vision API's "Web Detection" capability across the half billion images in the VGKG dataset were compiled exclusively from the textual captions associated with those images. In essence, the Cloud Vision API performs the equivalent of a reverse image search for each image across the open web and compiles a list of the topics and entities mentioned in the captions that accompany it. These Web Entities do not involve any form of visual assessment of the images; they are based exclusively on what amounts to a crowdsourced look at how a given image is described across the web.

Of course, the Cloud Vision API also supports direct visual assessment of images, in which deep learning image understanding algorithms are brought to bear to yield a set of metadata topic tags describing the objects and activities visually depicted in the image.
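To make the distinction between the two capabilities concrete, here is a minimal sketch of a request body that asks for both caption-derived Web Entities and visually derived labels for one image. The endpoint and the feature names follow the public Cloud Vision REST API as best understood here; the image URL, the helper name `build_annotate_request` and the `max_results` default are illustrative assumptions, and nothing is actually sent over the network.

```python
import json

# Public REST endpoint for image annotation (shown for reference, never called here).
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_uri, max_results=10):
    """Build a request body asking for both labels and Web Entities.

    Hypothetical helper: the name and the max_results default are assumptions.
    """
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [
                    # Direct visual assessment of the pixels.
                    {"type": "LABEL_DETECTION", "maxResults": max_results},
                    # Caption-based entities gathered from the open web.
                    {"type": "WEB_DETECTION", "maxResults": max_results},
                ],
            }
        ]
    }


# Placeholder image URL, purely for illustration.
body = build_annotate_request("https://example.com/news-photo.jpg")
print(json.dumps(body, indent=2))
```

The response to such a request would carry the visual labels under `responses[0].labelAnnotations`, which is the path the query in the technical details reads from.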

The Cloud Vision API has applied 16,528 distinct annotation "labels" to the half billion images in the VGKG. The 10 most common labels are shown below, each with its textual label, its Google-assigned unique ID code (called "entityID" for Web Entities and "MID" for labels) and the number of images, out of the total 542,855,525, that were assigned the label. Note that while GDELT does not actively focus on sports coverage, the centrality of sports to the global news ecosystem can be seen in its prominence below.

In all, there were 4,313,022,055 label assignments across the 542,855,525 images, an average of roughly 8 labels per image (though in reality some images have many labels and some have very few).
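The average can be checked with a quick calculation from the two totals quoted above. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Totals quoted in the text above.
TOTAL_LABEL_ASSIGNMENTS = 4_313_022_055
TOTAL_IMAGES = 542_855_525

# Mean number of labels per image (comes out just under 8).
average_labels = TOTAL_LABEL_ASSIGNMENTS / TOTAL_IMAGES
print(f"{average_labels:.2f} labels per image")
```

Since this is a mean over all images, it says nothing about the spread: individual images may carry far more or far fewer labels.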

Entity       MID          Count
person       /m/01g317    74915448
vehicle      /m/07yv9     57247409
profession   /m/063km     53140729
product      /m/01jwgf    45280346
font         /m/03gq5hm   34896284
official     /m/035y33    34719756
event        /m/081pkj    33213904
sports       /m/06ntj     30430574
speech       /m/09x0r     29979333
car          /m/0k4j      28759748
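To put these counts in proportion, the sketch below converts each of the top 10 counts into a share of the 542,855,525 images. The figures are copied from the table; the variable names are illustrative. On these numbers, "person" appears on roughly 13.8% of all images.

```python
TOTAL_IMAGES = 542_855_525

# Counts copied from the top 10 table above.
top_labels = {
    "person": 74_915_448,
    "vehicle": 57_247_409,
    "profession": 53_140_729,
    "product": 45_280_346,
    "font": 34_896_284,
    "official": 34_719_756,
    "event": 33_213_904,
    "sports": 30_430_574,
    "speech": 29_979_333,
    "car": 28_759_748,
}

# Fraction of all images that received each label.
shares = {label: count / TOTAL_IMAGES for label, count in top_labels.items()}

for label, share in shares.items():
    print(f"{label:<12}{share:6.1%}")
```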

The complete list of all 16,528 labels is available as a UTF8 CSV file:

Master List Of Labels In The VGKG 2015-2019. (400K)

TECHNICAL DETAILS

Despite having to parse 12.2TB of JSON, the analysis above results from a single SQL query that took just 1.8 minutes to execute. To parse the JSON array in which Label annotations are stored, a temporary function is used, courtesy of Felipe Hoffa.

-- Helper: split a JSON array string into an array of JSON strings, one per element.
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
""";
-- Count label assignments per MID, reporting the most common description for each MID.
SELECT APPROX_TOP_COUNT(entity, 1)[OFFSET(0)].value entity, mid, COUNT(1) Count FROM (
  WITH nested AS (
    -- One row per image that has label annotations.
    SELECT DocumentIdentifier url,
      json2array(JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations")) entities
    FROM `gdelt-bq.gdeltv2.cloudvision_partitioned`
    WHERE JSON_EXTRACT(RawJSON, "$.responses[0].labelAnnotations") IS NOT NULL
  )
  -- One row per (image, label) pair.
  SELECT JSON_EXTRACT_SCALAR(entity, '$.description') entity,
    JSON_EXTRACT_SCALAR(entity, '$.mid') mid
  FROM nested, UNNEST(entities) entity
) WHERE entity IS NOT NULL GROUP BY mid ORDER BY Count DESC
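For readers who want to follow the logic of the query without running it, the sketch below mirrors the same steps locally in Python on a few hand-made records. The sample records are hypothetical. Only the `responses[0].labelAnnotations`, `description` and `mid` field names and the two MIDs are taken from the material above. The function name `count_labels` is illustrative, and the exact most-common description computed here stands in for the approximate `APPROX_TOP_COUNT`.

```python
import json
from collections import Counter, defaultdict

# Hypothetical stand-ins for the RawJSON column, one string per image.
raw_rows = [
    json.dumps({"responses": [{"labelAnnotations": [
        {"mid": "/m/01g317", "description": "person"},
        {"mid": "/m/07yv9", "description": "vehicle"},
    ]}]}),
    json.dumps({"responses": [{"labelAnnotations": [
        {"mid": "/m/01g317", "description": "person"},
    ]}]}),
    json.dumps({"responses": [{}]}),  # image with no label annotations
]


def count_labels(rows):
    """Return (description, mid, count) tuples ranked by count, descending."""
    counts = Counter()            # label assignments per MID
    names = defaultdict(Counter)  # descriptions seen for each MID
    for raw in rows:
        responses = json.loads(raw).get("responses") or [{}]
        annotations = responses[0].get("labelAnnotations")
        if annotations is None:   # same role as the IS NOT NULL filter
            continue
        for annotation in annotations:  # same role as UNNEST(entities)
            description = annotation.get("description")
            mid = annotation.get("mid")
            if description is None:
                continue
            counts[mid] += 1
            names[mid][description] += 1
    # Pair each MID with its most frequently seen description.
    return [
        (names[mid].most_common(1)[0][0], mid, n)
        for mid, n in counts.most_common()
    ]


ranked = count_labels(raw_rows)
for description, mid, n in ranked:
    print(description, mid, n)
```

Grouping by MID rather than by description keeps a label's count in one row even if its textual description varies between images.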