VGKG: Master Ranked List Of 22.5 Million Web Entities From Half A Billion Global News Images 2017-2019

GDELT's Visual Global Knowledge Graph (VGKG) added the output of Google Cloud Vision API's "Web Detection" capability on March 13, 2017, meaning that for the last two and a half years, the more than half billion worldwide news images GDELT has analyzed through the Cloud Vision API have been essentially reverse image searched across the open web, with the resulting most common descriptive entities found in their captions and descriptive text across the web recorded.

Through BigQuery's built-in JSON parsing support, a single SQL query is all that is required to parse this massive repository of descriptive entities to compile a list of the 22.5 million distinct entities that the API returned for these half billion images, taking just 4.7 minutes to process 12.2TB of JSON.

The top 10 most common Web entities are seen below, with their textual label, Google-assigned unique ID code (called "entityID" by the Vision API and "MID" by the Natural Language API) and the number of images out of the total 542,855,525 images that were assigned this tag. All 22,504,682 distinct entities have a MID code, of which 18,335,933 are unique (some entities reflect alternative references to the same underlying concept/entity).

Donald Trump's omnipresence in the global media over the past three years is clearly visible in his appearance in the list below as the #10-most-common topic found in the captions of worldwide online news imagery.

Note that Web Entities tags are based on textual captions NOT image recognition. Thus, an image labeled as "Donald Trump" might depict him or might depict events in Syria, with the textual caption describing those events as relating to his policy actions. Thus, it is important to recognize that Web Entities are very different from the Cloud Vision API's separate visual recognition abilities – these 19 million entities are applied based exclusively on finding other instances of a given image on the open web and counting up the most common entities in their textual captions.

In all, there were a total of 3,277,407,560 Web Entity assignments for the 542,855,525 images, making for an average of 6 entities per image (though in reality some images have many entities and some have very few).

Entity MID Matching Images
News /m/0zgqq52 27427703
Image /m/0jg24 20535937
Car /m/0k4j 17057479
Vehicle /m/0j9krmw 14041299
Photograph /m/0sqm624 11847238
Public Relations /m/0gh6x4c 10825803
Business /m/0wbkymt 10403526
Product /m/067qr 9500564
Police /m/0w1kxd5 9444988
Donald Trump /m/0np8yls 8905466

The complete list of all 22.5 million entities is available as a UTF8 CSV file:

 

TECHNICAL DETAILS

Despite having to parse 12.2TB of JSON, the analysis above results from a single SQL query that took just 3.4 minutes to execute. To parse the JSON array in which Web Entities are stored, a temporary function is used, courtesy of Felipe Hoffa.

CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
""";
select APPROX_TOP_COUNT(entity, 1)[OFFSET(0)].value entity, mid, count(1) Count from (
WITH nested AS (
SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` where
JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null
) select JSON_EXTRACT_SCALAR(entity,'$.description') entity, JSON_EXTRACT_SCALAR(entity,'$.entityId') mid from nested, unnest(entities) entity
) where entity is not null group by mid