GDELT's Visual Global Knowledge Graph (VGKG) added the output of Google Cloud Vision API's "Web Detection" capability on March 13, 2017, meaning that for the last two and a half years, the more than half billion worldwide news images GDELT has analyzed through the Cloud Vision API have been essentially reverse image searched across the open web, with the resulting most common descriptive entities found in their captions and descriptive text across the web recorded.
Through BigQuery's built-in JSON parsing support, a single SQL query is all that is required to parse this massive repository of descriptive entities to compile a list of the 22.5 million distinct entities that the API returned for these half billion images, taking just 4.7 minutes to process 12.2TB of JSON.
The top 10 most common Web entities are seen below, with their textual label, Google-assigned unique ID code (called "entityID" by the Vision API and "MID" by the Natural Language API) and the number of images out of the total 542,855,525 images that were assigned this tag. All 22,504,682 distinct entities have a MID code, of which 18,335,933 are unique (some entities reflect alternative references to the same underlying concept/entity).
Donald Trump's omnipresence in the global media over the past three years is clearly visible in his appearance in the list below as the #10-most-common topic found in the captions of worldwide online news imagery.
Note that Web Entities tags are based on textual captions NOT image recognition. Thus, an image labeled as "Donald Trump" might depict him or might depict events in Syria, with the textual caption describing those events as relating to his policy actions. Thus, it is important to recognize that Web Entities are very different from the Cloud Vision API's separate visual recognition abilities – these 19 million entities are applied based exclusively on finding other instances of a given image on the open web and counting up the most common entities in their textual captions.
In all, there were a total of 3,277,407,560 Web Entity assignments for the 542,855,525 images, making for an average of 6 entities per image (though in reality some images have many entities and some have very few).
Entity | MID | Matching Images |
News | /m/0zgqq52 | 27427703 |
Image | /m/0jg24 | 20535937 |
Car | /m/0k4j | 17057479 |
Vehicle | /m/0j9krmw | 14041299 |
Photograph | /m/0sqm624 | 11847238 |
Public Relations | /m/0gh6x4c | 10825803 |
Business | /m/0wbkymt | 10403526 |
Product | /m/067qr | 9500564 |
Police | /m/0w1kxd5 | 9444988 |
Donald Trump | /m/0np8yls | 8905466 |
The complete list of all 22.5 million entities is available as a UTF8 CSV file:
- Master List Of Web Entities In The VGKG 2017-2019. (460MB compressed / 840MB uncompressed).
TECHNICAL DETAILS
Despite having to parse 12.2TB of JSON, the analysis above results from a single SQL query that took just 3.4 minutes to execute. To parse the JSON array in which Web Entities are stored, a temporary function is used, courtesy of Felipe Hoffa.
CREATE TEMP FUNCTION json2array(json STRING) RETURNS ARRAY<STRING> LANGUAGE js AS """ return JSON.parse(json).map(x=>JSON.stringify(x)); """; select APPROX_TOP_COUNT(entity, 1)[OFFSET(0)].value entity, mid, count(1) Count from ( WITH nested AS ( SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` where JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null ) select JSON_EXTRACT_SCALAR(entity,'$.description') entity, JSON_EXTRACT_SCALAR(entity,'$.entityId') mid from nested, unnest(entities) entity ) where entity is not null group by mid