A 3.8-Billion-Edge Entity Graph From Half A Billion Global News Images 2017-2019

Last week we released a massive new dataset of the 22.5 million distinct Web Entities identified by Google Cloud Vision API's "Web Detection" capability in the half billion global news images monitored by GDELT from March 2017 (when GDELT first added the Web Detection output from the Cloud Vision API). Unlike traditional image labels, which are assigned through image understanding algorithms, Web Entities are assigned through what amounts to a reverse image search, in which Cloud Vision API searches the open web for other appearances of a given image or images highly similar to it and compiles a list of the top entities mentioned most commonly in those textual captions.

This allows the API to gain a rich understanding of the visual world beyond what any visual understanding algorithm could encode. In this case that means more than 22.5 million distinct entities assigned to those half billion images.

How often do those entities co-occur and which entities appear most commonly with which other entities? While entity graphs often encode hierarchical semantic relationships among entities, the far more interesting question revolves around how those entities end up being related in the real-world. Given that Web Entities reflect the incredibly unique environment of entities co-occurring with entities in the context of image captions, they reflect entities that are related through the visual world, rather than textual. This is an extraordinarily unique form of co-occurrence context that has not historically been available for network analysis.

To explore this further, a single SQL query was all that was required for BigQuery to process the 94,890,972,482 total Web Entity co-occurrences and condense them down to the final set of 3,833,422,503 distinct pairings in just under 2.8 hours. BigQuery's ability to condense down 94 billion co-occurrences into a final network of 3.8 billion distinct edges in just 2.8 hours with a single SQL query is nothing short of extraordinary and reminds us just how powerful of a tool BigQuery is for allowing us to peer into the very soul of global society.

The complete 3.8-billion-edge graph is available for download as a UTF8 CSV file in Gephi's edge file format of "Source,Target,RawCount,Type,Weight" where weight is the RawCount value divided by 94,890,972,482 (the total number of co-occurrences):

The complete graph is also available as a table in BigQuery:

For those interested in a smaller and more manageable graph to start with, we've also created a reduced version of the graph above that includes only edges for which the given pair of entities appeared together in more than 100 images:

The complete graph can trivially answer all kinds of relatedness questions. For example, the query below returns the top 1,500 entities that co-occur most commonly with "Donald Trump":

SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Donald Trump%' or Target like '%Donald Trump%' order by RawCount desc limit 1500

In just 5 seconds it yields a subgraph that can be imported directly into Gephi. Here are the top 15 entries:

Source Target RawCount Type Weight
Donald Trump President of the United States 11002795 Undirected 0.000116
Donald Trump United States of America 7771155 Undirected 0.000082
Donald Trump United States 7096119 Undirected 0.000075
Donald Trump News 6482471 Undirected 0.000068
Donald Trump White House 4485541 Undirected 0.000047
Donald Trump Melania Trump 3922680 Undirected 0.000041
Donald Trump President 3821366 Undirected 0.000040
Donald Trump Republican Party 3365251 Undirected 0.000035
Donald Trump First Lady of the United States 2936931 Undirected 0.000031
Donald Trump Image 2792894 Undirected 0.000029
Donald Trump Presidency of Donald Trump 2640194 Undirected 0.000028
Barack Obama Donald Trump 2365201 Undirected 0.000025
Donald Trump North Korea 2365183 Undirected 0.000025
Donald Trump Politician 2120299 Undirected 0.000022
Democratic Party Donald Trump 1977959 Undirected 0.000021

Similarly, repeating the same query for "Putin" can be done with the following SQL:

SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Putin%' or Target like '%Putin%' order by RawCount desc limit 1500

Which yields a similar Gephi-ready subgraph. The top 15 edges in tabular form are:

Source Target RawCount Type Weight
Russia Vladimir Putin 2328384 Undirected 0.000025
President Vladimir Putin 1279107 Undirected 0.000013
President of Russia Vladimir Putin 1245849 Undirected 0.000013
Donald Trump Vladimir Putin 929372 Undirected 0.000010
News Vladimir Putin 922960 Undirected 0.000010
President of the United States Vladimir Putin 644121 Undirected 0.000007
Politician Vladimir Putin 622201 Undirected 0.000007
United States of America Vladimir Putin 525127 Undirected 0.000006
United States Vladimir Putin 516293 Undirected 0.000005
Moscow Vladimir Putin 459695 Undirected 0.000005
Election Vladimir Putin 372167 Undirected 0.000004
Ukraine Vladimir Putin 326908 Undirected 0.000003
Politics Vladimir Putin 283872 Undirected 0.000003
Syria Vladimir Putin 258650 Undirected 0.000003
Military Vladimir Putin 231685 Undirected 0.000002

We're tremendously excited to see what you're able to do with this graph!

TECHNICAL DETAILS

Despite its enormous size, constructing the graph above took just a single SQL query and just 2.8 hours. A special thanks to Felipe Hoffa for the "json2array" function below used to parse the JSON arrays.

CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";

SELECT Source, Target, Count RawCount, "Undirected" Type FROM (
SELECT a.entity Source, b.entity Target, COUNT(*) as Count
FROM (

WITH nested AS (
SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE
JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null
) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity

) a
JOIN (

WITH nested AS (
SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE
JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null
) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity

) b
ON a.url=b.url
WHERE a.entity<b.entity
GROUP BY 1,2
)