A 3.8-Billion-Edge Entity Graph From Half A Billion Global News Images 2017-2019

Last week we released a massive new dataset of the 22.5 million distinct Web Entities identified by Google Cloud Vision API's "Web Detection" capability in the half billion global news images monitored by GDELT from March 2017 (when GDELT first added the Web Detection output from the Cloud Vision API). Unlike traditional image labels, which are assigned through image understanding algorithms, Web Entities are assigned through what amounts to a reverse image search, in which Cloud Vision API searches the open web for other appearances of a given image or images highly similar to it and compiles a list of the top entities mentioned most commonly in those textual captions.

This allows the API to gain a rich understanding of the visual world beyond what any visual understanding algorithm could encode. In this case that means more than 22.5 million distinct entities assigned to those half billion images.

How often do those entities co-occur and which entities appear most commonly with which other entities? While entity graphs often encode hierarchical semantic relationships among entities, the far more interesting question revolves around how those entities end up being related in the real-world. Given that Web Entities reflect the incredibly unique environment of entities co-occurring with entities in the context of image captions, they reflect entities that are related through the visual world, rather than textual. This is an extraordinarily unique form of co-occurrence context that has not historically been available for network analysis.

To explore this further, a single SQL query was all that was required for BigQuery to process the 94,890,972,482 total Web Entity co-occurrences and condense them down to the final set of 3,833,422,503 distinct pairings in just under 2.8 hours. BigQuery's ability to condense down 94 billion co-occurrences into a final network of 3.8 billion distinct edges in just 2.8 hours with a single SQL query is nothing short of extraordinary and reminds us just how powerful of a tool BigQuery is for allowing us to peer into the very soul of global society.

The complete 3.8-billion-edge graph is available for download as a UTF8 CSV file in Gephi's edge file format of "Source,Target,RawCount,Type,Weight" where weight is the RawCount value divided by 94,890,972,482 (the total number of co-occurrences):

Master 3.8-Billion-Edge Web Entities Co-Occurrence Graph. (70GB compressed / 260GB uncompressed)

The complete graph is also available as a table in BigQuery:

gdelt-bq:gdeltv2.cloudvision_webentitygraph_20191020

For those interested in a smaller and more manageable graph to start with, we've also created a reduced version of the graph above that includes only edges for which the given pair of entities appeared together in more than 100 images:

Reduced 59-Million-Edge Web Entities Co-Occurrence Graph (>100 Edges). (1.4GB compressed / 3.8GB uncompressed)

The complete graph can trivially answer all kinds of relatedness questions. For example, the query below returns the top 1,500 entities that co-occur most commonly with "Donald Trump":

SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Donald Trump%' or Target like '%Donald Trump%' order by RawCount desc limit 1500

In just 5 seconds it yields a subgraph that can be imported directly into Gephi. Here are the top 15 entries:

Source	Target	RawCount	Type	Weight
Donald Trump	President of the United States	11002795	Undirected	0.000116
Donald Trump	United States of America	7771155	Undirected	0.000082
Donald Trump	United States	7096119	Undirected	0.000075
Donald Trump	News	6482471	Undirected	0.000068
Donald Trump	White House	4485541	Undirected	0.000047
Donald Trump	Melania Trump	3922680	Undirected	0.000041
Donald Trump	President	3821366	Undirected	0.000040
Donald Trump	Republican Party	3365251	Undirected	0.000035
Donald Trump	First Lady of the United States	2936931	Undirected	0.000031
Donald Trump	Image	2792894	Undirected	0.000029
Donald Trump	Presidency of Donald Trump	2640194	Undirected	0.000028
Barack Obama	Donald Trump	2365201	Undirected	0.000025
Donald Trump	North Korea	2365183	Undirected	0.000025
Donald Trump	Politician	2120299	Undirected	0.000022
Democratic Party	Donald Trump	1977959	Undirected	0.000021

Similarly, repeating the same query for "Putin" can be done with the following SQL:

SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Putin%' or Target like '%Putin%' order by RawCount desc limit 1500

Which yields a similar Gephi-ready subgraph. The top 15 edges in tabular form are:

Source	Target	RawCount	Type	Weight
Russia	Vladimir Putin	2328384	Undirected	0.000025
President	Vladimir Putin	1279107	Undirected	0.000013
President of Russia	Vladimir Putin	1245849	Undirected	0.000013
Donald Trump	Vladimir Putin	929372	Undirected	0.000010
News	Vladimir Putin	922960	Undirected	0.000010
President of the United States	Vladimir Putin	644121	Undirected	0.000007
Politician	Vladimir Putin	622201	Undirected	0.000007
United States of America	Vladimir Putin	525127	Undirected	0.000006
United States	Vladimir Putin	516293	Undirected	0.000005
Moscow	Vladimir Putin	459695	Undirected	0.000005
Election	Vladimir Putin	372167	Undirected	0.000004
Ukraine	Vladimir Putin	326908	Undirected	0.000003
Politics	Vladimir Putin	283872	Undirected	0.000003
Syria	Vladimir Putin	258650	Undirected	0.000003
Military	Vladimir Putin	231685	Undirected	0.000002

We're tremendously excited to see what you're able to do with this graph!

TECHNICAL DETAILS

Despite its enormous size, constructing the graph above took just a single SQL query and just 2.8 hours. A special thanks to Felipe Hoffa for the "json2array" function below used to parse the JSON arrays.

CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";

SELECT Source, Target, Count RawCount, "Undirected" Type FROM (
SELECT a.entity Source, b.entity Target, COUNT(*) as Count
FROM (

WITH nested AS (
SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE
JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null
) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity

) a
JOIN (

WITH nested AS (
SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE
JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.webEntities") is not null
) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity

) b
ON a.url=b.url
WHERE a.entity<b.entity
GROUP BY 1,2
)

The GDELT Project

A 3.8-Billion-Edge Entity Graph From Half A Billion Global News Images 2017-2019

TECHNICAL DETAILS

Archives