Last week we released a massive new dataset of the 22.5 million distinct Web Entities identified by Google Cloud Vision API's "Web Detection" capability in the half billion global news images monitored by GDELT from March 2017 (when GDELT first added the Web Detection output from the Cloud Vision API). Unlike traditional image labels, which are assigned through image understanding algorithms, Web Entities are assigned through what amounts to a reverse image search, in which Cloud Vision API searches the open web for other appearances of a given image or images highly similar to it and compiles a list of the top entities mentioned most commonly in those textual captions.
This allows the API to gain a rich understanding of the visual world beyond what any visual understanding algorithm could encode. In this case that means more than 22.5 million distinct entities assigned to those half billion images.
How often do those entities co-occur and which entities appear most commonly with which other entities? While entity graphs often encode hierarchical semantic relationships among entities, the far more interesting question revolves around how those entities end up being related in the real-world. Given that Web Entities reflect the incredibly unique environment of entities co-occurring with entities in the context of image captions, they reflect entities that are related through the visual world, rather than textual. This is an extraordinarily unique form of co-occurrence context that has not historically been available for network analysis.
To explore this further, a single SQL query was all that was required for BigQuery to process the 94,890,972,482 total Web Entity co-occurrences and condense them down to the final set of 3,833,422,503 distinct pairings in just under 2.8 hours. BigQuery's ability to condense down 94 billion co-occurrences into a final network of 3.8 billion distinct edges in just 2.8 hours with a single SQL query is nothing short of extraordinary and reminds us just how powerful of a tool BigQuery is for allowing us to peer into the very soul of global society.
The complete 3.8-billion-edge graph is available for download as a UTF8 CSV file in Gephi's edge file format of "Source,Target,RawCount,Type,Weight" where weight is the RawCount value divided by 94,890,972,482 (the total number of co-occurrences):
- Master 3.8-Billion-Edge Web Entities Co-Occurrence Graph. (70GB compressed / 260GB uncompressed)
The complete graph is also available as a table in BigQuery:
For those interested in a smaller and more manageable graph to start with, we've also created a reduced version of the graph above that includes only edges for which the given pair of entities appeared together in more than 100 images:
- Reduced 59-Million-Edge Web Entities Co-Occurrence Graph (>100 Edges). (1.4GB compressed / 3.8GB uncompressed)
The complete graph can trivially answer all kinds of relatedness questions. For example, the query below returns the top 1,500 entities that co-occur most commonly with "Donald Trump":
SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Donald Trump%' or Target like '%Donald Trump%' order by RawCount desc limit 1500
In just 5 seconds it yields a subgraph that can be imported directly into Gephi. Here are the top 15 entries:
|Donald Trump||President of the United States||11002795||Undirected||0.000116|
|Donald Trump||United States of America||7771155||Undirected||0.000082|
|Donald Trump||United States||7096119||Undirected||0.000075|
|Donald Trump||White House||4485541||Undirected||0.000047|
|Donald Trump||Melania Trump||3922680||Undirected||0.000041|
|Donald Trump||Republican Party||3365251||Undirected||0.000035|
|Donald Trump||First Lady of the United States||2936931||Undirected||0.000031|
|Donald Trump||Presidency of Donald Trump||2640194||Undirected||0.000028|
|Barack Obama||Donald Trump||2365201||Undirected||0.000025|
|Donald Trump||North Korea||2365183||Undirected||0.000025|
|Democratic Party||Donald Trump||1977959||Undirected||0.000021|
Similarly, repeating the same query for "Putin" can be done with the following SQL:
SELECT * FROM `gdelt-bq.gdeltv2.cloudvision_webentitygraph_20191020` where Source like '%Putin%' or Target like '%Putin%' order by RawCount desc limit 1500
Which yields a similar Gephi-ready subgraph. The top 15 edges in tabular form are:
|President of Russia||Vladimir Putin||1245849||Undirected||0.000013|
|Donald Trump||Vladimir Putin||929372||Undirected||0.000010|
|President of the United States||Vladimir Putin||644121||Undirected||0.000007|
|United States of America||Vladimir Putin||525127||Undirected||0.000006|
|United States||Vladimir Putin||516293||Undirected||0.000005|
We're tremendously excited to see what you're able to do with this graph!
Despite its enormous size, constructing the graph above took just a single SQL query and just 2.8 hours. A special thanks to Felipe Hoffa for the "json2array" function below used to parse the JSON arrays.
CREATE TEMP FUNCTION json2array(json STRING) RETURNS ARRAY<STRING> LANGUAGE js AS """ return JSON.parse(json).map(x=>JSON.stringify(x)); """; SELECT Source, Target, Count RawCount, "Undirected" Type FROM ( SELECT a.entity Source, b.entity Target, COUNT(*) as Count FROM ( WITH nested AS ( SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses.webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE JSON_EXTRACT(RawJSON, "$.responses.webDetection.webEntities") is not null ) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity ) a JOIN ( WITH nested AS ( SELECT DocumentIdentifier url, json2array(JSON_EXTRACT(RawJSON, "$.responses.webDetection.webEntities")) entities FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE JSON_EXTRACT(RawJSON, "$.responses.webDetection.webEntities") is not null ) select url, JSON_EXTRACT_SCALAR(entity,'$.description') entity from nested, unnest(entities) entity ) b ON a.url=b.url WHERE a.entity<b.entity GROUP BY 1,2 )