Visualizing The Network Of Image Sharing Across The Global News Landscape

Kalev Leetaru

6 years ago

Each image in the Visual Global Knowledge Graph (VGKG) is analyzed by the Google Cloud Vision API's "WebEntities" feature, which performs the equivalent of a reverse image search across the open web to find highly similar images, including identical copies appearing elsewhere on the web. This data can be treated as a graph connecting each VGKG record (which records the first article in which GDELT saw the image) with Cloud Vision's list of up to 200 webpages from the open web that also contained that image. We can use this graph to cluster news outlets and other domains by their shared use of images. Note that the returned list of pages is not necessarily ordered in any particular way and represents only a sample of matching pages, but at scale it nonetheless offers some fascinating insights.
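To make the graph structure concrete, the sketch below turns one annotation into source-to-target domain pairs. It assumes the Cloud Vision response layout referenced by the query in the Technical Details section (`responses[0].webDetection.pagesWithMatchingImages`, with each entry carrying a `url`). The sample record and the `edges_for_image` and `domain_of` helpers are illustrative assumptions. In particular, `domain_of` only strips a leading "www." and is a rough stand-in for BigQuery's `NET.REG_DOMAIN`, which resolves the registered domain.

```python
import json
from urllib.parse import urlparse


def domain_of(url):
    """Rough stand-in for NET.REG_DOMAIN: lowercase hostname minus a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def edges_for_image(document_url, raw_json):
    """Return (source_domain, target_domain) pairs for one VGKG record."""
    responses = json.loads(raw_json).get("responses") or [{}]
    pages = responses[0].get("webDetection", {}).get("pagesWithMatchingImages", [])
    source = domain_of(document_url)
    return [(source, domain_of(page["url"])) for page in pages if page.get("url")]


# Hypothetical record: an article URL plus a trimmed Cloud Vision annotation.
sample_json = json.dumps({
    "responses": [{
        "webDetection": {
            "pagesWithMatchingImages": [
                {"url": "https://www.example.org/world/story.html"},
                {"url": "https://news.example.net/photo/123"},
            ]
        }
    }]
})

pairs = edges_for_image("https://www.example.com/2020/article", sample_json)
print(pairs)
```

Each pair is one observation of an image moving between two domains; the graph emerges only once many such pairs are counted together.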

Using a single SQL query in BigQuery, we analyzed all 44.5M VGKG images from 2020, totaling 2.2TB of JSON annotations. The query constructs a graph connecting the domain of the first news article in which we saw each image to the domains of up to 200 webpages on the open web where Cloud Vision found the same image, and it took just 1 minute 50 seconds to run. We retained only those connections that appeared at least 100 times and that link two different domains, yielding 2,505,176 total graph edges.
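The aggregation step can be illustrated on toy data. The following sketch mirrors the logic of the query in the Technical Details section: count domain pairs, drop pairs whose source and target are the same domain, keep pairs at or above a threshold, and express each edge's weight as its share of all retained counts. The `build_edges` helper, the sample pairs, and the threshold of 2 are assumptions for illustration, not the production query.

```python
from collections import Counter


def build_edges(pairs, min_count=100):
    """Aggregate (source, target) domain pairs into weighted edges."""
    counts = Counter(pairs)
    kept = {
        (src, dst): n
        for (src, dst), n in counts.items()
        if n >= min_count and src != dst  # frequent, non-self edges only
    }
    total = sum(kept.values())
    edges = [
        {"Source": src, "Target": dst, "Count": n, "Type": "Undirected", "Weight": n / total}
        for (src, dst), n in kept.items()
    ]
    # Most frequent edges first; ties broken by name for stable output.
    return sorted(edges, key=lambda e: (-e["Count"], e["Source"], e["Target"]))


# Toy input: each tuple is one (article domain, matching-page domain) observation.
toy_pairs = (
    [("a.example", "b.example")] * 3
    + [("a.example", "c.example")] * 1
    + [("b.example", "b.example")] * 5
    + [("c.example", "a.example")] * 2
)

edges = build_edges(toy_pairs, min_count=2)
for edge in edges:
    print(edge)
```

Note that the weight is normalized over the retained edges only, so raising the threshold changes every edge's weight, not just the set of edges.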

You can download the complete dataset below:

Download Complete 2.5M Edge Graph. (30MB compressed / 150MB uncompressed)

You can also download the Gephi file used to visualize the top 10,000 edges below:

Top 10K Edges Gephi File.
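For readers who would rather select their own subset from the full edge list, a minimal sketch follows. It assumes the download is a CSV with a header row matching the query's output columns (`Source`, `Target`, `Count`, `Type`, `Weight`); the actual file layout may differ, and the `top_edges` and `to_csv` helpers and the inline sample data are hypothetical.

```python
import csv
import io


def top_edges(csv_text, n):
    """Return the n rows with the highest Count from an edge-list CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: int(row["Count"]), reverse=True)
    return rows[:n]


def to_csv(rows, fieldnames):
    """Serialize rows back to CSV text, e.g. for a tool that imports edge tables."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical three-edge sample standing in for the full download.
sample = """Source,Target,Count,Type,Weight
a.example,b.example,120,Undirected,0.2
c.example,a.example,300,Undirected,0.5
b.example,d.example,180,Undirected,0.3
"""

subset = top_edges(sample, n=2)
print(to_csv(subset, ["Source", "Target", "Count", "Type", "Weight"]))
```

With the real file, `csv_text` would come from reading the uncompressed download from disk, and `n=10000` would approximate the subset visualized here.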

The top 10,000 edges were extracted and visualized in Gephi, showing strong clustering:

The image below shows the same graph with labels. Most of the clustering is geographic and linguistic: despite the visual language of images supposedly transcending the language of the written word, shared images still cluster strongly by region and language.

TECHNICAL DETAILS

For those interested in exploring this approach themselves, this is the SQL query used to generate the graph above:

-- Convert Cloud Vision's pagesWithMatchingImages JSON array into an array of (name, url) structs.
CREATE TEMP FUNCTION json2array_url(json STRING)
RETURNS ARRAY<STRUCT<name STRING, url STRING>>
LANGUAGE js AS """
  var obj = JSON.parse(json);
  var result = [];
  for (var i in obj) result.push({name: i, url: obj[i].url});
  return result;
""";
SELECT PageDomain Source, SimilarImagePageDomain Target, Count, "Undirected" Type,
       (Count / SUM(Count) OVER ()) Weight  -- each edge's share of all retained edge counts
FROM (
  SELECT PageDomain, NET.REG_DOMAIN(rec.url) SimilarImagePageDomain, COUNT(1) Count
  FROM (
    SELECT NET.REG_DOMAIN(DocumentIdentifier) PageDomain,
           json2array_url(JSON_EXTRACT(RawJSON, "$.responses[0].webDetection.pagesWithMatchingImages")) recs
    FROM `gdelt-bq.gdeltv2.cloudvision_partitioned`
    WHERE DATE(_PARTITIONTIME) >= "2020-01-01"
  ), UNNEST(recs) rec
  GROUP BY PageDomain, SimilarImagePageDomain
  HAVING Count >= 100 AND PageDomain != SimilarImagePageDomain  -- keep frequent, non-self edges
)
ORDER BY Count DESC

We're excited to see what kinds of new lines of inquiry this analysis inspires!