Building A "You Are Here" Map Of The Global Media System Through Outlink Graphing

What if we could build a "you are here" map of the global media system by constructing an enormous graph over all of the outlinks from each news article published this year around the world seen by GDELT? In essence, we would take all of the articles published by each outlet, compile all of the links from those articles to other websites and compile a list of the top domains most commonly linked to.

This includes a mix of citation and commentary links but overall, if a site is repeatedly linked to by a news outlet, there is an interesting connection there to be explored in more detail. It is important to remember that not all news articles contain links to other websites and that many outlets, especially wire services and specialty outlets don't always include links in their articles, while outlinking is not as common in the media of some parts of the world. In all, around 71 million links are considered here, with around 38,000 news outlets having sufficient outlinking to warrant inclusion in this list.

By looking at the kinds of sites that most commonly link to a given outlet we can start to cluster outlets together based not on arbitrary assessments of "similar" news but on how the outlets themselves view each other and who they cite most often. At a macro level, the structure of this linking behavior allows us to begin to understand the macro-level media landscape through the eyes of the media itself.

You can download the final CSV dataset here, which includes a row for each news outlet, with the outlet in the first column and a comma-separated list of top 50 domains it links to in the second column.

TECHNICAL DETAILS

The final query used to construct the graph can be seen below, taking just 9.3 seconds to construct the final output:

SELECT fromsite, STRING_AGG(tosite ORDER BY totlinks DESC LIMIT 50) topoutlinkdomains FROM (
select fromsite,tosite,count(distinct day) numdays, sum(cnt) totlinks from (
select fromsite, tosite, day, count(1) cnt from (
select day, fromsite, NET.REG_DOMAIN(link) tosite from (
SELECT SUBSTR(CAST(DATE as string), 0, 8) day, SourceCommonName fromsite, SPLIT(REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>'), ';') tolinks FROM `gdelt-bq.gdeltv2.gkg_partitioned` where DATE(_PARTITIONTIME) >= '2020-01-01'
), unnest(tolinks) as link
) group by fromsite,tosite,day having fromsite != tosite
) group by fromsite,tosite 
) group by fromsite