A First Look At Three Weeks Of The Global Frontpage Graph (GFG) Through BigQuery

The newest addition to the GDELT family, the GDELT 3.0 Global Frontpage Graph (GFG), has been running every hour for three weeks now, generating enough historical data for a first cursory exploration of the kinds of insights we can garner from the 5.5 billion links it has recorded! The full dataset is now available as a partitioned table in Google's BigQuery platform as gdelt-bq:gdeltv2.gfg_partitioned, allowing us to rapidly explore what life on the world's news homepages looks like in March 2018!

In all, from March 2, 2018 to the end of March 23, 2018, the GFG recorded 46,048,252 unique hyperlinks appearing a total of 5,468,011,467 times.

Of course, one of the most basic questions is how long a typical link lasts on a news homepage. The answer is that 33.3% of unique links appear for less than an hour, 50% last for under 5 hours, 59.8% for under 12 hours and 72.9% for under 24 hours. By 48 hours 80.9% of links have vanished, rising to 84.3% of links within 72 hours of first appearance. Just over 90% of links are gone within 10 days.
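Cumulative figures like these can be derived from a histogram of link lifespans (hours on the homepage mapped to number of unique links). The following is a minimal Python sketch of that arithmetic. The function name and the toy histogram are hypothetical and are not real GFG counts, and how a count of hourly snapshots maps onto phrases like "less than an hour" is an assumption.

```python
from itertools import accumulate

def cumulative_percent_gone(hist):
    """Map each lifespan h (in hours) to the percent of links lasting at most h hours.

    `hist` maps lifespan in hours -> number of unique links with that lifespan.
    """
    total = sum(hist.values())
    hours = sorted(hist)
    running = accumulate(hist[h] for h in hours)  # running count of links seen so far
    return {h: 100.0 * seen / total for h, seen in zip(hours, running)}

# Hypothetical toy histogram, NOT real GFG counts.
toy_hist = {1: 40, 2: 10, 5: 10, 24: 20, 48: 10, 480: 10}
curve = cumulative_percent_gone(toy_hist)
for h in sorted(curve):
    print(h, curve[h])
```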

The graph below shows the life expectancy curve for URLs on the 50,000 global news homepages monitored by the GDELT Global Frontpage Graph. The X axis is hours from 1 to 480 and the Y axis is the number of URLs that appeared for that many hours between first appearing on the homepage and disappearing from it. In other words, it shows how many URLs lasted for a single hour, 2 hours, 3 hours, 4 hours, 5 hours, etc., through the full 480 hours in the time period. It is immediately clear that the overwhelming majority of URLs last just a few hours, with 24 hours marking the next major falloff, then 48 hours, and a small tail at the end representing template URLs that rarely change.

One of the questions we've been exploring is whether we should exclude "template" links that are always present on the homepage and never change (like the navigation, header and footer sections). Many of you have told us that these links matter to you because you want to observe changes in those sections. More to the point, however, it turns out that just 3.6% of unique links lasted for the entire three weeks, with a total of around 8% of links appearing for the majority of the three weeks (allowing for a few intermittent outages of a given site). This means that such rarely changing template links account for a very small percentage of all the unique links recorded.

The high percentage of very short lived links (33.3% of links lasting less than an hour and half of links lasting less than 5 hours) suggests that web archiving efforts must recrawl news homepages at a rapid rate and that even hourly crawls are likely missing content.


EXAMPLE QUERIES

The complete Global Frontpage Graph is available as both tab delimited files and a BigQuery table. The BigQuery table, gdelt-bq:gdeltv2.gfg_partitioned, lets you very rapidly perform all sorts of explorations of the world's news homepages. Some example queries to get you started appear below. NOTE that the queries below use a _PARTITIONTIME filter to restrict each query to a single day of data. This reduces the amount of data processed and helps you avoid exceeding your monthly free BigQuery quota. Widen the date range to cover more of the dataset.
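If you generate queries programmatically, the single-day partition filter can be built from a date rather than typed by hand. The sketch below is one way to do that in Python. The helper name partition_day_filter is hypothetical, and it only produces the half-open range text used in the example queries; it does not contact BigQuery.

```python
from datetime import date, timedelta

def partition_day_filter(day):
    """Return a WHERE-clause fragment that selects exactly one daily partition.

    The range is half-open: midnight of `day` up to, but not including,
    midnight of the following day.
    """
    start = day.strftime("%Y-%m-%d 00:00:00")
    end = (day + timedelta(days=1)).strftime("%Y-%m-%d 00:00:00")
    return f'_PARTITIONTIME >= "{start}" AND _PARTITIONTIME < "{end}"'

print(partition_day_filter(date(2018, 3, 2)))
```

Using timedelta for the end of the range keeps month and year boundaries correct (March 31 rolls over to April 1, for example).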

Count Total Links

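This reports the total number of links recorded by the GFG. As written, the partition filter limits the count to a single day (March 2, 2018); widen or remove the filter to count all links recorded to date.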

SELECT count(1) FROM [gdelt-bq:gdeltv2.gfg_partitioned] where _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00"

Count Links Per Hour

This counts the total number of links recorded by hour.

SELECT DATE, count(1) FROM [gdelt-bq:gdeltv2.gfg_partitioned] where _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" group by DATE order by DATE desc

Count Distinct Links

This counts the total number of unique links. NOTE that this query requires switching to BigQuery Standard SQL instead of Legacy SQL and can consume a very large amount of quota if run for long time periods (338GB to process 20 days of data).

SELECT COUNT(DISTINCT ToLinkURL) FROM `gdelt-bq.gdeltv2.gfg_partitioned` WHERE _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" 

Get All Facebook Links

This returns every distinct pairing of homepage and outbound link in which the link URL contains "facebook.com/".

SELECT FromFrontPageURL, ToLinkURL FROM [gdelt-bq:gdeltv2.gfg_partitioned] where ToLinkURL like '%facebook.com/%' and _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" group by FromFrontPageURL, ToLinkURL order by ToLinkURL

Percent Of CNN's Links That Mention "Trump"

This returns the percent of CNN's homepage links that contain the keyword "trump" somewhere in the link text. Note that this requires the literal string "trump" to appear somewhere in the first 100 characters of the link text, so a link about the "white house" or "administration" won't be flagged. Counting the total discussion about President Trump would require a more extensive list of keywords, along with translations of those keywords into all of the languages of interest, since the link text is recorded as-is without machine translation.

select DATE, tot, trump, trump/tot*100 perc from (
select DATE, count(1) tot , sum(LinkText like '%trump%') trump from [gdelt-bq:gdeltv2.gfg_partitioned] where
 FromFrontPageURL like '%cnn.com%' and _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" group by DATE order by DATE ignore case
) order by DATE desc
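To illustrate the point about needing a broader keyword list, here is a minimal Python sketch of the same per-hour percentage computed locally with several keywords at once. The function name, the integer hour keys and the toy rows are all hypothetical and are not real GFG data. Matching is case-insensitive, mirroring the "ignore case" clause in the query.

```python
from collections import defaultdict

def percent_mentioning(rows, keywords):
    """rows: iterable of (date_hour, link_text) pairs.

    Returns {date_hour: percent of that hour's links whose text contains
    at least one keyword, compared case-insensitively}.
    """
    needles = [k.lower() for k in keywords]
    totals = defaultdict(int)
    hits = defaultdict(int)
    for date_hour, link_text in rows:
        totals[date_hour] += 1
        text = link_text.lower()
        if any(n in text for n in needles):
            hits[date_hour] += 1
    return {h: 100.0 * hits[h] / totals[h] for h in totals}

# Hypothetical toy rows, NOT real GFG records.
toy_rows = [
    (0, "Trump signs order"),
    (0, "White House responds"),
    (0, "Weather update"),
    (0, "Sports scores"),
    (1, "Markets rally"),
    (1, "TRUMP tweets again"),
]
print(percent_mentioning(toy_rows, ["trump"]))
print(percent_mentioning(toy_rows, ["trump", "white house"]))
```

With the single keyword, the "White House responds" link is missed; adding "white house" to the list picks it up.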

Percent Of CNN's Links That Mention "Trump" In The Top Half Of The Page

This is the same query as above, but uses the LinkPercentMaxID field to limit the search to just the first half of links on the homepage (typically the first half of the page) rather than all links on the page.

select DATE, tot, trump, trump/tot*100 perc from (
select DATE, count(1) tot , sum(LinkText like '%trump%') trump from [gdelt-bq:gdeltv2.gfg_partitioned] where
 FromFrontPageURL like '%cnn.com%' and LinkPercentMaxID < 50 and _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" group by DATE order by DATE ignore case
) order by DATE desc

Counting How Many Links Survive Each Hour

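This counts, for each unique link, the number of distinct hourly snapshots in which it appeared, then tallies how many links appeared for each number of hours. The query below is limited to a single day, but when run over the entire 5.5 billion record dataset it takes just 96 seconds and consumes 366GB.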

SELECT numhours,count(1) FROM (
select ToLinkURL, count(1) numhours from (
SELECT ToLinkURL,DATE FROM [gdelt-bq:gdeltv2.gfg_partitioned] WHERE _PARTITIONTIME >= "2018-03-02 00:00:00" AND _PARTITIONTIME < "2018-03-03 00:00:00" group by ToLinkURL,DATE
) group by ToLinkURL
) group by numhours order by numhours asc
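The three nested levels of grouping in this query can be easier to follow on a small example. The Python sketch below mirrors each level on a handful of made-up (URL, hour) rows. The function name and the toy data are hypothetical, and this is only a local illustration of the logic, not a replacement for running the query in BigQuery.

```python
from collections import Counter

def lifespan_histogram(observations):
    """observations: iterable of (url, date_hour) rows, possibly with duplicates.

    Returns {numhours: number of URLs seen in exactly that many distinct hours},
    ordered by numhours ascending.
    """
    # Innermost query: GROUP BY ToLinkURL, DATE collapses duplicate rows.
    distinct = set(observations)
    # Middle query: GROUP BY ToLinkURL counts distinct hours per URL.
    hours_per_url = Counter(url for url, _ in distinct)
    # Outer query: GROUP BY numhours counts URLs per lifespan.
    return dict(sorted(Counter(hours_per_url.values()).items()))

# Hypothetical toy rows, NOT real GFG records.
toy_obs = [
    ("a", 1), ("a", 1), ("a", 2),   # "a" is linked twice in hour 1, once in hour 2
    ("b", 1),
    ("c", 1), ("c", 2), ("c", 3),
    ("d", 2),
]
print(lifespan_histogram(toy_obs))
```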


Hopefully these examples have given you some great ideas for how to explore this incredibly powerful new dataset!