Shortly after the launch of the GDELT Global Frontpage Graph (GFG) last March, we explored the average lifespan of a link on news website homepages worldwide using three weeks of data. Now that the GFG encompasses 10 months and 76 billion links, we've decided to rerun that analysis on this vastly larger dataset. With just a single line of SQL, Google BigQuery analyzed the entire 10TB dataset in 549 seconds. Because its columnar storage format meant it only had to read the 4.68TB of URL data, it achieved a speed of 138 million URLs per second and 8.5GB per second in processing.
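As a quick sanity check on those throughput figures, the arithmetic can be sketched as follows. This is only a back-of-the-envelope calculation: it assumes decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes) and treats the 76 billion links as the number of URLs scanned. The variable names are illustrative.

```python
# Figures quoted in the text above.
TOTAL_LINKS = 76e9        # 76 billion links
BYTES_SCANNED = 4.68e12   # 4.68TB, assuming decimal terabytes
ELAPSED_SECONDS = 549

# Throughput = volume / elapsed time.
urls_per_second = TOTAL_LINKS / ELAPSED_SECONDS
gb_per_second = BYTES_SCANNED / ELAPSED_SECONDS / 1e9  # assuming decimal gigabytes

print(f"{urls_per_second / 1e6:.0f} million URLs per second")
print(f"{gb_per_second:.1f} GB per second")
```

Under those assumptions the results match the rounded figures quoted above.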
Over the last 10 months the GFG has recorded 76 billion outlinks to 488 million distinct URLs from across 50,000 news outlet homepages worldwide in 65 languages. In all, 25.7% of those links appeared on the homepage for less than an hour, 36.9% for less than two hours, 41.2% for less than three hours, 45.6% for less than four hours, 47.6% for less than five hours and 50.5% for less than six hours. Twelve hours after appearing, 58.7% of links were gone and 69.8% were gone within 24 hours. Within two days, 79.6% of links were gone, 84.6% were gone in three days, 87.6% in four days, 89.5% in five days, 90.9% in six days and 92.1% lasted less than a week. Within a month 97.7% of links were gone, rising to 98.6% within two months and 98.9% within three months.
Just 1.11% of all links lasted longer than three months. These are typically standing template links like About pages, navigation pages and so on. When we first launched the GFG, one of the questions we faced was whether we should make a massive investment in a large caching infrastructure that could track every link we saw over time and automatically filter out these kinds of template links once they had appeared for a certain number of days. However, given how few links fall into this final class, and how important they are for template analyses, cross-link network generation, social media account tracking and so on, it is clear that including them is worth it.
The timeline below shows the life expectancy curve of homepage links across the 50,000 sites monitored by the GDELT Global Frontpage Graph at an hourly resolution. The X axis is the number of hours a link lasted on the homepage after it first appeared and the Y axis shows how many links lasted that long. The sharp exponential falloff shows just how rapidly news homepages change throughout the day. Links lasting longer than three months are combined into a single bucket at the end of the graph, which accounts for the burst there.
The graph below zooms into the first 168 hours (7 days), showing how rapidly lifespans level off.
The graph below shows the same timeline, but instead of plotting the raw number of links by hourly lifespan, it shows the cumulative percentage of links that have disappeared by day. As noted earlier, by the end of the first day after a link has been posted to a homepage, 69.8% of those links have been replaced.
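The step from raw counts to a cumulative percentage can be sketched as follows. This is a minimal illustration, not the code used to produce the graphs. It assumes the histogram is a list of (numhours, link count) pairs like those the query in the technical details returns, and both the function name and the toy numbers are hypothetical.

```python
from itertools import accumulate


def cumulative_percent(histogram):
    """Turn (numhours, link_count) pairs into (numhours, cumulative %) pairs.

    Each output percentage is the share of all links whose lifespan
    is at most that many hours.
    """
    rows = sorted(histogram)
    total = sum(count for _, count in rows)
    running = accumulate(count for _, count in rows)
    return [(hours, 100.0 * seen / total) for (hours, _), seen in zip(rows, running)]


# Toy histogram (made-up numbers, not GFG data).
toy_histogram = [(1, 50), (2, 25), (3, 15), (24, 10)]
result = cumulative_percent(toy_histogram)
print(result)
```

Summing the hourly rows into 24-hour buckets before calling the function would give the same curve at a daily resolution.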
Zooming into the first 168 hours (7 days), the timeline below shows just how rapidly homepages change over the course of a typical day.
Putting this all together, with a single line of SQL and the GFG's massive global catalog, we can for the first time perform planetary-scale explorations of the digital ecology of online journalism.
TECHNICAL DETAILS
All of the timelines above come from the results of the single Google BigQuery SQL query seen below. NOTE that running it will consume 4.68TB of your monthly quota and requires Standard SQL. This query comes from the set of sample queries we published when the GFG was first released.
SELECT numhours, COUNT(1)
FROM (
  -- Number of rows recorded for each distinct URL
  SELECT ToLinkURL, COUNT(1) numhours
  FROM (
    SELECT ToLinkURL, DATE FROM `gdelt-bq.gdeltv2.gfg_partitioned`
  )
  GROUP BY ToLinkURL
)
-- Number of URLs sharing each row count
GROUP BY numhours
ORDER BY numhours ASC
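To make the two-level grouping in the query easier to follow, here is a minimal pure-Python sketch of the same logic applied to a handful of made-up rows. It does not call BigQuery. The function name, the sample URLs and the DATE values are all hypothetical, and the real table's DATE format may differ.

```python
from collections import Counter


def lifespan_histogram(rows):
    """Mimic the query's grouping on (ToLinkURL, DATE) rows."""
    # Inner GROUP BY ToLinkURL: count the rows recorded for each URL (numhours).
    rows_per_url = Counter(url for url, _ in rows)
    # Outer GROUP BY numhours: count how many URLs share each row count.
    urls_per_numhours = Counter(rows_per_url.values())
    # ORDER BY numhours ASC.
    return sorted(urls_per_numhours.items())


# Made-up rows: one per link per snapshot in which it was seen.
sample_rows = [
    ("https://example.com/a", 2018030100),
    ("https://example.com/a", 2018030101),
    ("https://example.com/a", 2018030102),
    ("https://example.com/b", 2018030100),
    ("https://example.com/c", 2018030101),
    ("https://example.com/d", 2018030100),
    ("https://example.com/d", 2018030101),
]
histogram = lifespan_histogram(sample_rows)
print(histogram)
```

Like the query as written, this counts rows per URL, so a URL recorded more than once for the same DATE value would contribute more than one row to its numhours total.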