Since the launch of GDELT's Global Frontpage Graph (GFG) on March 2nd of this year, the dataset has rapidly evolved into one of the world's largest open URL datasets and the largest catalog ever created of the hourly editorial decisions of news outlets across the planet.
To get a sense of just how massive this dataset is, in just five months it has recorded more than 35 billion outlinks from the world's news outlet homepages to more than 240 million unique URLs. That makes it larger after five months than Facebook's entire internal archive, spanning more than a year and a half, of every URL posted by at least 20 people and at least once publicly on its platform. If you thought social media datasets were large, it's only because there's never been an attempt like GDELT's to catalog the world at these scales.
Over the coming months we will be debuting a series of massive new analyses that combine the GFG with our other datasets, including several soon to be announced. We hope these will reshape how we think about conducting social science research at these scales and transform how we understand our global news ecosystem.
The entire GFG dataset is available as both an hourly CSV file archive and a BigQuery table. Make absolutely certain to use table decorators when querying the BigQuery table due to its large size (4.6TB and counting).
As a simple example of just how powerful BigQuery is, the figure of 240 million unique URLs above was calculated with a single line of SQL that took just 4 minutes to deduplicate 35 billion entries (NOTE – this query will consume 2.15TB of your query quota and requires Standard SQL):
SELECT COUNT(DISTINCT ToLinkURL) FROM `gdelt-bq.gdeltv2.gfg_partitioned`
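Because a query of this size eats a meaningful share of a query quota, it can be worth checking how much data it will scan before running it. The following is a minimal sketch, not part of the GFG tooling, that uses the dry-run mode of the google-cloud-bigquery Python client to get that estimate without executing or billing the query. It assumes the client library is installed and that default Google Cloud credentials and a project are configured. The helper names `bytes_to_tb` and `estimate_scan_tb` are our own for illustration, and treating a terabyte as 10^12 bytes is an assumption about how the figure above is counted.

```python
# The query from this post, unchanged apart from an output column alias.
QUERY = """
SELECT COUNT(DISTINCT ToLinkURL) AS unique_urls
FROM `gdelt-bq.gdeltv2.gfg_partitioned`
"""


def bytes_to_tb(num_bytes: int) -> float:
    """Convert a byte count to terabytes (assumed here to mean 10**12 bytes)."""
    return num_bytes / 10**12


def estimate_scan_tb(sql: str) -> float:
    """Dry-run a Standard SQL query and return the estimated scan size in TB.

    A dry run validates the query and reports the bytes it would process,
    but does not execute it.
    """
    # Imported lazily so bytes_to_tb can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials and project
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return bytes_to_tb(job.total_bytes_processed)


# Example usage (requires credentials, so it is left commented out):
# print(f"Estimated scan: {estimate_scan_tb(QUERY):.2f} TB")
```

If the estimate comes back larger than expected, narrowing the query to a subset of the table before running it for real is the obvious next step.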
Welcome to the power of BigQuery combined with one of the world's largest open data URL datasets!