GDELT's Global Frontpage Graph (GFG) launched in March of last year, accumulating more than 35 billion URLs in just its first five months. As the GFG hits its ten-month anniversary, it has now reached 76 billion outlinks from the world's news homepages to 487,575,443 distinct URLs, making it more than two and a half times larger than the Facebook research outlink dataset.
Even at this immense size, BigQuery took just 699 seconds (11.65 minutes) to deduplicate all 76 billion URLs, averaging 109 million URLs and 6.7GB processed per second, using the single line of SQL below (NOTE that this will consume 4.67TB of your monthly query quota and requires Standard SQL):
SELECT count(distinct ToLinkURL) FROM `gdelt-bq.gdeltv2.gfg_partitioned`
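For readers who want to check the throughput figures quoted above, here is a minimal Python sketch that rederives them from the three numbers in this post (76 billion URLs, 699 seconds, 4.67TB scanned). The variable names are purely illustrative, and it assumes decimal units (1TB = 1,000GB); with binary units the per-second rate would come out slightly higher.

```python
# Figures quoted in the post; variable names are illustrative only.
total_urls = 76e9          # outlinks scanned by the query
elapsed_seconds = 699      # reported query runtime
scanned_tb = 4.67          # reported bytes processed, in TB

# Derived rates. Assumption: decimal units, 1 TB = 1000 GB.
minutes = elapsed_seconds / 60
urls_per_second = total_urls / elapsed_seconds
gb_per_second = scanned_tb * 1000 / elapsed_seconds

print(f"{minutes:.2f} minutes")
print(f"{urls_per_second / 1e6:.0f} million URLs per second")
print(f"{gb_per_second:.1f} GB per second")
```

Under those assumptions, the three printed values should round to the 11.65 minutes, 109 million URLs per second, and 6.7GB per second cited above.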
Once again, BigQuery makes it trivial to work with massive datasets.