GDELT's Global Frontpage Graph (GFG) launched in March of last year and accumulated more than 35 billion URLs in just its first five months. As the GFG reaches its first anniversary, it now records 103 billion outlinks from the world's news homepages to 650,755,970 distinct URLs.
Even at this immense size, it took BigQuery just 910 seconds (15.17 minutes) to deduplicate all 103 billion URLs, an average of 114 million URLs and 7GB processed per second, using the single line of SQL below (NOTE that this query will consume 6.4TB of your monthly query quota and requires Standard SQL):
SELECT count(distinct ToLinkURL) FROM `gdelt-bq.gdeltv2.gfg_partitioned`
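For readers who want to check the throughput figures, here is a minimal sketch of the arithmetic in Python. It uses only the rounded numbers quoted above. The function name `throughput` and the choice of decimal units (1TB = 1,000GB) are assumptions made for illustration, not part of the GFG or BigQuery.

```python
# Rounded figures quoted in the post.
TOTAL_URLS = 103e9        # outlinks scanned (rounded to the nearest billion)
ELAPSED_SECONDS = 910     # reported query time
SCANNED_TB = 6.4          # reported bytes processed, in terabytes


def throughput(total_urls, elapsed_seconds, scanned_tb):
    """Return (minutes, URLs per second, GB per second).

    Assumes decimal units: 1 TB = 1,000 GB.
    """
    minutes = elapsed_seconds / 60
    urls_per_second = total_urls / elapsed_seconds
    gb_per_second = scanned_tb * 1000 / elapsed_seconds
    return minutes, urls_per_second, gb_per_second


minutes, urls_per_second, gb_per_second = throughput(
    TOTAL_URLS, ELAPSED_SECONDS, SCANNED_TB
)
print(f"{minutes:.2f} minutes")
print(f"{urls_per_second / 1e6:.1f} million URLs per second")
print(f"{gb_per_second:.2f} GB per second")
```

With the rounded inputs this works out to a little over 15 minutes, roughly 113 million URLs per second, and about 7GB per second. The small gap from the 114 million quoted above presumably comes from the total being slightly more than the rounded 103 billion.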
Once again, BigQuery makes it trivial to work with massive datasets.