Merging Billions of URLs Using BigQuery

One of the most amazing things about Google's BigQuery service is its nearly infinite scalability and the sheer speed at which it can perform tasks even on the largest datasets. In just the last few weeks we've used BigQuery to explore billions of location mentions to map the hidden geography of language, to chart 6.2 billion location references across three quarters of a billion articles to reveal the geography of the world's media, and even to create a 38-year timeline of global conflict in just three seconds.

Nearly every large scale analysis we perform today involves BigQuery in some fashion, from full-scale analyses performed entirely in BigQuery to massive workflows that use BigQuery as a consolidation and data management platform.

One simple but extremely powerful use case where we find BigQuery to be absolutely essential is the merging of multi-billion-row lists. For example, we will take several billion web pages and compute various kinds of fingerprinting hashes, using a large cluster of machines to stream data in from the web or Google Cloud Storage (GCS), compute the hashes, and then write the hashes back to GCS. Given BigQuery's full integration with GCS, we can either leave the files as-is in GCS and create a BigQuery table that reads directly from the GCS CSV files (which allows the files to be updated in real time) or simply load the files into a native BigQuery table for maximum performance.
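As a rough sketch, the external-table route might look like the following BigQuery DDL. The project, dataset, table, and bucket names here are hypothetical stand-ins, as is the three-column CSV layout; your actual fingerprint files will have their own schema.

```sql
-- Hypothetical names throughout: replace the project, dataset, table,
-- and gs:// paths with your own. Because this is an external table,
-- each query re-reads the CSV files in GCS, so updates to the files
-- are visible immediately without reloading.
CREATE EXTERNAL TABLE `my-project.hashes.url_fingerprints` (
  url STRING,
  url_hash STRING,
  content_hash STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/fingerprints/*.csv'],
  skip_leading_rows = 1
);
```

Loading the same files into a native table instead (for example with a `LOAD DATA` statement or a load job) trades that real-time freshness for faster scans, which is usually the right choice once the fingerprint files stop changing.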

Once the data is loaded into BigQuery, we can perform a range of data management tasks with astonishing speed. A basic task like calculating the number of unique hash values requires just a single line of SQL and takes only a few minutes even on a many-billion-row dataset. Want to take a 1.5-billion-record dataset and count how many fingerprint overlaps it has with a second 1-billion-row dataset and a third 2-billion-row dataset that aren't also overlaps with a fourth 2.5-billion-row dataset? The resulting comparison takes just a single line of SQL and under 5 minutes, even when comparing multiple fingerprint fields and applying additional regular expressions to the URLs! Seeding massive data processing pipelines for large-scale experiments is trivial in a BigQuery world!
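In BigQuery standard SQL, those two tasks might be sketched roughly as follows. The table and column names (`hashes_a` through `hashes_d`, `fingerprint`, `url`) are hypothetical stand-ins, and the regular expression is just a placeholder for whatever URL filter an experiment needs.

```sql
-- How many distinct fingerprint values does one table contain?
SELECT COUNT(DISTINCT fingerprint)
FROM `my-project.hashes.hashes_a`;

-- Fingerprints from A that also appear in B or C, but not in D,
-- with an extra regular-expression filter on the URL.
SELECT COUNT(DISTINCT a.fingerprint)
FROM `my-project.hashes.hashes_a` AS a
WHERE a.fingerprint IN (
        SELECT fingerprint FROM `my-project.hashes.hashes_b`
        UNION DISTINCT
        SELECT fingerprint FROM `my-project.hashes.hashes_c`)
  AND a.fingerprint NOT IN (
        SELECT fingerprint FROM `my-project.hashes.hashes_d`)
  AND REGEXP_CONTAINS(a.url, r'^https?://');
```

One caveat worth noting: `NOT IN` behaves unintuitively if the subquery can return NULL fingerprints, so a `LEFT JOIN ... WHERE d.fingerprint IS NULL` anti-join is a common and equivalent alternative when NULLs are possible.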