Using BigQuery As A Lens To Explore Large Datasets

One of the most incredible aspects of the modern cloud is the way in which it accelerates massive ad hoc data analysis to near-realtime speeds. For a forthcoming series of massive new analyses, we needed to analyze a large social dataset totaling many billions of records and scattered across hundreds of thousands of small compressed files. Using GCP it was trivial to spin up a small cluster of 32-core 200GB RAM machines, each with a 1.5TB Local SSD array. Streaming the vast scattered dataset onto this cluster and using GCS as the backing storage fabric, we were able to reprocess the entire dataset into cleaned and normalized daily-level files in a final production GCS directory in a matter of hours.
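Provisioning a node of this shape takes just a few gcloud commands. A minimal sketch, assuming a hypothetical instance name, project, and zone (four Local SSD partitions at 375GB each yield the 1.5TB array described above):

```shell
# Sketch only: provision one worker VM of the shape described above.
# The instance name, project, and zone are hypothetical placeholders.
gcloud compute instances create worker-01 \
    --project my-project --zone us-central1-b \
    --custom-cpu 32 --custom-memory 200GB \
    --local-ssd interface=scsi \
    --local-ssd interface=scsi \
    --local-ssd interface=scsi \
    --local-ssd interface=scsi \
    --image-family debian-11 --image-project debian-cloud
```

Repeating this command (or using an instance template) for each node brings the whole cluster up in minutes.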

Once the normalized dataset was in GCS we no longer needed all of that Local SSD capacity, and once again GCP made it trivial to spin down the cluster and spin it back up without the local disk and with less RAM per virtual core. Since the full normalized dataset was stored in GCS, it was trivial to run a range of algorithms, including CLD2 language detection, across the full dataset with linear scaling. We simply fired up worker threads, one per virtual core, across all of the machines and used a central dispatch server to coordinate among the workers. Each worker streamed a given daily file directly from GCS, processing each record as it arrived and streaming its results right back to a new GCS file using GCS' streaming write capability combined with mbuffer's buffering and pigz's parallel gzip compression. This meant that despite the daily files being extremely large, each machine needed only a 10GB boot disk to hold the OS, since no local data storage was required.
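A single worker's streaming pipeline can be sketched as below. The bucket paths, file names, and the per-record processor (`./process_records`, e.g. a CLD2 wrapper) are hypothetical placeholders; the pattern of streaming in from GCS, buffering with mbuffer, compressing with pigz, and streaming back out with `gsutil cp -` is the one described above:

```shell
# Sketch of one worker's zero-local-disk pipeline.
# gsutil streams the daily file in, pigz decompresses in parallel,
# the (hypothetical) processor handles one record at a time, mbuffer
# smooths out bursts between stages, and pigz plus gsutil's streaming
# write ("cp -" reads from stdin) send the results straight back to GCS.
gsutil cat gs://my-bucket/normalized/20230101.json.gz \
  | pigz -dc \
  | ./process_records \
  | mbuffer -q -m 1G \
  | pigz \
  | gsutil cp - gs://my-bucket/results/20230101.json.gz
```

Because every stage operates on a stream, the pipeline's memory footprint is bounded by the mbuffer size regardless of how large the daily file is.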

The final processed dataset of 8 billion rows, reduced to just a few normalized fields per record, was loaded into BigQuery with a single invocation of BigQuery's Python CLI, which supports wildcard file loading. With a single command, BigQuery loaded 8 billion records totaling 5TB into a new table in just under 2 minutes and 40 seconds. Once in BigQuery, all 8 billion rows could be explored, summarized and trended with a single line of SQL, each query taking from a few seconds to a few tens of seconds.
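The wildcard load and a first exploratory query can be sketched as follows; the dataset, table, bucket, and column names are hypothetical placeholders, and the `*` in the GCS URI is what lets one `bq load` invocation sweep up every daily file:

```shell
# Sketch only: load all daily result files into one BigQuery table.
# Dataset/table/bucket names are hypothetical; bq expands the wildcard.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect \
    mydataset.social_records \
    'gs://my-bucket/results/*.json.gz'

# A single line of SQL then scans all 8 billion rows in seconds,
# e.g. tallying records by detected language (hypothetical column):
bq query --use_legacy_sql=false \
    'SELECT lang, COUNT(*) AS cnt FROM mydataset.social_records GROUP BY lang ORDER BY cnt DESC'
```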

From start to finish, it took just a few minutes to spin up a large high-IO cluster, which then needed only a matter of hours to reprocess a vast scattered dataset comprising hundreds of thousands of files into a single set of daily files. Then, with a few clicks, the cluster could be recreated without the Local SSD and with less RAM to run a series of analytic algorithms over the full dataset, streaming the input data directly from GCS and streaming the results back into GCS. This meant there was no local disk requirement for any of the cluster machines, and scalability was both effectively linear and essentially unbounded. Finally, years of daily data could be loaded into BigQuery with a single command line, taking just 2 minutes 40 seconds to complete and making it possible to analyze billions of records in seconds per query.

Putting this all together, the ability to run massive ad hoc analyses at these scales and speeds is a true testament to the power of the modern commercial cloud.