Kalev's latest Forbes piece explores what is really in the Internet Archive's Wayback Machine and the nuances and biases of its window onto the last 20 years of the evolution of the web.
Executing this study involved a substantial amount of analysis over very large log files. It began by downloading the Alexa Top One Million list and using GNU parallel to fetch, via the Archive's CDX Index Server API, the list of all Archive snapshots of each site's homepage, yielding one million small files. Even on a 32-core machine with 200GB of RAM and 10TB of SSD disk, and with all one million files resident in the RAM buffer cache, a single grep or wc pass to search or tally the files took 19 hours, owing to the massive process-startup overhead of stream processing so many small files. Batching the files together or writing custom scripts to avoid repeatedly launching tools like grep would have helped, but given the number and diversity of analyses to be performed, something more sophisticated was needed.
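The fetch step might be sketched roughly as follows; the input file name, output directory, and parallelism level are illustrative assumptions, not the study's actual setup:

```shell
# Assumed input: top-1m.csv with "rank,domain" rows (the Alexa Top One Million format).
mkdir -p cdx
# For each domain, pull its full CDX snapshot history for the homepage;
# GNU parallel fans the requests out across many concurrent jobs.
cut -d, -f2 top-1m.csv | parallel -j32 --retries 3 \
  'curl -s "https://web.archive.org/cdx/search/cdx?url={}" -o "cdx/{}.cdx"'
```

Each domain ends up in its own small CDX file, which is exactly the million-small-files situation that made grep and wc so slow.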
A simple "cat" was used to concatenate all one million files into a single 25GB master log file totaling just over 240 million rows, which was imported directly into BigQuery. Concatenating the million files and uploading them took less than five minutes in total. Loading into BigQuery was as simple as pointing it at the concatenated file, specifying spaces as the delimiter, giving the name and type of each column, and clicking the "load" button. Within just a few minutes, the entire 240-million-record dataset was ready for querying.
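The concatenate-and-load step might look something like the sketch below; the dataset and table names are hypothetical, and the schema follows the default seven space-delimited fields the CDX server returns:

```shell
# Concatenate the one million per-site CDX files into a single master log.
cat cdx/*.cdx > all_snapshots.cdx

# Load it into BigQuery (my_dataset.wayback_snapshots is a hypothetical name).
# The default CDX output is space-delimited with these seven fields:
#   urlkey timestamp original mimetype statuscode digest length
# statuscode and length are kept as STRING because CDX emits "-" for missing values.
if command -v bq >/dev/null; then
  bq load --source_format=CSV --field_delimiter=' ' \
    my_dataset.wayback_snapshots all_snapshots.cdx \
    'urlkey:STRING,timestamp:STRING,original:STRING,mimetype:STRING,statuscode:STRING,digest:STRING,length:STRING'
fi
```

The bq invocation is guarded so the sketch is harmless on a machine without the Cloud SDK; in practice it corresponds to the point-delimit-schema-load sequence described above.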
Queries on the dataset typically complete in under a second in BigQuery, while even complex aggregation and summary-statistic queries take just 5-10 seconds. This enabled real-time exploration of the full dataset to tease out emerging patterns and nuances, and supported interactive hypothesis testing against particular questions of interest.
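As one illustration of the kind of aggregation query this enables (the table name here is hypothetical): the CDX timestamp is a 14-digit YYYYMMDDhhmmss string, so its first four characters give the snapshot year, and tallying snapshots per year is a one-line GROUP BY:

```shell
# Hypothetical table name; guarded so the sketch is inert without the Cloud SDK.
if command -v bq >/dev/null; then
  bq query --use_legacy_sql=false '
    SELECT SUBSTR(timestamp, 1, 4) AS year, COUNT(*) AS snapshots
    FROM my_dataset.wayback_snapshots
    GROUP BY year
    ORDER BY year'
fi
```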
This is a powerful use case for BigQuery: the ability to load very large delimited log files within minutes and run rapid exploratory analysis across hundreds of millions of records.