2015-internet-archive-hathitrust-books

What It Looks Like to Process 3.5 Million Books

While we'll be putting together a longer piece soon that goes into far more detail on how we processed the 3.5 million book archive, we wanted to put out a short blog post that gives a broad level overview of the project's technical infrastructure.  Once again we'd like to thank Google for the computing power that made this project possible.

HathiTrust research extracts are delivered through rsync to a single machine, so a single 8 core machine with a 2TB SSD persistent disk was used to ingest the content, running 8 rsync processes in parallel (which maxed out the available bandwidth from the HathiTrust rsync server).  The machine was then rebooted into a 16-core machine that was used to unzip, repack the one-file-per-page format of the HathiTrust books into a single ASCII file per book, and ship the books to Google Cloud Storage (GCS).  GCS was used to house the complete collection to make it instantly available to all Google Compute Engine (GCE) instances within the project (more on this in a moment).

Internet Archive books are available via standard HTTP GET from the Archive's website.  A single 16-core machine was used to download all books in parallel using CURL and GNU Parallel, zip the books up, and store them into GCS.

At this point all 3.5 million books were stored into GCS.  An existing 8 core machine with 1.5TB of SSD persistent disk that was being used as a MySQL database server and simple Apache webserver for another experiment was used to coordinate the processing of the books.  All 3.5 million GCS filenames of the books were stored into a database table, 100 to a record.  A simple CGI script allows basic coordination via HTTP GET.  A client can request a new batch of 100 books to process, in which case the CGI script requests a new record from the database and marks it as "in progress" along with the timestamp and IP of the machine processing it, and a client can inform the coordination server that it has finished processing a given batch, in which case the CGI script marks the record in the database as "complete".  This is all the management server does, keeping its role simple.

Ten 16 core High Mem (100GB RAM) GCE instances were launched (160 cores total) to handle the processing of the books, each with a 50GB persistent SSD root disk.  SSD disk was used for their root disks since disk IO is minimal for GDELT GKG processing, so SSD disks allowed using a small disk per machine, but ensuring that the 50GB disk had sufficient IO servicing capabilities (since SSD disks offer far greater per-GB IO rates compared with traditional persistent disks).

Each of the ten compute instances ran a background process that was a simple PERL script that maintained a cache directory of 32 (16 cores * 2) batches of books to process (recall that each batch contains a list of 100 books).  Every 30 seconds it would check to ensure there were 32 batches ready and if not, it would contact the central management server via a simple HTTP GET request and get back another batch of books to process.  Each request for a new batch of books from the central management server results in a list of 100 GCS filenames, which the PERL script then downloads from GCS, unzips, and stores together in a single XML file.  A set of 16 GDELT GKG processing engines run in separate threads, one per core, and each processes a batch, then requests another batch file from the central queue.  The separation of downloading and processing ensures that processing threads are never stalled while books are downloaded, unzipped and repacked from GCS.  At the end of processing a batch, the compute thread uploads the final extracted metadata back to GCS, confirms the upload, then notifies the management server that the batch has been completed, then moves to the next batch.

GCS is used as a central storage fabric connecting all nodes.  Unlike a traditional filesystem, GCS shows no measurable degradation even when storing many millions of small files in a single directory, making it ideal for an application like processing millions of books.  By storing each book as its own file in GCS, it was possible to adjust the batch size from a fixed size of 100 books per batch to a more variable size later on during the project that grouped books in adhoc ways to work around longer books and those with more complexities (such as higher OCR error or other issues that require more RAM or CPU time to process).  The ability of any GCE instance within a project to read and write to a common GCS storage pool meant that all ten compute instances could read from the same pool of books and write their results to the same central scratch directory.  Unlike typical network filesystems used on traditional HPC clusters, GCS exhibits no measurable performance degradation, even under extremely heavy load with 160 processes uploading to it and 10 processes downloading from it simultaneously at times, meaning that it can serve as an effectively-linearly-scaling shared storage fabric connecting a cluster of compute nodes.

Finally, a 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks was used to reassemble the results, merge with the library-provided metadata, and reorder and sort the books by publication year into separate files.  Sustained bandwidth of over 750MB/s was seen on the instance, making it possible to rapidly assemble all of the results in a single place.  Output files from the compute nodes were downloaded from GCS in 300GB chunks directly to one of the Local SSD Disks, then the other three drives were used in parallel to reassemble, merge, and reorder the results, with the final results then copied to the 10TB persistent disk.  Finally, the results were merged back from the 10TB disk batch to the scratch disks in 300GB chunks, uploaded to GCS and then to BigQuery, and zipped and stored for permanent archival into GCS.  Once in BigQuery, all kinds of analyses are possible, including scanning the complete fulltext of 122 years of books totaling half a terabyte in just 11 seconds, which works out to around 45.5GB/s – pretty incredible for a single line of SQL code.

Perhaps most amazingly is that the entire project took just two weeks, much of which was consumed with human verification and addressing of issues with the library-provided publication metadata for each book, requiring reprocessing of subsets of the collection.  While 160 cores and 1TB of RAM on the one hand is quite a powerful small cluster, previous attempts to process just a subset of the collection had used twice the number of cores on a state-of-the-art modern HPC supercluster for over a month of total processing time and extracted just a fraction of the data computed here.  The limiting factor with the supercluster was data movement – moving around terabytes of books and their computed metadata across hundreds of processors in realtime, and then making all of that material available for interactive querying.  That's where Google's Cloud Platform really shines in its purpose-built capacity for moving and working with massive datasets.  A project like this certainly consumes enormous processing power, but the real limiting factor is not the number or speed of the processors, but rather the ability to move all of that data around to keep those processors fed.  On the one hand, 160 processors is quite a lot, but on the other hand, the ability to simply spin up ten machines in just a few minutes, run them for two weeks, then spin them all down and have the final results available in a massive realtime database platform – well, that is the future of big data computing.

The ability of GCS to operate as a shared storage fabric across a cluster of machines, the ability to instantly spin up a mini-cluster in just minutes with 160 cores and 1TB of RAM, and single machines with 32 cores, 200GB of RAM, 10TB of SSD disk, and 1TB of direct-attached scratch SSD disk, with the results feeding directly into BigQuery via GCS, with BigQuery then allowing near-realtime querying over the results, is an example of what is possible under Google Cloud Platform with its emphasis on the kind of massive-data analysis and movement that is the lifeblood of projects like this.