At the core of every machine translation system lies data. Vast archives of monolingual and bilingual training and testing data form the models that allow machines to translate from one language to another. News content differs from most other kinds of monolingual material in that it places a heavy emphasis on recency, with the expression of both time and space woven intricately into most stories. The who, what, where, when, why and how of the news mean its linguistic patterns often differ substantially from the more traditional kinds of content used to train MT systems. Unlike almost any other form of information, news spans the totality of human existence, covering every imaginable topic from the most prominent to the most obscure, mentioning people, places, activities and things spanning the entire globe, and weaving narratives from the simple to the intricately complex for myriad audiences from the layperson to the specialist. Taken as a whole, the vocabulary of the news spans the furthest corners of language, pushing the boundaries of MT systems more than almost any other application. Given that news media today often incorporate and reference social media trends, news-optimized MT systems must equally handle social media content woven through news coverage, moving seamlessly across their disparate voices.
Constructing and evaluating the underlying corpora used to train and test MT models requires immense research, measuring everything from each document's representativeness of the language as a whole to the presence of bilingual content, topical and entity distribution, language and grammatical use, narrative structure and myriad other indicators, to ensure a strong, balanced snapshot of that language that will yield a robust MT model.
To construct these corpora and to analyze them, we use a two-part pipeline that makes use of both local VM RAM disks and BigQuery.
To construct our training and evaluation corpora, we draw from across GDELT's archives in a vast array of formats, from modern to legacy and from widely supported formats like JSON to compact binary formats optimized for specific tasks. Many of these datasets are designed for specific tasks, are fully denormalized or are stored in formats optimized for their specific use case, such as Translingual 1.0 spectral logs that capture the grammatical constructs and word contexts our existing models have observed evolving into new forms and/or have struggled with. GDELT 2.0 made heavy use of purpose-built formats designed for highly optimized binary interchange, including formats that allowed UTF8-encoded Unicode content to be observed in parallel as both UTF8 and ASCII, maximizing throughput for latency-sensitive applications.
This means that drawing together all of this data and reformatting it into a consistent normalized format for machine translation corpus use requires an environment that can sustain extremely high random IOPS while we extract, reformat, compile and normalize. While direct-attached "Local SSDs" in GCE offer high-performance, high-IOPS storage, RAM disks offer an orders-of-magnitude increase in performance, and their filesystem-like semantics allow the use of existing filesystem-dependent tooling.
Today's M2 GCE VMs support up to 12TB of RAM in an SSI.
Thus, to bring our datasets together and normalize them, we use large-memory GCE VMs to draw together our disparate archives and normalize them into standard UTF8 JSON-NL files.
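The normalization step can be sketched as follows. This is only a minimal illustration, assuming a Linux VM where a tmpfs mount serves as the RAM disk and a hypothetical tab-delimited legacy input with three fields (url, lang, text). The mount point, size, field names and the tsv_to_jsonl helper are all assumptions for illustration, and the real archives use far more varied formats.

```shell
#!/usr/bin/env bash
# Creating the RAM disk itself needs root (path and size are illustrative):
#   sudo mkdir -p /mnt/ramdisk
#   sudo mount -t tmpfs -o size=512g tmpfs /mnt/ramdisk

# Convert hypothetical tab-delimited records (url, lang, text) on stdin
# into one JSON object per line (JSON-NL) on stdout.
# Assumption: fields contain no control characters other than the tab
# delimiter, so only backslashes and double quotes need escaping.
tsv_to_jsonl() {
  # First escape backslashes, then double quotes, across the whole line;
  # tabs are left untouched so awk can still split the fields.
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk -F'\t' 'NF == 3 {
      printf "{\"url\":\"%s\",\"lang\":\"%s\",\"text\":\"%s\"}\n", $1, $2, $3
    }'
}

# Example use against files staged on the RAM disk (paths are hypothetical):
#   tsv_to_jsonl < /mnt/ramdisk/legacy/batch.tsv > /mnt/ramdisk/normalized/batch.json
#   gsutil -q cp /mnt/ramdisk/normalized/batch.json gs://bucket/sourcefiles-batch.json
```

Because the RAM disk behaves like an ordinary filesystem, standard stream tools such as sed and awk can be used unchanged. Lines that do not have exactly three fields are skipped rather than emitted as malformed JSON.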
Once our corpora have been normalized and stored as JSON-NL files in GCS, we can operate on them directly through BigQuery's "external data sources" capability, which allows us to query files in GCS without first loading them into a BigQuery table. We combine this with BigQuery's "EXPORT DATA" capability to write the results directly back to GCS.
Thus, we can compute any analysis of our corpus using a single BQ CLI command that combines an "external_table_definition" flag and an "EXPORT DATA" SQL command as:
time bq query --use_legacy_sql=false --external_table_definition=data::[TABLE DEFINITION GOES HERE]@NEWLINE_DELIMITED_JSON=gs://bucket/sourcefiles*.json "EXPORT DATA OPTIONS(uri='gs://tmpbucket/tmpoutputfiles-*.json', format='JSON', overwrite=true) AS select [QUERY GOES HERE] from data"
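As a concrete illustration of the template above, the sketch below fills in the placeholders with a hypothetical three-column inline schema (url, lang, text) and a per-language document count. The schema, the bucket paths and the build_corpus_query helper are assumptions for illustration rather than the actual corpus layout. The helper only prints the command, so it can be inspected before being run.

```shell
#!/usr/bin/env bash
# Hypothetical schema and bucket paths; substitute the real ones.
SCHEMA='url:STRING,lang:STRING,text:STRING'
SOURCE='gs://bucket/sourcefiles*.json'
OUTPUT='gs://tmpbucket/tmpoutputfiles-*.json'

# Print the bq command that runs the given SELECT statement over the
# external table "data" and exports the result to sharded JSON files.
build_corpus_query() {
  local select_sql="$1"
  printf '%s\n' "bq query --use_legacy_sql=false --external_table_definition=data::${SCHEMA}@NEWLINE_DELIMITED_JSON=${SOURCE} \"EXPORT DATA OPTIONS(uri='${OUTPUT}', format='JSON', overwrite=true) AS ${select_sql}\""
}

# Example: count documents per language across the whole corpus.
build_corpus_query 'select lang, count(*) as docs from data group by lang'
```

Printing the command first makes it easy to check the schema and URIs before the query is actually executed, for example by pasting the printed line into a terminal.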
We can then concatenate the sharded output using a simple gsutil command:
time gsutil -q cat gs://tmpbucket/tmpoutputfiles-*.json | gsutil -q cp - gs://bucket/finaloutputfile.json
The importance of BigQuery's ability to operate at "dataset scale" cannot be overemphasized in terms of the profoundly new capabilities it makes possible. The ability to sift population-scale patterns from entire repositories means we can identify microlevel patterns too rare to surface in small random samples and observe complex interactions that would not be clearly captured at even medium sample sizes.
This workflow is a perfect example of the power of the modern cloud, using GCE's RAM disks to assemble and normalize vast archives and then BigQuery to interactively explore those archives at dataset scale.