Computing Quadgrams At BigQuery Scale Through ML.NGRAMS

Many questions in computational linguistics require computing character sequence frequencies over vast corpora. That demands strong scalability and robust distributed architectures capable of sharding textual archives ranging from terabytes to tens or hundreds of terabytes or even petabytes of text across thousands or tens of thousands of cores, preprocessing each document, running a rolling ngram window over its text, then aggregating and ranking those ngrams at corpus scale to return a final corpus-level result.

It turns out that BigQuery offers a tailor-made environment for precisely this kind of analysis, complete with a built-in ngramming function that allows the computation of character ngrams like quadgrams with just a single SQL query!
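
To see the core building block in isolation, the tiny standalone query below (a purely illustrative example over a literal string) shows how ML.NGRAMS() takes an array of tokens, an ngram size range and a joining separator, while SPLIT() with an empty delimiter breaks a string into its individual characters:

SELECT ML.NGRAMS(SPLIT('hello world', ''), [4,4], '') AS quadgrams

This returns a single row whose "quadgrams" array contains every four-character window of the input, spaces included: "hell", "ello", "llo ", "lo w" and so on.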

With a single command you can ask BigQuery to load an archive of text documents stored in JSON format in GCS, convert all text to lowercase using Unicode-aware case conversion, replace all standard punctuation with spaces (again using Unicode rules), collapse runs of spaces, run a rolling window of any size (or even a range of window sizes all at once) across the resulting text, then sort the resulting ngrams, compute each as a percentage of the count of the most common ngram in the specified size range and save the result to GCS!

The query below does all of this with a single CLI invocation:

time bq query --use_legacy_sql=false --external_table_definition=data::lang:STRING,text:STRING@NEWLINE_DELIMITED_JSON=gs://YOURGCSBUCKET/INPUTFILE.json \
"EXPORT DATA OPTIONS(uri='gs://YOURGCSBUCKET/TEMPEXPORTFILENAME-*.json', format='JSON', overwrite=true) AS select ngram, cnt, perc from (select ngram, count(1) cnt, count(1)/(max(count(1)) OVER ())*100 perc FROM data,UNNEST(ML.NGRAMS(SPLIT(TRIM( REGEXP_REPLACE( REGEXP_REPLACE(LOWER(text), r'\p{Common}', ' '), r'\s+', ' ') ), ''), [4,4], '')) ngram group by ngram) where perc >=50.0 order by cnt desc"

Note that the query is expressed in simple standard SQL, with BigQuery taking care of all of the underlying work to marshal the necessary cores, shard the data across them, handle the distributed communication among them and return the final result to GCS!

Note how simple the query is, relying on ML.NGRAMS() to run the rolling window, SPLIT() to break the text into individual characters and simple regular expressions to preprocess the text. This particular example relies on "\p{Common}" to replace all characters in the "Common" Unicode script, which includes most standard punctuation, but the pattern can be readily tailored to target language-specific punctuation that isn't included in Common, while "\s" might be adjusted to target a specific set of spacing codepoints, such as special handling for ZWSPs, as in the sketch below.
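
For example, a hypothetical variant of the preprocessing above (the specific character classes here are illustrative choices, not requirements of ML.NGRAMS()) might delete zero-width spaces (U+200B) outright rather than letting them collapse into an ordinary space, and replace every Unicode punctuation and symbol codepoint instead of the Common script, by swapping the nested REGEXP_REPLACE() calls for something like:

REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE(LOWER(text), r'\x{200B}', ''), r'[\p{P}\p{S}]', ' '), r'\s+', ' ')

The surrounding TRIM() and SPLIT() wrappers from the full query then apply unchanged.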

Note also that this query uses a rolling window over the underlying code points, rather than grapheme clusters, since that makes the resulting quadgrams readily compatible with external text analysis tools that typically operate on code points rather than higher-order character compositions. Since ML.NGRAMS() simply operates on the results of SPLIT(), you can trivially change the pipeline's concept of a "character" by adjusting the preprocessing regular expressions or by incorporating a UDF for even more advanced preprocessing and using a custom SPLIT() token, as sketched below. You could even focus on a specific subset of a language, such as diacritic sequences.
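
As a minimal sketch of that idea, the standalone query below defines a hypothetical cluster_tokens() helper (an illustrative UDF, not a built-in function) that approximates grapheme clusters by grouping each base codepoint with any combining marks that follow it and inserting "|" as a custom SPLIT() token:

CREATE TEMP FUNCTION cluster_tokens(text STRING) AS (
  -- Insert a '|' boundary after each base codepoint plus any trailing combining marks,
  -- then trim the final delimiter so SPLIT() does not emit an empty trailing element.
  RTRIM(REGEXP_REPLACE(text, r'(.\p{M}*)', r'\1|'), '|')
);
-- Decomposed "cafe" + U+0301 yields a single quadgram in which the 'e' and its
-- combining accent are treated as one "character".
SELECT ML.NGRAMS(SPLIT(cluster_tokens(CONCAT('cafe', CODE_POINTS_TO_STRING([769]))), '|'), [4,4], '') AS quadgrams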

In short, through the power of BigQuery, you can compute character sequence frequency tables over corpora of effectively unlimited size using a single SQL statement and get the results back in just seconds to minutes without writing a single line of code!