As we carefully construct the training and test corpora for our machine translation models, one tool we rely upon heavily is BigQuery's built-in ML.NGRAMS function, which constructs arbitrary rolling-window word shingles across corpora of any size. It is extremely efficient and can generate a range of shingle sizes in a single call, from unigrams up to any required size, returning the combined output. Simply by wrapping it around a nested set of preprocessing and cleaning regular expressions and UDFs and passing the output to a GROUP BY, we can generate ngram histograms over tens or even hundreds of terabytes of text in just minutes.
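For readers unfamiliar with rolling-window shingling, the logic can be sketched in a few lines of Python. This is only an illustrative stand-in for the ML.NGRAMS-plus-GROUP-BY pipeline described above, assuming simple whitespace tokenization (the real workflow runs inside BigQuery at terabyte scale, with the regular-expression and UDF cleaning steps omitted here):

```python
from collections import Counter

def ngrams(tokens, min_n=1, max_n=3, sep=" "):
    """Generate all rolling-window shingles from min_n up to max_n tokens,
    returning the combined output across all window sizes."""
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

# A toy corpus; the histogram stands in for the GROUP BY aggregation step.
corpus = ["the quick brown fox", "the quick red fox"]
histogram = Counter()
for doc in corpus:
    histogram.update(ngrams(doc.split(), min_n=1, max_n=2))

print(histogram["the quick"])  # the bigram appears in both documents → 2
```

The single pass over window sizes mirrors the way a single call can emit everything from unigrams up to the largest requested shingle size in one combined result set.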
In our case, we run it across our holdings in a given language as a form of Keyword In Context (KWIC) analysis. We run it first in unigram mode to generate a list of all of the unique words we've monitored in a given language. For every word that has appeared more than once, we then run it a second time with a range of larger window sizes and filter the output into a set of KWIC windows of various sizes for each word. We repeat this process for phrases such as entity names and other well-known expressions. Finally, we run a series of statistical and linguistic analyses on those windows to construct a balanced view of each word's contexts, paying special attention to outliers that represent unusual usage, ensuring that our models capture the highest-fidelity view possible of each language and the fullest extent of human creativity in linguistic expression.
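The second, larger-window pass can be sketched as follows. This is a simplified Python illustration of extracting KWIC windows for a single target word, with a hypothetical fixed context width on each side (the production pipeline instead filters the combined multi-size ngram output inside BigQuery):

```python
def kwic_windows(docs, target, width=2):
    """Collect Keyword-In-Context windows: up to `width` tokens of context
    on each side of every occurrence of `target` across the corpus."""
    windows = []
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - width)
                windows.append(" ".join(tokens[lo:i + width + 1]))
    return windows

docs = ["the quick brown fox jumps", "a lazy fox sleeps"]
print(kwic_windows(docs, "fox", width=1))
# → ['brown fox jumps', 'lazy fox sleeps']
```

Varying `width` yields the set of differently sized windows per word on which the downstream statistical and linguistic analyses operate.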