Making NGrams At BigQuery Scale

Ever since 2010's Culturomics paper and with the rise of ever-more powerful linguistic modeling systems, ngrams have surged back into popularity as a quick, but primitive, way of studying language. On the surface, ngrams would appear to be quite simple to compute: just split each document into words and count how many times each word appears over the entire corpus. The first step is easy and is an embarrassingly parallel problem to boot – every document can be split into words and converted to a frequency histogram independently of every other document. Doing this quickly still requires thousands of processors, but those processors are only needed for a brief window of seconds to minutes: by the time you had managed to spin up hundreds of VMs or gotten a batch job to run on an academic HPC system, you could have finished the analysis.

The bigger problem lies in merging all of those millions of individual book histograms into a single global histogram. This is a classic map-reduce problem, and the algorithms and data flows needed to execute it are well understood. The issue is that it still takes a fair bit of code, plus the time to spin up all of that hardware and build the final pipelines.

This got us thinking: with all of the amazing things GDELT is able to accomplish using BigQuery, could we also use it to build ngrams in minutes with a single line of code? Indeed, last year we shared a very brief example of computing 1-grams from a single year of the Internet Archive public domain English books collection, but could we scale that up to all 122 years of fulltext, and push beyond single words into the much more massive domains of bigrams and even trigrams?

To refresh our memory, here's the code to compute single-word 1-grams over the entire Internet Archive public domain books collection. Notice that we've modified it slightly from the original example. The original dropped anything that wasn't an ASCII character from A-Z, but that distorts things once we move to bigrams and trigrams, since it causes sentences to blend into each other. Instead, the query below uses "\pP" to invoke the full Unicode-safe punctuation class (meaning we capture all forms of punctuation across all languages) and puts a space around each punctuation mark. This places spaces around quote marks, commas, periods, and so on so that they are treated as distinct words. Note that in some languages this can cause complications, since internal apostrophes will split words apart; down the road we will want to enhance this regex to properly handle those cases.
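
To see what that replacement actually does, here's a tiny standalone sketch you can run on its own (the sample sentence is just an illustration; any text works):

SELECT REGEXP_REPLACE(LOWER('He said, "Stop!" Then he left.'), r'(\pP)', r' \1 ') spaced

Every punctuation mark comes back padded with a space on either side, so the SPLIT() on ' ' in the real query below treats each comma, quote mark, and period as a token of its own.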

Without further ado, here's the final code for 1-grams:

SELECT word, count FROM (
  SELECT word, COUNT(*) AS count FROM (
    SELECT SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText), r'(\pP)', r' \1 '), ' ') word
    FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
  ) GROUP BY word
) WHERE count > 10000 ORDER BY count DESC

A word of warning: this query will consume 416GB of your query quota, given the total size of the fulltext of 122 years of books. In just 122 seconds it processes 10.1 billion words to generate the final 1-gram histogram, an effective speed of around 83 million words per second! Keep in mind that we're paying a sizable performance penalty for using TABLE_QUERY to span so many different tables; if this were all stuffed into a single table, BigQuery could really scream. Also remember that this figure includes sorting the final list and creating and writing it to a new table, so it includes all of the disk IO and data management costs.

But what happens when we want to move beyond this to bigrams? That requires a moving window over the text, which is considerably more complex to implement. Once again the amazing Felipe Hoffa came to the rescue, with sample code for computing trigrams in BigQuery that he wrote back in 2011. The heart of his approach is the LEAD() window function, which pairs each word with the word that follows it.
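
To make that concrete, here's a toy, self-contained sketch (the three inline rows are made up purely for illustration) showing how LEAD() pairs each word with its successor within a single document:

-- Toy example: three words of one document, ordered by position
SELECT word, LEAD(word) OVER(PARTITION BY id ORDER BY pos) nextword
FROM
  (SELECT 1 id, 1 pos, 'the' word),
  (SELECT 1 id, 2 pos, 'quick' word),
  (SELECT 1 id, 3 pos, 'brown' word)

This returns the pairs (the, quick), (quick, brown), and (brown, NULL); in the full query the trailing NULL pairing is discarded by the length() filters, and PARTITION BY id keeps the window from spilling across books. Modifying Felipe's code to work on the books corpus, the following query computes the complete set of bigrams for all 122 years of books in just a single line of code!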

SELECT word, nextword, COUNT(*) c
FROM (
  SELECT
    pos,
    id,
    word,
    LEAD(word) OVER(PARTITION BY id ORDER BY pos) nextword
  FROM (
    SELECT
      id,
      word,
      pos
    FROM
      FLATTEN( (
        SELECT
          id,
          word,
          POSITION(word) pos
        FROM (
          SELECT
            DocumentIdentifier id,
            SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText), r'(\pP)', r' \1 '), ' ') word
          FROM
            (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
         ) ), word) ))
WHERE
  length(word) > 1 and length(nextword) > 1
GROUP EACH BY
  1,
  2
HAVING c > 50
ORDER BY
  c DESC

Just by adding an additional LEAD() to the query and an additional ordinal to the GROUP EACH BY at the bottom, we can compute trigrams (or add yet more LEAD()s and GROUP BY ordinals to compute 4-grams and 5-grams; a 4-gram sketch appears after the trigram query below). The query below starts with a total universe of 78.7 billion rows and processes it at a rate of more than 31 million words per second! Once again, a single line of code is able to compute arbitrary-length ngrams over 122 years of books totaling more than 10.1 billion words!

SELECT word, nextword, nextword2, COUNT(*) c
FROM (
  SELECT
    pos,
    id,
    word,
    LEAD(word) OVER(PARTITION BY id ORDER BY pos) nextword,
    LEAD(word, 2) OVER(PARTITION BY id ORDER BY pos) nextword2
  FROM (
    SELECT
      id,
      word,
      pos
    FROM
      FLATTEN( (
        SELECT
          id,
          word,
          POSITION(word) pos
        FROM (
          SELECT
            DocumentIdentifier id,
            SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText), r'(\pP)', r' \1 '), ' ') word
          FROM
            (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
         ) ), word) ))
WHERE
  length(word) > 1 and length(nextword) > 1 and length(nextword2) > 1
GROUP EACH BY
  1,
  2,
  3
HAVING c > 50
ORDER BY
  c DESC
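
For completeness, here's what a 4-gram version might look like. This is just a sketch we haven't benchmarked: it adds a LEAD(word, 3) column, a fourth length() filter, and a fourth GROUP EACH BY ordinal, but is otherwise identical to the trigram query above.

SELECT word, nextword, nextword2, nextword3, COUNT(*) c
FROM (
  SELECT
    pos,
    id,
    word,
    LEAD(word) OVER(PARTITION BY id ORDER BY pos) nextword,
    LEAD(word, 2) OVER(PARTITION BY id ORDER BY pos) nextword2,
    LEAD(word, 3) OVER(PARTITION BY id ORDER BY pos) nextword3
  FROM (
    SELECT
      id,
      word,
      pos
    FROM
      FLATTEN( (
        SELECT
          id,
          word,
          POSITION(word) pos
        FROM (
          SELECT
            DocumentIdentifier id,
            SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText), r'(\pP)', r' \1 '), ' ') word
          FROM
            (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
         ) ), word) ))
WHERE
  length(word) > 1 and length(nextword) > 1 and length(nextword2) > 1 and length(nextword3) > 1
GROUP EACH BY
  1,
  2,
  3,
  4
HAVING c > 50
ORDER BY
  c DESC

Each additional LEAD() offset extends the window by one more word, so a 5-gram version follows the same pattern with LEAD(word, 4) and a fifth GROUP EACH BY ordinal.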

It's easy to modify the examples above to change the behavior of the final ngrams. For example, changing the contents of the SPLIT() command above to the following will remove all numbers and properly handle carriage returns. You can also remove the LOWER() if you'd prefer to record lowercase and uppercase forms of words separately.

      SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(LOWER(BookMeta_FullText),r'\s+', ' '), r'\d', ' '), r'(\pP)', r' \1 '), ' ') word
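
For instance, a case-preserving variant of that SPLIT() line (a sketch identical to the one above except that the LOWER() wrapper is dropped) would be:

      SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(BookMeta_FullText, r'\s+', ' '), r'\d', ' '), r'(\pP)', r' \1 '), ' ') word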

If you want a simple histogram that reports each word's count as a percentage of the most common word's count, you can modify the examples above to the following. (Note that the UNIQUE() here counts each word at most once per book, so this version measures how many books contain each word rather than raw word frequency.)

SELECT word, count, count/max * 100 perc FROM (
  SELECT word, count, FIRST_VALUE(count) OVER (ORDER BY count DESC) max FROM (
    SELECT word, COUNT(*) AS count FROM (
      SELECT UNIQUE(SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText), "\\pP", ' '), ' ')) AS word
      FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1922"'))
    ) GROUP BY word
  ) WHERE count > 10 ORDER BY count DESC
)

We're enormously excited by the potential of this new capability to generate arbitrary ngrams over massive text archives in just minutes, especially for the incredible opportunities it offers to explore the evolution of grammatical constructs and the linguistic structures that undergird communication. We're also exploring how we can use longer ngrams like 5-grams to contextualize actors and concepts across the globe as part of a massive new initiative we'll be announcing later this summer! Until then, happy ngramming!