Performing At-Scale Entity Extraction Over The News Using BigQuery UDFs & Web NGrams 3.0

Earlier this week we showed how to write a simple Perl script to download the latest Web NGrams 3.0 dataset each minute and extract its entities using a part of speech tagger. Today we're going to show how to run a simple JavaScript-based entity extractor at scale using BigQuery. The Web NGrams 3.0 dataset is available in BigQuery, which also supports the use of external JavaScript libraries as User Defined Functions (UDFs), allowing us to run any JavaScript library over the ngrams at BigQuery scale!

Simply by downloading a copy of any self-contained JavaScript library to GCS, we can use it like a built-in BigQuery function. Here we use the "compromise" library's topic extraction feature to take each snippet and extract all of the key entities within and return then as an array. The top code loads the compromise library from a copy sitting in GCS and sets it up to be called as a function in our SQL query. The innermost query concatenates the pre+ngram+post fields together into a "snippet" field, which the outer query runs through compromise to extract its topics. Finally, the outermost query groups by topic and for each topic returns a list of the unique URLs it was seen in. That's all there is to it!

CREATE TEMPORARY FUNCTION nlp_compromise_extract_topics(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
return nlp(str).topics().out('topk').map(x=>x.normal)
'''
OPTIONS (library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

select topic, ARRAY_AGG(DISTINCT url) from (
  select url, nlp_compromise_extract_topics(snippet) topics from (
    SELECT url, CONCAT(pre, ' ', ngram, ' ', post) snippet FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-19" and lang='en' limit 1000
  ) 
), UNNEST(topics) topic group by topic

This query doesn't check whether the entity appears at the start or end of the snippet or the other checks that our Perl script code did, but those could easily be added to the query to fully replicate the Perl script's logic.

Running the query above yields results like the following:

russia https://thediplomat.com/2021/10/tensions-rise-between-tajikistan-and-the-taliban/
https://www.unionleader.com/news/military/blinken-arrives-in-ukraine-says-russia-could-attack-at-short-notice/article_6dd0e7fe-92eb-5f9e-acb0-8a35dcd40d62.html
https://www.theuknews.com/news/272207220/nato-countries-urged-to-supply-more-weapons-to-ukraine
https://www.csmonitor.com/World/Europe/2022/0119/More-Russian-troops-head-to-Ukraine-s-border-as-diplomacy-stalls
https://www.washingtonpost.com/opinions/2022/01/19/house-republicans-aim-sanctions-putin-his-family-his-mistress/
van halen https://96krock.com/2022/01/19/valerie-bertinelli-on-eddie-van-halen-ive-never-felt-love-like-that/
https://ilovebobfm.com/2022/01/19/valerie-bertinelli-on-eddie-van-halen-ive-never-felt-love-like-that/
neil gorsuch https://www.washingtonexaminer.com/news/roberts-did-not-ask-gorsuch-or-any-other-justices-to-mask-up-on-bench

Of course, you can use any JavaScript library, so you could run your own customized extraction code at scale!