Using BigQuery UDFs To Perform On-The-Fly NLP On OCR’d Chyrons From The Cloud Video API

Last year Felipe Hoffa demonstrated the use of the "compromise" JavaScript NLP library as a BigQuery UDF, applying it to identify person names on-the-fly in text stored in BigQuery. In short, this basic NLP library can be loaded as a function into BigQuery and applied at runtime to extract names from massive volumes of text, all at BigQuery scale.

What would it look like to apply this approach to mine a day's worth of onscreen text from CNN OCR'd by the Cloud Video AI API to identify all of the people mentioned? OCR error, television news shorthand and inconsistent capitalization mean the results will be less accurate than on properly capitalized traditional news content, but nonetheless offers a look at the different ways we can construct ad-hoc analytic pipelines today.

In all, the "compromise" library identified 3,074 total names, though many are duplicates of others with additional preceding or succeeding words or OCR errors. While containing a heavy amount of error, the results nonetheless could be used to quickly triage the day's names. The final results can be seen in the spreadsheet below.

Download The Full Spreadsheet.

TECHNICAL DETAILS

Following Felipe's example, the following query was used (in this case a temporary function was used to allow testing of different configurations and features:

CREATE TEMPORARY FUNCTION nlp_compromise_extract_people(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
return nlp(str).people().out('topk').map(x=>x.normal)
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

select name, count(1) cnt from (
SELECT nlp_compromise_extract_people(OCRText) names FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) = "2020-04-18" and station='CNN'
), unnest(names) name group by name having cnt>=3 order by cnt desc

The GDELT Project

Using BigQuery UDFs To Perform On-The-Fly NLP On OCR'd Chyrons From The Cloud Video API

TECHNICAL DETAILS

Archives