What would it look like to apply this approach to mine a day's worth of onscreen text from CNN, OCR'd by the Cloud Video AI API, to identify all of the people mentioned? OCR errors, television news shorthand, and inconsistent capitalization mean the results will be less accurate than on properly capitalized traditional news content, but the exercise nonetheless offers a look at the different ways we can construct ad hoc analytic pipelines today.
In all, the "compromise" library identified 3,074 total names, though many are duplicates of one another, differing only by additional preceding or trailing words or by OCR errors. Despite the considerable error rate, the results could still be used to quickly triage the day's names. The final results can be seen in the spreadsheet below.
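One way those near-duplicate names could be collapsed in post-processing is to merge any extracted name that appears as a word-for-word subsequence of a longer one (e.g. a bare surname-plus-given-name absorbed into a version with a title prefixed). The sketch below is a hypothetical illustration of that idea in plain JavaScript, not part of the original pipeline; the sample names and counts are invented for demonstration.

```javascript
// Hypothetical post-processing sketch (not from the original pipeline):
// collapse names that occur inside longer extracted names, summing counts.
function collapseNames(counts) {
  // Process longer names first so shorter variants merge into them.
  const names = [...counts.keys()].sort((a, b) => b.length - a.length);
  const merged = new Map();
  for (const name of names) {
    // Look for an already-kept longer name containing this one
    // as a whole-word sequence (padded with spaces to avoid
    // matching inside a word).
    const parent = [...merged.keys()].find(
      (k) => (' ' + k + ' ').includes(' ' + name + ' ')
    );
    if (parent) {
      merged.set(parent, merged.get(parent) + counts.get(name));
    } else {
      merged.set(name, counts.get(name));
    }
  }
  return merged;
}

// Invented example counts for illustration only.
const raw = new Map([
  ['anthony fauci', 40],
  ['dr anthony fauci', 12],
  ['andrew cuomo', 25],
]);
const result = collapseNames(raw);
// 'anthony fauci' merges into 'dr anthony fauci'
```

A greedy longest-first merge like this handles the "extra preceding or trailing words" case, though it would not repair OCR misspellings; those would need a fuzzy-matching pass instead.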
Following Felipe's example, the following query was used (in this case a temporary function was used to allow testing of different configurations and features):
CREATE TEMPORARY FUNCTION nlp_compromise_extract_people(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
  return nlp(str).people().out('topk').map(x => x.normal)
'''
OPTIONS (library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

SELECT name, COUNT(1) cnt
FROM (
  SELECT nlp_compromise_extract_people(OCRText) names
  FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
  WHERE DATE(date) = "2020-04-18" AND station = 'CNN'
), UNNEST(names) name
GROUP BY name
HAVING cnt >= 3
ORDER BY cnt DESC