Creating Television News OCR NGrams That Exclude Advertising Text

Kalev Leetaru

4 years ago

Using the "video time" edition of the Advertising Inventory Files (AIF) dataset, we can filter the Visual Global Entity Graph 2.0 to either exclude or focus on advertisements when we analyze the OCR'd onscreen text.

For example, we can trivially compute a histogram of the top words appearing in the onscreen text of CNN on May 1, 2021:

SELECT ngram, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) AS ngram
WHERE LENGTH(ngram) > 2 AND DATE(date) = "2021-05-01" AND station='CNN'
GROUP BY ngram
ORDER BY cnt DESC
LIMIT 1000

This yields the following top ten entries:

ngram	cnt
cnn	65136
the	51885
live	49934
new	36198
tomorrow	26624
united	23969
coronavirus	23550
pandemic	22854
via	18582
for	18472
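
To make the tokenization in the query above concrete, here is a minimal local Python sketch of the same steps: lowercase the text, pad every punctuation character with spaces, split on single spaces, and count the tokens longer than two characters. The function names `tokenize` and `top_unigrams` and the sample strings are hypothetical, and the sketch approximates the `\pP` class with Python's `unicodedata` punctuation categories, so treat it as an illustration of the logic rather than a reproduction of BigQuery's exact behavior.

```python
import unicodedata
from collections import Counter


def tokenize(ocr_text):
    """Mimic LOWER + REGEXP_REPLACE(r'(\\pP)', r' \\1 ') + SPLIT(' ')."""
    lowered = ocr_text.lower()
    # Surround each Unicode punctuation character (category P*) with spaces.
    padded = "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in lowered
    )
    # Split only on single spaces, as the SQL does; empty strings can appear
    # here and are dropped later by the length filter.
    return padded.split(" ")


def top_unigrams(ocr_texts, min_length=3, limit=1000):
    """Count tokens of at least min_length characters (length(ngram) > 2)."""
    counts = Counter()
    for text in ocr_texts:
        counts.update(tok for tok in tokenize(text) if len(tok) >= min_length)
    return counts.most_common(limit)


# Hypothetical sample rows standing in for the OCRText column.
sample = ["CNN.com LIVE: Breaking News", "Live: CNN"]
print(top_unigrams(sample, limit=5))
```

Padding punctuation before splitting is what turns ".com" into the separate tokens "." and "com", and the length filter then discards the punctuation marks and empty strings along with one- and two-letter words.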

We can trivially exclude advertising airtime from this computation using:

SELECT ngram, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) AS ngram
WHERE LENGTH(ngram) > 2 AND DATE(date) = "2021-05-01" AND station='CNN'
  AND date NOT IN (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" AND station='CNN' AND type!='NEWS')
GROUP BY ngram
ORDER BY cnt DESC
LIMIT 1000

This doesn't change the results much: "live" now edges ahead of "the," and "story" enters the top ten in place of "for."

ngram	cnt
cnn	62369
live	47975
the	41617
new	34102
tomorrow	25678
coronavirus	22908
united	22789
pandemic	22208
via	18487
story	17169

Alternatively, changing the "date not in" filter to "date in" allows us to examine just the OCR'd text of advertising:

SELECT ngram, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) AS ngram
WHERE LENGTH(ngram) > 2 AND DATE(date) = "2021-05-01" AND station='CNN'
  AND date IN (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" AND station='CNN' AND type!='NEWS')
GROUP BY ngram
ORDER BY cnt DESC
LIMIT 1000

This yields the following top 20 entries, showing the dominance of "com" (as in ".com" URLs), "800" (as in 1-800 numbers), and "free" (as in free shipping, free trial, etc.).

ngram	cnt
the	10268
and	8305
com	7512
for	5925
with	3308
not	2930
cnn	2760
your	2704
800	2502
are	2200
may	2164
new	2096
live	1954
you	1927
all	1904
free	1735
well	1596
can	1431
apply	1390
health	1347
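
The "date not in" and "date in" filters used above amount to partitioning the OCR rows by whether their timestamp appears among the non-news AIF rows. A minimal Python sketch of that partition follows; the function name `split_by_airtime`, the tuple layouts, and the sample rows are hypothetical, and it assumes, as the queries do, that both tables carry matching `date` timestamps.

```python
def split_by_airtime(ocr_rows, aif_rows):
    """Partition OCR text into (news_texts, ad_texts).

    ocr_rows: iterable of (timestamp, ocr_text) pairs.
    aif_rows: iterable of (timestamp, type) pairs.
    """
    # Equivalent of the subquery: timestamps whose type is anything but NEWS.
    ad_times = {ts for ts, kind in aif_rows if kind != "NEWS"}
    news_texts, ad_texts = [], []
    for ts, text in ocr_rows:
        # "date in (...)" keeps the ad rows; "date not in (...)" keeps the rest.
        (ad_texts if ts in ad_times else news_texts).append(text)
    return news_texts, ad_texts


# Hypothetical sample rows; real timestamps and type labels may differ.
ocr = [
    ("2021-05-01 00:00:00", "CNN LIVE"),
    ("2021-05-01 00:00:01", "Call 1-800 today"),
    ("2021-05-01 00:00:02", "CNN NEW"),
]
aif = [
    ("2021-05-01 00:00:00", "NEWS"),
    ("2021-05-01 00:00:01", "AD"),
]
news, ads = split_by_airtime(ocr, aif)
print(news, ads)
```

Note that a timestamp missing from the AIF rows entirely lands on the news side, which mirrors how the "date not in" query treats it.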

We hope this inspires new ways to explore these datasets!