The GDELT Project

Creating Television News OCR NGrams That Exclude Advertising Text

Using the "video time" edition of the Advertising Inventory Files (AIF) dataset, we can filter the Visual Global Entity Graph 2.0 to either exclude or focus on advertisements when we analyze the OCR'd onscreen text.

For example, we can trivially compute a histogram of the top words appearing in the onscreen text of CNN on May 1, 2021:

SELECT ngram, count(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram
WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN'
group by ngram
order by cnt desc
limit 1000

This yields the following top ten entries:

ngram cnt
cnn 65136
the 51885
live 49934
new 36198
tomorrow 26624
united 23969
coronavirus 23550
pandemic 22854
via 18582
for 18472
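
To make the tokenization in the query above easier to reason about, here is a minimal local sketch of the same steps in Python: lowercase the text, pad each punctuation character with spaces, split on single spaces, and keep tokens longer than two characters. It is an approximation rather than the query itself: Python's built-in `re` module has no `\pP` class, so the sketch checks the Unicode category instead, and the function names and sample string are hypothetical.

```python
import unicodedata
from collections import Counter


def tokenize(ocr_text):
    """Approximate the query's tokenization for one OCRText value."""
    lowered = ocr_text.lower()
    # Stand-in for REGEXP_REPLACE(..., r'(\pP)', r' \1 '): surround every
    # Unicode punctuation character (category P*) with spaces.
    padded = "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in lowered
    )
    # SPLIT(..., ' ') splits on single spaces only; the length filter then
    # drops empty strings, lone punctuation marks, and very short tokens.
    return [tok for tok in padded.split(" ") if len(tok) > 2]


def top_words(texts, limit=1000):
    """Count unigrams across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(limit)


# Hypothetical onscreen text, not taken from the dataset.
sample = ["Visit CNN.com or call 1-800-555-0199 for FREE shipping!"]
print(top_words(sample, limit=5))
```

Splitting on punctuation is why fragments such as "com" and "800" can surface as standalone words in the histograms.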

We can trivially exclude advertising airtime from this computation using:

SELECT ngram, count(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram
WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN'
  and date not in (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" and station='CNN' and type!='NEWS')
group by ngram
order by cnt desc
limit 1000

This doesn't change the results too much, though "story" now appears in the top ten in place of "for," and "live" moves ahead of "the."

ngram cnt
cnn 62369
live 47975
the 41617
new 34102
tomorrow 25678
coronavirus 22908
united 22789
pandemic 22208
via 18487
story 17169
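
The filtering logic above can be sketched locally as a set-membership test on timestamps. This is a simplified illustration, not the BigQuery execution: the toy rows, the "AD" type label, and the `word_counts` helper are all hypothetical, and the tokenization is reduced to a plain split on spaces.

```python
from collections import Counter

# Hypothetical toy rows standing in for the two tables.
ocr_rows = [  # (timestamp, onscreen text)
    ("2021-05-01 10:00:00", "breaking news live"),
    ("2021-05-01 10:00:01", "call 800 now free shipping"),
    ("2021-05-01 10:00:02", "live coronavirus pandemic"),
]
aif_rows = [  # (timestamp, type); "AD" is an assumed label for non-news airtime
    ("2021-05-01 10:00:00", "NEWS"),
    ("2021-05-01 10:00:01", "AD"),
    ("2021-05-01 10:00:02", "NEWS"),
]


def word_counts(ocr_rows, aif_rows, keep_ads=False):
    """Count words from non-advertising rows, or from advertising rows only."""
    # Equivalent of the subquery: every timestamp whose type is not NEWS.
    non_news = {ts for ts, kind in aif_rows if kind != "NEWS"}
    counts = Counter()
    for ts, text in ocr_rows:
        # keep_ads=False mirrors "date not in"; keep_ads=True mirrors "date in".
        if (ts in non_news) == keep_ads:
            counts.update(w for w in text.lower().split(" ") if len(w) > 2)
    return counts


print(word_counts(ocr_rows, aif_rows).most_common(3))
print(word_counts(ocr_rows, aif_rows, keep_ads=True).most_common(3))
```

In this sketch, a timestamp that has no entry in the inventory rows is treated as non-advertising, because only timestamps explicitly marked as something other than NEWS are excluded.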

Alternatively, changing the "date not in" filter to "date in" allows us to examine just the OCR'd text of advertising:

SELECT ngram, count(1) cnt
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram
WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN'
  and date in (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" and station='CNN' and type!='NEWS')
group by ngram
order by cnt desc
limit 1000

This yields the following top 20 entries, showing the prominence of "com" (as in ".com" URLs), "800" (as in 1-800 numbers), and "free" (as in free shipping, free trial, etc.).

ngram cnt
the 10268
and 8305
com 7512
for 5925
with 3308
not 2930
cnn 2760
your 2704
800 2502
are 2200
may 2164
new 2096
live 1954
you 1927
all 1904
free 1735
well 1596
can 1431
apply 1390
health 1347

We hope this inspires new ways for you to explore these datasets!