Using the "video time" edition of the Advertising Inventory Files (AIF) dataset, we can filter the Visual Global Entity Graph 2.0 to either exclude or focus on advertisements when we analyze the OCR'd onscreen text.
For example, we can trivially compute a histogram of the top words appearing in the onscreen text of CNN on May 1, 2021:
SELECT ngram, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN' group by ngram order by cnt desc limit 1000
This yields the following top ten entries:
ngram | cnt |
cnn | 65136 |
the | 51885 |
live | 49934 |
new | 36198 |
tomorrow | 26624 |
united | 23969 |
coronavirus | 23550 |
pandemic | 22854 |
via | 18582 |
for | 18472 |
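To make the tokenization in the query above easier to follow, here is a minimal local sketch in Python of what it appears to do: lowercase the OCR text, pad each punctuation character with spaces, split on single spaces, and keep only tokens longer than two characters. The function names and sample strings are hypothetical, and Python's Unicode punctuation categories are only assumed to approximate BigQuery's `\pP` class.

```python
import unicodedata
from collections import Counter


def pad_punctuation(text):
    # Rough equivalent of REGEXP_REPLACE(text, r'(\pP)', r' \1 '):
    # wrap every Unicode punctuation character in spaces.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def unigram_counts(ocr_texts):
    counts = Counter()
    for text in ocr_texts:
        # LOWER(...), pad punctuation, then SPLIT(..., ' ').
        tokens = pad_punctuation(text.lower()).split(" ")
        # length(ngram) > 2 drops punctuation, empty strings and short words.
        counts.update(tok for tok in tokens if len(tok) > 2)
    return counts


# Hypothetical OCR snippets, not taken from the dataset.
sample = ["CNN.com LIVE: New cases", "Call 1-800-555 for a FREE trial"]
counts = unigram_counts(sample)
for ngram, cnt in counts.most_common(5):
    print(ngram, cnt)
```

Padding punctuation before splitting is what lets "com" and "800" surface as standalone tokens from strings such as "CNN.com" and "1-800-555".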
We can trivially exclude advertising airtime from this computation using:
SELECT ngram, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN' and date not in (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" and station='CNN' and type!='NEWS') group by ngram order by cnt desc limit 1000
This changes the results only slightly: "live" moves ahead of "the", "coronavirus" moves ahead of "united", and "story" replaces "for" in the top ten.
ngram | cnt |
cnn | 62369 |
live | 47975 |
the | 41617 |
new | 34102 |
tomorrow | 25678 |
coronavirus | 22908 |
united | 22789 |
pandemic | 22208 |
via | 18487 |
story | 17169 |
Alternatively, changing the "date not in" filter to "date in" allows us to examine just the OCR'd text of advertising:
SELECT ngram, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram WHERE length(ngram) > 2 and DATE(date) = "2021-05-01" and station='CNN' and date in (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) = "2021-05-01" and station='CNN' and type!='NEWS') group by ngram order by cnt desc limit 1000
This yields the following top 20 entries, showing the prominence of "com" (as in ".com" URLs), "800" (as in 1-800 numbers), and "free" (as in free shipping, free trial, etc.).
ngram | cnt |
the | 10268 |
and | 8305 |
com | 7512 |
for | 5925 |
with | 3308 |
not | 2930 |
cnn | 2760 |
your | 2704 |
800 | 2502 |
are | 2200 |
may | 2164 |
new | 2096 |
live | 1954 |
you | 1927 |
all | 1904 |
free | 1735 |
well | 1596 |
can | 1431 |
apply | 1390 |
health | 1347 |
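The "date not in" and "date in" filters used in the queries above amount to a set-membership test on timestamps. The following is a minimal Python sketch of that logic under stated assumptions: the row tuples, the function name, and the "AD" type label are hypothetical, and the only thing taken from the queries is that any inventory row whose type is not 'NEWS' is treated as advertising.

```python
def split_by_ad_airtime(ocr_rows, aif_rows):
    """Split OCR text into (news_text, ad_text) by timestamp membership.

    ocr_rows: iterable of (timestamp, ocr_text) pairs
    aif_rows: iterable of (timestamp, type) pairs from the ad inventory
    """
    # Mirrors the subquery: timestamps whose inventory type != 'NEWS'.
    ad_times = {ts for ts, kind in aif_rows if kind != "NEWS"}

    news_text, ad_text = [], []
    for ts, text in ocr_rows:
        if ts in ad_times:
            ad_text.append(text)    # "date in (...)"
        else:
            news_text.append(text)  # "date not in (...)"
    return news_text, ad_text


# Hypothetical rows; "AD" is an assumed label, not a documented value.
ocr_rows = [
    ("2021-05-01 00:00:01", "BREAKING NEWS"),
    ("2021-05-01 00:00:02", "CALL 1-800 NOW"),
    ("2021-05-01 00:00:03", "CNN LIVE"),
]
aif_rows = [
    ("2021-05-01 00:00:01", "NEWS"),
    ("2021-05-01 00:00:02", "AD"),
]
news_text, ad_text = split_by_ad_airtime(ocr_rows, aif_rows)
```

Note that, as in the SQL, a timestamp with no matching inventory row falls on the non-advertising side of the split.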
We hope this inspires new ways for you to explore these datasets!