Yesterday we showed how a single SQL query in BigQuery could be used to compile a list of all of the Dr.'s mentioned in the OCR'd text of CNN since we started processing it on January 25th. But what if you want this list by day, to see which doctors were mentioned on which days? Or to expand this to all seven stations that make up the the Visual Global Entity Graph 2.0 (ABC/CBS/NBC evening news from July 2010 to present, CNN from Jan. 25, 2020 to present, and BBC News, MSNBC, Fox News from April 2, 2020 to present)?
With just two quick changes to our SQL from yesterday we can compile a by-day/by-station Dr. list!
NOTE that the results for April 21-24 are partial since not all shows have finished processing for those days and that the results for BBC News, CNN, MSNBC and Fox News each begin at the dates listed above.
The final query is:
SELECT DATE(date) day, station, REGEXP_EXTRACT( OCRText, r'Dr. ([A-Za-z]+ [A-Za-z]+)') name, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv` group by station, day, name having length(name) > 3 order by day asc, cnt desc
The final list can be downloaded here.
The query above only returns names in chyrons with title capitalization and treats different capitalizations of the same name differently. What if we modify this to be case-insensitive by lower-casing the text? At the same time, let's drop names that only appear for a single second in a given day. In some cases "dr." may not include the period and may simply be "dr" so let's allow that to be optional as well. We end up with the following query:
SELECT DATE(date) day, station, REGEXP_EXTRACT( LOWER(OCRText), r'\bdr[\.]* ([A-Za-z]+ [A-Za-z]+)') name, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv` group by station, day, name having length(name) > 3 and cnt > 1 order by day asc, cnt desc
The final list can be downloaded here.