Using BigQuery + Regexes On OCR'd Chyrons From The Cloud Video API To Compile A List Of "Dr."s On TV News

Between an hour and a third of the airtime on CNN each day since Feb. 26th has mentioned "Dr." somewhere in the onscreen text. Many of these mentions are in the chyron text in the lower third of the screen listing the name, title and affiliation of the person speaking, whether an interviewee guest or the current speaker at a press conference. Who are all of these people being interviewed?

With a single regular expression in BigQuery, it is trivial to extract the two-word phrase that follows "Dr." in the onscreen OCR text. REGEXP_EXTRACT returns only the first match in each row, and the pattern truncates names longer than two words, but it should extract most names accurately.

The final query becomes the simple one-liner:

SELECT REGEXP_EXTRACT(OCRText, r'Dr. ([A-Za-z]+ [A-Za-z]+)') AS name, COUNT(1) AS cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= "2020-01-25" AND station = 'CNN' GROUP BY name HAVING LENGTH(name) > 3 ORDER BY cnt DESC

This takes just 2.1 seconds to extract 967 distinct names, with the top hits being CNN's own Sanjay Gupta, followed by Anthony Fauci, Deborah Birx, Celine Gounder, Leana Wen, Peter Hotez, Robert Redfield and so on.

The complete list of all "Dr. X Y" names from January 25 through April 21 generated by the query above can be downloaded here.
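To see what the two-word pattern does before running it over the full dataset, it can help to try it locally on a few strings. The following is a minimal Python sketch, not the BigQuery query itself: the sample chyron strings and the helper names first_name_after_dr and count_names are hypothetical, and the period after "Dr" is escaped here so that it matches only a literal period (in the query as written, the unescaped "." matches any character, so text such as "Dry Creek Fire" would also produce a hit).

```python
import re
from collections import Counter

# Same shape as the pattern in the query, but with the period escaped
# (an assumption of this sketch; the query above leaves it unescaped).
NAME_PATTERN = re.compile(r"Dr\. ([A-Za-z]+ [A-Za-z]+)")


def first_name_after_dr(ocr_text):
    """Return the first two-word phrase following "Dr. ", or None.

    Mirrors REGEXP_EXTRACT, which returns the capture group of the
    first match in the string and NULL when nothing matches.
    """
    match = NAME_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def count_names(rows):
    """Count extracted names, most frequent first (GROUP BY / ORDER BY)."""
    counts = Counter()
    for row in rows:
        name = first_name_after_dr(row)
        # Equivalent of HAVING LENGTH(name) > 3; unmatched rows are skipped.
        if name is not None and len(name) > 3:
            counts[name] += 1
    return counts.most_common()


# Hypothetical OCR strings, made up for illustration only.
sample_rows = [
    "Dr. Sanjay Gupta CNN Chief Medical Correspondent",
    "Dr. Anthony Fauci Director, NIAID",
    "Dr. Sanjay Gupta",
    "Dr. Ana Maria Lopez",    # three-word name is truncated to "Ana Maria"
    "Dry Creek Fire Update",  # no literal "Dr." so no match
    "BREAKING NEWS",
]

if __name__ == "__main__":
    for name, cnt in count_names(sample_rows):
        print(name, cnt)
```

Escaping the period is a design choice for this sketch. Applying the same change to the query could alter the counts reported above, so treat it as a variation to experiment with rather than a drop-in replacement.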

What if we want more context about each name, such as their title and affiliation?

We can ask for all of the text following "Dr." up to the first digit. Since CNN typically runs the current time to the right of the chyron text, this works reasonably well, though it is far from perfect and generates a high false positive rate. In practice, this query is most useful for looking up a name from the list above, such as searching for "Darria Long" to find that she is "Darria Long Emergency Physician, Univ. of Tennessee Health System."

The final query becomes:

SELECT REGEXP_EXTRACT(OCRText, r'Dr. ([A-Za-z]+ [^0-9]+)') AS name, COUNT(1) AS cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= "2020-01-25" AND station = 'CNN' GROUP BY name HAVING cnt > 1 ORDER BY cnt DESC

The complete list of 10,759 entries can be downloaded here.
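The "everything up to the first digit" behavior is easy to check locally as well. This is again a hedged Python sketch rather than the query itself: the sample strings (including the clock time appended to the Darria Long chyron) are hypothetical, context_after_dr is an illustrative helper name, the period is escaped, and the trailing whitespace is stripped for readability, which the query above does not do.

```python
import re

# One word of letters after "Dr. ", then everything up to the first digit.
# The escaped period is an assumption of this sketch.
CONTEXT_PATTERN = re.compile(r"Dr\. ([A-Za-z]+ [^0-9]+)")


def context_after_dr(ocr_text):
    """Return the text after "Dr. " up to the first digit, or None."""
    match = CONTEXT_PATTERN.search(ocr_text)
    # strip() is a convenience added here; the query keeps trailing spaces.
    return match.group(1).strip() if match else None


# Hypothetical OCR strings; the "10:32 AM ET" clock text is invented to
# stand in for the time CNN shows to the right of the chyron.
chyron = ("Dr. Darria Long Emergency Physician, "
          "Univ. of Tennessee Health System 10:32 AM ET")
headline = "Dr. Fauci says testing must expand"

if __name__ == "__main__":
    # The capture stops just before the "1" of the clock time.
    print(context_after_dr(chyron))
    # With no digit to stop at, the whole headline is captured,
    # which is one source of the false positives mentioned above.
    print(context_after_dr(headline))
```

The second string shows why the output is noisy: whenever "Dr." appears in a headline rather than a name line, the capture runs on until a digit happens to appear.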

Hopefully this gives you some ideas of the powerful ways in which even simple regular expressions can be used to better understand who is telling the story on television news.