Compiling A Dataset Of On-Air Trump Tweets On Television News In 2020

From January 1, 2020 through August 17, 2020, there were 97,278 total seconds of airtime in which the text "@realDonaldTrump" (or its common OCR error variant "@realDonald") appeared somewhere in the onscreen text on the 24/7 programming of BBC News London, CNN, MSNBC and Fox News and the evening news broadcasts of ABC, CBS and NBC. Most of these are onscreen displays of the president's tweets, though they can also be mentions of him in others' tweets, appearances of his social media handle on the backdrop of campaign events or citations of his handle when shows display video footage or imagery from his accounts.

Though OCR errors and other onscreen text are mixed in with the actual tweet text and require some filtering, one could take this collection of potential onscreen tweet displays and join it against a collection of Trump's tweets to count how often each of his tweets appeared on television news. Unlike "impressions" and other traditional social media metrics, this dataset makes it possible to rank Trump's tweets by their television news airtime.
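As a rough illustration of such a join, the sketch below matches each second of OCR text to the tweet whose words it covers best and counts the matched seconds per tweet. The function names, the word-overlap score and the 0.6 threshold are all assumptions made for this example, not part of the dataset or the API, and real OCR noise may call for a more forgiving matcher.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_tweet(ocr_text, tweets, min_overlap=0.6):
    """Return the id of the tweet best covered by the OCR text, or None.

    The score is the fraction of the tweet's distinct words that also
    appear in the OCR text (a hypothetical, deliberately simple measure).
    """
    ocr_words = set(normalize(ocr_text))
    best_id, best_score = None, 0.0
    for tweet_id, tweet_text in tweets.items():
        words = set(normalize(tweet_text))
        if not words:
            continue
        score = len(words & ocr_words) / len(words)
        if score > best_score:
            best_id, best_score = tweet_id, score
    return best_id if best_score >= min_overlap else None

def airtime_by_tweet(ocr_rows, tweets, min_overlap=0.6):
    """Count matched rows per tweet, assuming one row per second of airtime."""
    counts = Counter()
    for ocr_text in ocr_rows:
        tweet_id = best_tweet(ocr_text, tweets, min_overlap)
        if tweet_id is not None:
            counts[tweet_id] += 1
    return counts
```

Calling airtime_by_tweet(rows, tweets).most_common() would then give the tweets ranked by matched seconds, where "rows" is a list of OCR strings and "tweets" maps a tweet id to its text.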

You can download the complete set of seconds of airtime in CSV format:

The fields are "iaClipUrl,date,station,show,iaShowId,iaThumbnailUrl,ocr,asr,caption,captionnlp,visualEntities", direct from the TV AI 2.0 API.
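For a first look at the file, a short Python sketch along these lines can tally the seconds of airtime per station. It assumes each row is one second of airtime in the field order listed above. Whether the file begins with a header row is not stated here, so the sketch skips one if it finds it. The function name is hypothetical.

```python
import csv
from collections import Counter

# Field order as listed in the post.
FIELDS = ["iaClipUrl", "date", "station", "show", "iaShowId", "iaThumbnailUrl",
          "ocr", "asr", "caption", "captionnlp", "visualEntities"]

def seconds_per_station(csv_file):
    """Count rows (assumed to be one second of airtime each) per station."""
    counts = Counter()
    for row in csv.DictReader(csv_file, fieldnames=FIELDS):
        # Skip a header row if the file happens to include one.
        if row["iaClipUrl"] == "iaClipUrl":
            continue
        counts[row["station"]] += 1
    return counts
```

The function takes any open text file, for example one opened with open(path, newline="", encoding="utf-8"), where the UTF-8 encoding is also an assumption.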

TECHNICAL DETAILS

To compile this dataset, we first used the OCR field of the AI Television Explorer to search for ("realdonald" OR "realdonaldtrump"). We then scrolled down to the "Top Clips" section and clicked on the "URL" link at the top right. This shows the URL used:

Then, copy the entire contents of the "query" parameter, which in this case is:

%20(ocr:"REALDONALDTRUMP"%20OR%20ocr:"REALDONALD")%20%20(station:KGO%20OR%20station:KPIX%20OR%20station:KNTV%20OR%20station:CNN%20OR%20station:MSNBC%20OR%20station:FOXNEWS%20OR%20station:BBCNEWS%20)%20

Escape all of the quote marks with backslashes and surround the whole string in quote marks to get:

"%20(ocr:\"REALDONALDTRUMP\"%20OR%20ocr:\"REALDONALD\")%20%20(station:KGO%20OR%20station:KPIX%20OR%20station:KNTV%20OR%20station:CNN%20OR%20station:MSNBC%20OR%20station:FOXNEWS%20OR%20station:BBCNEWS%20)%20"

Then download the "downloadsearchresults_tvai.pl" Perl script and pass it four parameters: the start date in YYYYMMDD format, the end date in YYYYMMDD format, the quoted, escaped query above and the output file (the script requires Linux or another system that provides "tail"):

time ./downloadsearchresults_tvai.pl 20200101 20200817 "%20(ocr:\"REALDONALDTRUMP\"%20OR%20ocr:\"REALDONALD\")%20%20(station:KGO%20OR%20station:KPIX%20OR%20station:KNTV%20OR%20station:CNN%20OR%20station:MSNBC%20OR%20station:FOXNEWS%20OR%20station:BBCNEWS%20)%20" ./RESULTS-TRUMPTWEETS-20200101-20200817.CSV
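The quoting step and the command line above can also be generated rather than typed by hand. The following is a minimal sketch: the helper names are hypothetical, and the only thing taken from the post is the argument order (start date, end date, quoted query, output file).

```python
def quote_query(raw_query):
    """Backslash-escape every double quote and wrap the result in double quotes."""
    return '"' + raw_query.replace('"', '\\"') + '"'

def build_command(start, end, raw_query, out_path,
                  script="./downloadsearchresults_tvai.pl"):
    """Build the shell command line in the argument order the post describes."""
    return " ".join([script, start, end, quote_query(raw_query), out_path])
```

For example, build_command("20200101", "20200817", raw, "./OUT.CSV") returns a string that can be pasted into a shell, where "raw" is the "query" parameter exactly as copied from the URL. If the script is launched from Python with subprocess.run and an argument list instead, no shell is involved and the raw query can be passed without any escaping.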

The script works by first requesting the timeline of matches for your query. It then iterates over the results timeline, grouping together blocks of days that return fewer than 5,000 results to minimize the number of requests to the API. It then downloads the results one block at a time and concatenates them into a final output file.
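The grouping step might look roughly like the greedy sketch below. It is an illustration written for this post rather than the Perl script's actual code. The function name and the input format (a chronological list of day and match-count pairs) are assumptions, and a single day that alone reaches the limit is simply left as its own block.

```python
def group_days(daily_counts, limit=5000):
    """Greedily pack consecutive days into blocks totaling fewer than `limit`.

    daily_counts: chronological list of (day, count) pairs.
    Returns a list of (first_day, last_day, total) tuples.
    """
    blocks = []
    start = end = None
    total = 0
    for day, count in daily_counts:
        # Close the current block if adding this day would reach the limit.
        if start is not None and total + count >= limit:
            blocks.append((start, end, total))
            start, total = None, 0
        if start is None:
            start = day
        end = day
        total += count
    if start is not None:
        blocks.append((start, end, total))
    return blocks
```

Each returned block then corresponds to one request to the API covering the days from first_day through last_day.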

You can use this script and workflow to compile all sorts of extracts from the TV AI API!