An Updated Pilot Dataset Of Trump Tweets That Appeared On Television News 2020-2021

Last October we released a pilot dataset of the @realDonaldTrump tweets that appeared on television news January 1, 2020 through September 4, 2020, using OCR to scan the onscreen text of CNN, MSNBC and Fox News to identify every second of airtime in which the string "@realDonaldTrump" appeared somewhere onscreen. The majority of these are onscreen displays of his tweets, though they also include campaign signs with his social media handle and other invocations of his handle as shorthand for the president himself.

Using the VGEG 2.0 dataset, a single SQL query retrieves all 68,151 seconds of airtime across CNN, MSNBC and Fox News from September 4, 2020 through January 19, 2021 that contain his Twitter handle:

SELECT date, station, showName, iaShowId, iaClipUrl, OCRText
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
WHERE DATE(date) >= "2020-9-4"
  AND (LOWER(OCRText) LIKE '%realdonald%' OR LOWER(OCRText) LIKE '%donald j. trump retweeted%')
  AND (station='CNN' OR station='MSNBC' OR station='FOXNEWS')
ORDER BY date ASC
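To make the filter concrete, here is a rough Python sketch of what the text and station conditions in the WHERE clause check for each row. The function name `matches_query` is ours, purely for illustration, and is not part of the dataset or of BigQuery. The date condition is left out for brevity, and the sketch assumes station codes are compared exactly as written in the query.

```python
# Station codes exactly as they appear in the query's WHERE clause.
STATIONS = {"CNN", "MSNBC", "FOXNEWS"}

# Substrings the query searches for in the lowercased OCR text.
NEEDLES = ("realdonald", "donald j. trump retweeted")


def matches_query(ocr_text: str, station: str) -> bool:
    """Return True if a row would pass the text and station filters.

    Hypothetical helper: mirrors LOWER(OCRText) LIKE '%...%' as a plain
    substring test on the lowercased text. The date filter is not modeled.
    """
    text = ocr_text.lower()
    return station in STATIONS and any(needle in text for needle in NEEDLES)


# Made-up OCR strings for illustration only.
print(matches_query("Donald J. Trump @realDonaldTrump", "CNN"))      # True
print(matches_query("Donald J. Trump Retweeted", "FOXNEWS"))         # True
print(matches_query("Donald J. Trump @realDonaldTrump", "BBCNEWS"))  # False
```

Because the match is a simple substring test, it also catches campaign signs and other onscreen uses of the handle, which is why not every matching second is a displayed tweet.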

You can download the complete dataset in CSV format:

For those who want the complete January 1, 2020 to present dataset, which also includes BBC News London and the ABC, CBS and NBC evening news broadcasts, we have a second dataset of 189,361 seconds of airtime produced by the following query:

SELECT date, station, showName, iaShowId, iaClipUrl, OCRText
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
WHERE DATE(date) >= "2020-1-1"
  AND (LOWER(OCRText) LIKE '%realdonald%' OR LOWER(OCRText) LIKE '%donald j. trump retweeted%')
ORDER BY date ASC

You can download the complete dataset in CSV format:
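Since each row represents one second of airtime, a first look at a downloaded CSV can be as simple as counting rows per station. The sketch below is only a starting point: it assumes the CSV header uses the same column names as the SELECT list in the query, and the sample rows (dates, show names, IDs, URLs and OCR text) are made-up placeholders rather than real records.

```python
import csv
import io
from collections import Counter


def seconds_per_station(csv_file):
    """Count rows (one row = one second of airtime) for each station."""
    reader = csv.DictReader(csv_file)
    return Counter(row["station"] for row in reader)


# Placeholder data; the header is assumed to mirror the query's columns.
SAMPLE = """date,station,showName,iaShowId,iaClipUrl,OCRText
2020-09-04 00:00:01 UTC,CNN,Example Show,EXAMPLE_ID_1,https://example.org/1,@realDonaldTrump sample
2020-09-04 00:00:02 UTC,CNN,Example Show,EXAMPLE_ID_1,https://example.org/1,@realDonaldTrump sample
2020-09-04 00:05:10 UTC,FOXNEWS,Another Show,EXAMPLE_ID_2,https://example.org/2,@realDonaldTrump sample
"""

counts = seconds_per_station(io.StringIO(SAMPLE))
for station, seconds in sorted(counts.items()):
    print(station, seconds)
```

For a real file, replace the `io.StringIO(SAMPLE)` argument with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV.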

The datasets above list every second of airtime in which his Twitter handle was visible onscreen, but they don't attempt to connect those appearances back to the tweet IDs of the underlying tweets. For those who need to know which specific tweet each second of airtime is displaying, we repeated the workflow from the October 2020 dataset, using similarity matching to find the best match from the Trump Twitter Archive for each second of airtime in which @realDonaldTrump is visible across CNN, MSNBC and Fox News, from where the last dataset left off (September 4, 2020) through January 19, 2021.
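As a rough illustration of the idea of similarity matching, the sketch below scores each candidate tweet by the share of its words that also appear in the OCR text of one second of airtime, and keeps the best-scoring tweet if it clears a threshold. This is not the script used for the dataset: the word-overlap measure, the `best_match` and `words` helpers, the 0.8 threshold, and the sample tweets and IDs are all assumptions made up for this example.

```python
import re


def words(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def best_match(ocr_text, tweets, threshold=0.8):
    """Return (tweet_id, score) for the best-scoring tweet, or (None, score)
    when no tweet reaches the threshold.

    tweets maps tweet ID -> tweet text. The score is the share of a tweet's
    words that also appear in the OCR text (an assumed, simple measure).
    """
    ocr_words = words(ocr_text)
    best_id, best_score = None, 0.0
    for tweet_id, tweet_text in tweets.items():
        tweet_words = words(tweet_text)
        if not tweet_words:
            continue  # nothing to compare against
        score = len(tweet_words & ocr_words) / len(tweet_words)
        if score > best_score:
            best_id, best_score = tweet_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Made-up placeholder tweets and OCR text, not real data.
tweets = {
    "1001": "placeholder tweet about the economy doing great",
    "1002": "placeholder tweet about a rally tonight",
}
screen = "BREAKING NEWS @realDonaldTrump placeholder tweet about the economy doing great 3:15 PM"

print(best_match(screen, tweets))                         # ('1001', 1.0)
print(best_match("TRUMP 2020 @realDonaldTrump", tweets))  # (None, 0.0)
```

Scoring against the tweet's own words, rather than the whole screen, keeps chyrons and timestamps around the tweet from dragging the score down. The second call shows how a bare mention of the handle ends up with no matching tweet.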

Because the new Trump Twitter Archive complete dataset uses a different format, we adjusted the tweet-loading step of our previous script and created a new version of it:

You will also need to download the television dataset in JSON format (rather than the CSV version above):

To run it, just download the complete tweet dataset and the list of matching television airtime seconds and run the script over them!

Of the 68,151 seconds of airtime in which @realDonaldTrump appeared across CNN, MSNBC and Fox News from September 4, 2020 through January 19, 2021, the script matched 35,585 to the actual tweet sent, while 30,999 seconds could not be connected to a distinct tweet, though many of these are mentions of his account rather than tweets he himself sent.

You can download the "matches" and "nomatches" files (see the original post for details):

We're tremendously excited to see the kinds of research questions this dataset enables!