The GDELT Project

A Pilot Dataset Of Trump & Biden Tweets That Appeared On Television News In 2020

Twitter has become a primary communications medium for world leaders, with US President Donald Trump famously relying on the platform to issue a daily stream of statements and policy prescriptions. These tweets can drive the news cycle, being republished and commented on endlessly on television news. Last month we showed how we can visualize the almost hourly appearances of Trump’s tweets on a selection of national television news channels over the course of this year. What if we could go a step further and tie each onscreen appearance of a tweet back to its tweet ID, allowing researchers to understand which specific tweets got the most attention on television news?

What would it look like to scan television news using computer vision, identify every onscreen appearance of a tweet and connect it back to the ID of that tweet? Understanding the interplay of social and mainstream media would open the door to a wide array of new research questions about how social media drives mainstream coverage and can even short-circuit traditional editorial processes like agenda setting. Increasingly, official government announcements, including public health policies, are made by tweet, with mainstream media republishing those tweets as government policy statements. How are these statements being covered on television news? When social media platforms flag a post as disputed, does that decrease coverage, or does it backfire and lead to more news attention than would otherwise have been expected? These are just some of the questions that become explorable once we connect social and television media.

How might we tractably search television news for tweets? Given that tweets are often displayed alongside the user's Twitter handle, what if we simply searched the OCR'd onscreen text of each second of airtime for Twitter handles and then searched that text against the tweets made by that user?

With the support of the Media-Data Research Consortium’s (M-DRC) Google Cloud COVID-19 Research Grant “Quantifying the COVID-19 Public Health Media Narrative Through TV & Radio News Analysis,” we have used Google’s Cloud Video AI API to non-consumptively analyze the entirety of television news programming since Jan. 1, 2020 on BBC News London, CNN, MSNBC and Fox News, including using OCR to transcribe all of their onscreen text second by second. The complete open dataset of these non-consumptive machine annotations, along with annotations for ABC, CBS and NBC evening news broadcasts since 2010, is available as the Visual Global Entity Graph 2.0 (VGEG).

To explore the idea of scanning television news for tweets, we conducted a pilot analysis involving tweets by Donald Trump and Joe Biden, given their outsized potential influence in setting the news agenda, especially around public health issues like COVID-19. From the VGEG 2.0 dataset we extracted the onscreen text of CNN, MSNBC and Fox News from Jan. 1 of this year through midday Sept. 4 and searched it for all mentions of the two candidates' Twitter handles. Using an archive of Donald Trump's tweets since 2015 from the Trump Twitter Archive website, we used an automated approach to link each onscreen tweet appearance with the actual tweet in question, while for the much smaller number of Biden tweets we connected them manually.

The end result is a second-by-second chronology of airtime across CNN, MSNBC and Fox News this year displaying tweets from either candidate, with the onscreen appearance connected back to the actual tweet!

Some important caveats to this dataset:

You can download the datasets here:

With the caveats above in mind, we're tremendously excited to see what researchers are able to do with this pioneering new dataset. Most importantly, we hope this small experiment inspires a new way of thinking about the social-mainstream divide and leads to further research in automated scanning of television news for onscreen display of social media posts.

TECHNICAL DETAILS

Using the following query, we selected all 109,652 seconds of airtime as of midday Sept. 4 in which President Trump’s Twitter handle was displayed on screen:

SELECT date, station, showName, iaShowId, iaClipUrl, OCRText
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
WHERE DATE(date) >= "2020-01-01"
  AND (LOWER(OCRText) LIKE '%realdonald%'
       OR LOWER(OCRText) LIKE '%donald j. trump retweeted%')
ORDER BY date ASC

His full social media handle “@realDonaldTrump” is shortened to “realdonald” here to match a common OCR error that splits the handle into two words, “@realDonald Trump”. Retweets by the president are typically displayed by Twitter as “Donald J. Trump retweeted” and thus are separately included in the search above.
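As a minimal sketch of what the two LIKE conditions in the query above accept, the same test can be written as plain substring checks in Python. The function name mentions_trump_handle and the sample OCR strings are hypothetical illustrations and are not part of the original pipeline.

```python
def mentions_trump_handle(ocr_text: str) -> bool:
    """Mirror the query's two case-insensitive substring tests.

    LIKE '%x%' with no other wildcards in x is a plain substring match,
    so lowercasing and using `in` is an equivalent check for these patterns.
    """
    text = ocr_text.lower()
    return "realdonald" in text or "donald j. trump retweeted" in text


# Hypothetical OCR strings, invented for illustration only.
samples = [
    "@realDonaldTrump",            # intact handle
    "@realDonald Trump",           # handle split in two by OCR
    "Donald J. Trump Retweeted",   # retweet banner text
    "PRESIDENT TRUMP SPEAKS",      # no handle on screen
]

for sample in samples:
    print(repr(sample), mentions_trump_handle(sample))
```

Matching on the shortened string is deliberately permissive: it accepts both the intact and the split handle, at the cost of also accepting any other onscreen text that happens to contain "realdonald".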

We then downloaded an archive of his tweets since 2015 from the Trump Twitter Archive website. This archive contains the majority of his tweets, though some deleted tweets may be missing.

To connect the two datasets, we used the following algorithm (see the PERL script for the full algorithm):

In all, of the 109,652 seconds of airtime referencing @realDonaldTrump in the onscreen text, 87,816 seconds yielded a match and 20,757 did not. Of the matching seconds of airtime, 1,079 occurred during the precise second between programs, such as MSNBC transitioning from its 4PM to its 5PM show. Since the VGEG counts each program separately, this yields two entries for the same second – we drop the second match since it is duplicative for these purposes.
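The figures above are consistent with the 1,079 boundary entries being counted separately from the matched and unmatched totals, since the three numbers sum to 109,652. The sketch below checks that arithmetic and shows one way such duplicates could be dropped, assuming each row is a dictionary with the "station" and "date" fields returned by the query. The function name drop_boundary_duplicates and the sample rows are hypothetical, and this is not the authors' script.

```python
def drop_boundary_duplicates(rows):
    """Keep only the first row seen for each (station, second) pair.

    Assumes `rows` is already ordered by date, as in the query above.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (row["station"], row["date"])
        if key in seen:
            continue  # same station and second already recorded: drop it
        seen.add(key)
        kept.append(row)
    return kept


# Sanity check on the counts reported in the text.
total, matched, unmatched, boundary = 109_652, 87_816, 20_757, 1_079
assert matched + unmatched + boundary == total

# Hypothetical rows: one second shared by two consecutive MSNBC programs.
rows = [
    {"station": "MSNBC", "date": "2020-03-01 21:00:00", "showName": "Show A"},
    {"station": "MSNBC", "date": "2020-03-01 21:00:00", "showName": "Show B"},
    {"station": "CNN", "date": "2020-03-01 21:00:00", "showName": "Show C"},
]
print(len(drop_boundary_duplicates(rows)))
```

Keying on the station as well as the timestamp matters here, because two different channels showing a tweet in the same second are distinct appearances rather than duplicates.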

For Joe Biden, the OCR process appears to have frequently split his social media handle "@JoeBiden" into "@Joe" and "Biden" (similar to how "@realDonaldTrump" was frequently split into "@realDonald" and "Trump") so the following SQL query was used to retrieve all appearances:

SELECT date, station, showName, iaThumbnailUrl, iaClipUrl, OCRText
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
WHERE DATE(date) >= "2020-01-01"
  AND (station = 'CNN' OR station = 'MSNBC' OR station = 'FOXNEWS')
  AND REGEXP_CONTAINS(LOWER(OCRText), r'@\s*joe\s*biden')
ORDER BY date ASC

Given the small number of results, these were then manually connected to their corresponding tweets. In this case, multiple tweets appearing onscreen concurrently were each uniquely recorded.
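To illustrate how the regular expression in the Biden query tolerates the OCR split, here is a rough Python equivalent. The function name mentions_biden_handle and the sample strings are hypothetical. Note also that Python's \s matches a slightly broader set of whitespace characters than the RE2 engine used by BigQuery, so this is an approximation rather than an exact reproduction.

```python
import re

# Same pattern as in the query: "@", then "joe", then "biden",
# with optional whitespace allowed between the pieces.
BIDEN_HANDLE = re.compile(r"@\s*joe\s*biden")


def mentions_biden_handle(ocr_text: str) -> bool:
    """Return True if the lowercased OCR text contains the handle pattern."""
    return BIDEN_HANDLE.search(ocr_text.lower()) is not None


# Hypothetical OCR strings, invented for illustration only.
for sample in ["@JoeBiden", "@Joe Biden", "@Joe\nBiden", "Joe Biden"]:
    print(repr(sample), mentions_biden_handle(sample))
```

Requiring the leading "@" is what separates an onscreen handle from the far more common plain-text mentions of the candidate's name.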