Since April 20, 2016, GDELT's GKG 2.0 has compiled all of the outlinks in the body text of each article, making it possible to examine what sites are being linked to by the world's news media each day. This got us thinking – given that this period covers the final stretch of Donald Trump's candidacy and his entire presidency and that his tweets have such an outsized global impact, how easy would it be to compile a list of all of his tweets that have appeared in news articles in the last three and a half years?
With just a single SQL query, BigQuery is able to process all 200GB of outlink data in just 34 seconds to compile a final histogram of the most popular Donald Trump tweets of the last three and a half years, as measured by the amount of media attention paid to them.
In all, 13,539 unique Donald Trump tweets were linked to a total of 1,121,583 times (a tweet may be linked to multiple times in a single article).
The final list can be downloaded below in CSV format, recording how many times each tweet was linked to. Note that shortened tweets of the form "t.co" are not included here, so the actual number of linked tweets may be higher, though a manual spot check of articles containing Trump tweets all linked to the full "https://twitter.com/realDonaldTrump/" URL.
We're excited to see what new insights you may uncover from this list and how it might be extended to the Twitter accounts of other major world leaders and public figures that are regularly in the news!
For those interested in how this list was created, the following Standard SQL query was initially tested.
WITH nested AS ( SELECT SPLIT(LOWER(REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>')), ';') links FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE LOWER(Extras) like '%twitter.com/realdonaldtrump%' ) select MAX(link) TwitterURL, count(1) Count from nested, UNNEST(links) as link where LOWER(link) like '%twitter.com/realdonaldtrump%' group by LOWER(link) order by Count desc
While this query works, it counts "http://" vs "https://" URLs as different tweets, treats tweets with "?ref_src=" and other suffix parameters as distinct and does not properly handle sites that wrap Twitter outlinks inside of a redirect URL.
Instead, the following query was ultimately used to compile the list above. It extracts the Tweet ID of each @realDonaldTrump tweet and compiles a final normalized list, addressing all of these issues.
select CONCAT('https://twitter.com/realDonaldTrump/status/', tweetid) Tweet, count(1) Count from ( WITH nested AS ( SELECT SPLIT(LOWER(REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>')), ';') links FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE LOWER(Extras) like '%twitter.com/realdonaldtrump%' ) select REGEXP_EXTRACT(link, r'/status/(\d+)') tweetid from nested, UNNEST(links) as link where LOWER(link) like '%twitter.com/realdonaldtrump%' ) where tweetid is not null group by tweetid order by Count desc