Compiling A Master List Of Social Media In The News 2016-2019

Following our release earlier today of all of the Donald Trump tweets found in the worldwide news coverage monitored by GDELT from 2016 to 2019, we are now releasing its counterpart: a massive dataset of every link in that coverage to a social media post on Facebook, Instagram, QQ, Twitter, Vimeo, VK and YouTube, from the start of GDELT's outlink monitoring on April 20, 2016 through the end of September 7, 2019. Only links to actual posts, not to user accounts, were included. A small percentage of links may be malformed due to HTML parsing errors at the time the page was crawled.

All of the data found in this dataset is contained in the GDELT GKG 2.0 from which this extract was compiled, but we hope that by condensing it into this more usable format we can jumpstart applications that examine which social media posts are covered in the news.

We are especially interested in how this dataset might be used in efforts to combat misinformation, disinformation, digital falsehoods and foreign influence. Identifying local social media content that has been linked to by local news outlets offers pointers to content that may be at least partially vetted. At the very least, it captures those social posts that have leapt from the social sphere to the news sphere and are attracting attention, meaning they may be of interest to fact checkers.

In all, 35,611,589 distinct document-link pairs were found, from 14,700,237 distinct news articles to 13,389,091 distinct social media posts. BigQuery took 118 seconds to process 294GB.

The final dataset can be downloaded below:

TECHNICAL DETAILS

For those interested in how this list was created, the following Standard SQL query was used to extract the links from the GKG 2.0. The specific filters below were chosen through manual review of the data, with an eye towards extracting only posts rather than links to user accounts.

WITH nested AS (
  SELECT
    DATE,
    DocumentIdentifier,
    SPLIT(REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>'), ';') AS links
  FROM `gdelt-bq.gdeltv2.gkg_partitioned`
  WHERE Extras LIKE '%<PAGE_LINKS>%'
    AND DATE < 20190908000000
)
SELECT DATE, DocumentIdentifier, link AS SocialLink
FROM nested, UNNEST(links) AS link
WHERE LOWER(link) LIKE '%twitter.com%/status/%'
  OR LOWER(link) LIKE '%/t.co/%'
  OR LOWER(link) LIKE '%facebook.com%/posts/%'
  OR LOWER(link) LIKE '%facebook.com%/photos/%'
  OR REGEXP_CONTAINS(LOWER(link), r'vimeo.com/[0-9]')
  OR LOWER(link) LIKE '%/vk.com%wall%'
  OR LOWER(link) LIKE '%/vk.com%video%'
  OR LOWER(link) LIKE '%.qq.com/%'
  OR LOWER(link) LIKE '%youtube.com%v=%'
  OR LOWER(link) LIKE '%instagram.com/p/%'
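For those who would rather apply the same filters outside of BigQuery, for example to tag each link in the extract with its platform, the query's logic can be approximated in Python. This is only a minimal sketch and not the code used to build the dataset: the platform labels and the helper names `like_to_regex`, `extract_links` and `classify` are hypothetical, and the LIKE translation assumes patterns that use only the `%` wildcard, as all of those in the query above do.

```python
import re

# Same pattern the query uses to pull the semicolon-delimited link list out of Extras.
PAGE_LINKS_RE = re.compile(r"<PAGE_LINKS>(.*?)</PAGE_LINKS>")


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern that uses only % wildcards into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(r"\A" + ".*".join(parts) + r"\Z", re.DOTALL)


# Platform labels are illustrative; the patterns are copied from the query.
RULES = [
    ("twitter", [like_to_regex("%twitter.com%/status/%"), like_to_regex("%/t.co/%")]),
    ("facebook", [like_to_regex("%facebook.com%/posts/%"),
                  like_to_regex("%facebook.com%/photos/%")]),
    ("vimeo", [re.compile(r"vimeo.com/[0-9]")]),  # REGEXP_CONTAINS in the query
    ("vk", [like_to_regex("%/vk.com%wall%"), like_to_regex("%/vk.com%video%")]),
    ("qq", [like_to_regex("%.qq.com/%")]),
    ("youtube", [like_to_regex("%youtube.com%v=%")]),
    ("instagram", [like_to_regex("%instagram.com/p/%")]),
]


def extract_links(extras):
    """Return the outlinks listed in a GKG Extras string, or [] if there are none."""
    match = PAGE_LINKS_RE.search(extras)
    return match.group(1).split(";") if match else []


def classify(link):
    """Return the first platform whose filter matches the lowercased link, else None."""
    lowered = link.lower()
    for platform, matchers in RULES:
        if any(matcher.search(lowered) for matcher in matchers):
            return platform
    return None


if __name__ == "__main__":
    sample = ("<PAGE_LINKS>https://twitter.com/user/status/123;"
              "https://example.com/about</PAGE_LINKS>")
    for link in extract_links(sample):
        print(link, classify(link))
```

As in the query, links are lowercased before matching, and a link to a bare user account such as `https://twitter.com/user` matches none of the filters.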