The GDELT Project

A 3.5 Year 634-Million Article Historical Backfile For The Extended Moral Foundation Dictionary

We announced yesterday that the Extended Moral Foundation Dictionary is now part of GCAM and is being applied to all articles monitored from here forward. Today we are excited to announce a 3.5 year historical backfile of eMFD scores covering a large fraction of the coverage GDELT has monitored since January 1, 2017: more than 634 million articles in all, spanning 65 languages.

To compute these scores, we used BigQuery to convert each article to lowercase, strip away punctuation, split the text into words, and run those words against the eMFD dictionary. The simplistic regular expressions used here are not a replacement for the advanced linguistic processing used by GCAM, so they will invariably make mistakes and the resulting scores may have some small degree of error. Since the eMFD is currently available only for English, it was applied to the English machine translations of all non-English articles, so for coverage not originally in English, machine translation errors add an additional layer of error. It is important to note that the eMFD was neither designed nor calibrated for use on machine translated content, so it is unclear to what degree scores may be affected by these linguistic differences. Finally, given that this was a prototype experiment to test how easily large historical emotional backfiles could be computed, it is likely that some percentage of articles were missed during the loading process. Thus, this dataset represents a "best effort" experimental prototype of computing historical emotional backfiles.
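
To make that workflow concrete, here is a minimal Python sketch of the per-article scoring, assuming an eMFD file with a "word" column followed by the ten score columns; the tokenization below is a simplification of the regular expressions in the BigQuery query under Technical Details, not the exact pipeline we ran:

import csv
import re

def load_emfd(path):
    # Illustrative loader: builds {word: {score_column: value}} from an eMFD CSV.
    emfd = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            word = row.pop("word")
            emfd[word] = {col: float(val) for col, val in row.items()}
    return emfd

def score_article(text, emfd):
    # Lowercase, strip punctuation, split into words, then average each eMFD score
    # over the words that match a dictionary entry (mirroring the sum(score)/sum(cnt)
    # aggregation in the BigQuery query below).
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    matched = [emfd[w] for w in words if w in emfd]
    if not matched:
        return None
    columns = matched[0].keys()
    return {col: sum(m[col] for m in matched) / len(matched) for col in columns}

Each returned value is simply the frequency-weighted mean of that column's dictionary scores over the article's matched words, which is the same aggregation the historical backfile applies at scale.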

Each year is saved as a UTF8 CSV file with the following columns: DATE, URL, Title, Lang, care_p, fairness_p, loyalty_p, authority_p, sanctity_p, care_sent, fairness_sent, loyalty_sent, authority_sent, sanctity_sent.

Each year is available as its own file, ranging from 10-30GB compressed to 30-90GB uncompressed.
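
Once a yearly file has been downloaded, it can be explored with ordinary CSV tooling. Here is a minimal pandas sketch, using a hypothetical local filename, that computes the average daily care_p score; given that a full uncompressed year runs tens of gigabytes, in practice you would likely read the file in chunks (read_csv's chunksize argument) or load it into BigQuery instead:

import pandas as pd

# Hypothetical filename; substitute the yearly file you downloaded.
df = pd.read_csv("emfd_2020.csv.gz", dtype={"DATE": "int64"})

# DATE is a 14-digit YYYYMMDDHHMMSS timestamp; reduce it to a calendar day.
df["day"] = pd.to_datetime(df["DATE"].astype(str).str[:8], format="%Y%m%d")

# Average care_p across all articles seen each day.
print(df.groupby("day")["care_p"].mean().head())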

We're tremendously excited to see what you're able to do with this massive backfile!

TECHNICAL DETAILS

To compute these scores, the eMFD dictionary and the articles were loaded into temporary BigQuery tables, with the following single SQL query computing the scores in just 10-20 minutes per year!

-- Outer query: for each article (URL), average each eMFD score over all of its
-- dictionary-matched words, weighting by how often each word appears.
SELECT MIN(DATE) DATE, URL, MIN(REPLACE(Title, "\uFFFD", "")) Title, COALESCE(MIN(Lang), 'eng') Lang,
SUM(care_p) / SUM(cnt) care_p,
SUM(fairness_p) / SUM(cnt) fairness_p,
SUM(loyalty_p) / SUM(cnt) loyalty_p,
SUM(authority_p) / SUM(cnt) authority_p,
SUM(sanctity_p) / SUM(cnt) sanctity_p,
SUM(care_sent) / SUM(cnt) care_sent,
SUM(fairness_sent) / SUM(cnt) fairness_sent,
SUM(loyalty_sent) / SUM(cnt) loyalty_sent,
SUM(authority_sent) / SUM(cnt) authority_sent,
SUM(sanctity_sent) / SUM(cnt) sanctity_sent
FROM (
  -- Scale each matched word's eMFD values by its in-article frequency.
  SELECT DATE, URL, Title, Lang, a.word word, a.cnt cnt,
    b.care_p*a.cnt care_p, b.fairness_p*a.cnt fairness_p, b.loyalty_p*a.cnt loyalty_p, b.authority_p*a.cnt authority_p, b.sanctity_p*a.cnt sanctity_p,
    b.care_sent*a.cnt care_sent, b.fairness_sent*a.cnt fairness_sent, b.loyalty_sent*a.cnt loyalty_sent, b.authority_sent*a.cnt authority_sent, b.sanctity_sent*a.cnt sanctity_sent
  FROM (
      -- Lowercase each article, strip punctuation at word boundaries, split into
      -- unigrams, and count each word's occurrences per URL (one year per run).
      SELECT MIN(DATE) DATE, URL, MIN(NativeTitle) Title, MIN(REGEXP_EXTRACT(TransCite, 'srclc:(.*?);')) Lang, ngram WORD, COUNT(1) cnt
      FROM `[TEMP_ARTICLETABLE]`, UNNEST(ML.NGRAMS(SPLIT(
          REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(
              LOWER(EnglishFullText), r'(\pP)$', r''), r'^(\pP)', r''), r'(\pP)\s', r' '), r'\s(\pP)', r' '), r'\s+', r' '),
          ' '), [1,1], '#')) ngram
      WHERE DATE >= 20200101000000 AND DATE <= 20201231999999 GROUP BY URL, ngram
  ) a JOIN (
      -- The eMFD lookup table: one row per word with its five moral foundation
      -- probabilities and five sentiment scores.
      SELECT word, care_p, fairness_p, loyalty_p, authority_p, sanctity_p, care_sent, fairness_sent, loyalty_sent, authority_sent, sanctity_sent FROM `[TEMP_EMFDTABLE]`
  ) b ON a.word = b.word
) GROUP BY URL
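
For those who want to reproduce this workflow themselves, here is a hedged sketch of how the per-year runs might be scripted with the BigQuery Python client. QUERY_TEMPLATE stands for the query above with its DATE bounds replaced by {start}/{end} placeholders, and the file, project, dataset, and bucket names are purely hypothetical rather than the pipeline we actually ran:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical file holding the query above, with {start}/{end} in the DATE filter.
QUERY_TEMPLATE = open("emfd_query.sql").read()

for year in range(2017, 2021):
    sql = QUERY_TEMPLATE.format(start=f"{year}0101000000", end=f"{year}1231999999")

    # Materialize the per-article scores into a destination table...
    dest = f"myproject.mydataset.emfd_{year}"  # hypothetical project/dataset
    client.query(sql, job_config=bigquery.QueryJobConfig(destination=dest)).result()

    # ...then export that table to Cloud Storage as sharded, gzipped CSV.
    client.extract_table(
        dest,
        f"gs://mybucket/emfd_{year}-*.csv.gz",  # hypothetical bucket
        job_config=bigquery.ExtractJobConfig(compression="GZIP"),
    ).result()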

That's all there is to it!