Compiling A Massive Multilingual News-Based Stopword List

UPDATE (7/10/2020): Covid-19 has so dominated news coverage over the past half-year that "coronavirus" and "Covid-19" appear in some of the stoplists below and should be filtered out before use.

As we move rapidly towards launching multilingual search in the Context 2.0 API, a key missing component is a multilingual stopword list that covers all of the languages GDELT monitors. While there are myriad stopword lists available today, few cover more than a few dozen languages, if that, and even the most extensive don't cover all of the languages GDELT monitors. Most importantly, the majority are built from Wikipedia and a handful of other sources that often differ considerably in composition from the contemporary news coverage that is GDELT's focus.

Given that we already produce a unigram word frequency dataset over worldwide online news dating back to January 1, 2019, we can use it to create our own stopword list!

Compiling the top 10,000 words for each language GDELT monitors requires a single SQL query in BigQuery, which takes just 22 seconds to process all 91.8 billion words GDELT has monitored over that period:

SELECT LANG, NGRAM, RANK, CNT FROM (
  -- Total each word's count per language, then rank the words within each language
  SELECT LANG, NGRAM,
    RANK() OVER(PARTITION BY LANG ORDER BY SUM(COUNT) DESC) RANK,
    SUM(COUNT) CNT
  FROM `gdelt-bq.gdeltv2.web_1grams`
  GROUP BY LANG, NGRAM
  HAVING CNT > 50
) WHERE RANK <= 10000
ORDER BY CNT DESC
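To make the query's logic concrete, here is a minimal Python sketch that mirrors it on an in-memory list of rows. It is an illustration only, not the production pipeline: the function name top_words, its parameters, and the sample rows are hypothetical, and the input is assumed to be (language, word, count) tuples in which the same word may appear more than once.

```python
from collections import defaultdict

def top_words(rows, max_rank=10000, min_count=50):
    """Mirror the SQL query on (lang, ngram, count) tuples (hypothetical helper)."""
    # SUM(COUNT) ... GROUP BY LANG, NGRAM
    totals = defaultdict(int)
    for lang, ngram, count in rows:
        totals[(lang, ngram)] += count

    # HAVING CNT > min_count, grouped by language (PARTITION BY LANG)
    by_lang = defaultdict(list)
    for (lang, ngram), cnt in totals.items():
        if cnt > min_count:
            by_lang[lang].append((ngram, cnt))

    results = []
    for lang, words in by_lang.items():
        words.sort(key=lambda w: -w[1])
        rank, prev_cnt = 0, None
        for position, (ngram, cnt) in enumerate(words, start=1):
            # RANK(): tied counts share a rank; the next distinct count
            # takes its position in the sorted list, leaving a gap.
            if cnt != prev_cnt:
                rank, prev_cnt = position, cnt
            if rank <= max_rank:
                results.append((lang, ngram, rank, cnt))

    # ORDER BY CNT DESC
    results.sort(key=lambda r: -r[3])
    return results

# Made-up sample rows, for illustration only.
sample_rows = [
    ("ENGLISH", "the", 100), ("ENGLISH", "the", 50),
    ("ENGLISH", "of", 90), ("ENGLISH", "and", 90),
    ("ENGLISH", "to", 60), ("ENGLISH", "rare", 10),
]
for row in top_words(sample_rows, max_rank=3):
    print(row)
```

In this toy input "of" and "and" tie, so both receive rank 2 and the next word would receive rank 4, which is why it falls outside a cutoff of 3.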

You could also compute approximately the same top-10,000 lists using an approximation function, though this method is slower in this case:

SELECT LANG, TOPS.value NGRAM, TOPS.sum COUNT FROM (
  SELECT LANG, APPROX_TOP_SUM(NGRAM, CNT, 10000) TOPS FROM (
    SELECT LANG, NGRAM, SUM(COUNT) CNT
    FROM `gdelt-bq.gdeltv2.web_1grams`
    GROUP BY LANG, NGRAM
  ) GROUP BY LANG
), UNNEST(TOPS) TOPS

In all, 142 languages had sufficient volume to generate a stopword list.

You can download the final stopword list in CSV format, in which each row is one word with the following fields:

  • LANG. The human-readable language name returned by CLD2.
  • NGRAM. The word itself: a unigram for space-segmented languages. Languages handled at the character level (Burmese, Chinese, Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese) are not included for the time being.
  • RANK. The word's frequency rank within its language. Words with the same number of appearances share the same rank, and the ranks they would otherwise have occupied are skipped, so two words tied at rank 2 are followed by rank 4.
  • CNT. The number of times the word was seen since January 1, 2019.
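As a rough sketch of how the CSV might be consumed, the Python below loads it into one set of stopwords per language and drops the Covid-19 terms mentioned in the update note. Several things here are assumptions rather than facts about the file: that it has a header row naming the columns LANG, NGRAM, RANK and CNT, that a cutoff of the top 200 words is a sensible stopword threshold (the top_n value is arbitrary), and the names load_stopwords and NEWS_ARTIFACTS, which are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumption: terms to drop per the update note; extend as needed.
NEWS_ARTIFACTS = {"coronavirus", "covid-19"}

def load_stopwords(csv_file, top_n=200, exclude=NEWS_ARTIFACTS):
    """Return {language: set of words} from an open CSV file object.

    Assumes a header row with LANG, NGRAM, RANK and CNT columns.
    """
    stopwords = defaultdict(set)
    for row in csv.DictReader(csv_file):
        word = row["NGRAM"]
        # Skip news-cycle terms that are frequent but are not true stopwords.
        if word.lower() in exclude:
            continue
        # Keep only the most frequent words in each language.
        if int(row["RANK"]) <= top_n:
            stopwords[row["LANG"]].add(word)
    return dict(stopwords)

# Made-up sample data standing in for the downloaded file.
sample = io.StringIO(
    "LANG,NGRAM,RANK,CNT\n"
    "ENGLISH,the,1,1000\n"
    "ENGLISH,coronavirus,2,900\n"
    "ENGLISH,of,3,800\n"
    "FRENCH,de,1,500\n"
)
print(load_stopwords(sample))
```

With the real file, the same function could be called on a handle from open(path, newline="", encoding="utf-8"); the right cutoff will vary by language and application.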

Remember that all language detection is fully automatic using CLD2, so some words may be assigned to the wrong language. Also note that in articles containing multiple languages, all words are counted under the primary language of the article.

We hope this dataset is useful!