
Announcing The New Web News NGrams 3.0 Dataset

We are tremendously excited to announce today the launch of the new Web News NGrams 3.0 Dataset! The original Web News NGrams 2.0 dataset launched in January 2019 and made it possible to trace global-scale linguistic trends in online news coverage. Yet it lacked robust support for scriptio continua languages and, unlike the Television News NGrams 2.0 dataset, offered no way to use it as an index to locate relevant coverage. Thus, while it could tell you that an unusual new word usage was emerging, there was no way for a human linguist to connect those emergent contexts back to the articles they appeared in. Similarly, it could tell you that a new term like "omicron" was suddenly surging in use, but could not tell you the list of articles it appeared in so that you could learn more. The new Web News NGrams 3.0 Dataset is designed to solve these issues by using a new Unicode-driven approach to language preprocessing and delineation and by adding provenance links.

The resulting Web News NGrams 3.0 Dataset consists of a realtime unigrams dataset, updated each minute and covering all 152 languages GDELT currently monitors. For space-delimited languages, articles are split into words and each individual word becomes a unigram, while for scriptio continua languages (languages which do not use spaces to segment morphemes), articles are split into individual extended grapheme clusters (a series of code points that visually render to a single ideographic or alphabetic character, including all modifiers) and each individual character becomes a unigram. This new approach allows scriptio continua languages to be examined in the same way as multi-word ngrams in space-segmented languages, by using the entry's ngram as a prefix.
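To illustrate the two segmentation modes, here is a minimal Python sketch; it is a simplification of GDELT's actual Unicode pipeline, and it relies on the third-party "regex" module, whose \X pattern matches one extended grapheme cluster:

import regex

def to_unigrams(text: str, scriptio_continua: bool) -> list[str]:
    # Simplified illustration of the split described above, not the
    # production pipeline (which performs additional Unicode preprocessing).
    if scriptio_continua:
        # No-space languages: one extended grapheme cluster per unigram, so
        # combining marks and modifiers stay attached to their base character.
        return [g for g in regex.findall(r"\X", text) if not g.isspace()]
    # Space-delimited languages: one word per unigram.
    return text.split()

print(to_unigrams("global news trends", scriptio_continua=False))
# ['global', 'news', 'trends']
print(to_unigrams("世界のニュース", scriptio_continua=True))
# ['世', '界', 'の', 'ニ', 'ュ', 'ー', 'ス']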

Each entry provides a brief contextual snippet showing the unigram in context. This can be used to extend the unigrams to bigrams, trigrams, quadgrams and longer ngrams. To match a bigram, search for its first word/character in the ngram field and its second word/character as the start of the "post" snippet field. To match a trigram, search for its middle word in the ngram field, its first word as the last entry of the "pre" snippet field, and its third word as the first word of the "post" snippet field (centering the ngram in the snippet fields maximizes the surrounding context). Similarly, to search for a 5-gram, search for its middle word as the ngram, its first two words as the last two words of the "pre" field, and its last two words as the first two words of the "post" field, and so on. The same process works for character-based ngrams in scriptio continua languages.
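As a rough Python sketch of this matching recipe, assuming each record exposes the "ngram", "pre", and "post" fields described here, with space-delimited snippets (character-based languages would split the snippets into characters instead):

def matches_ngram(record: dict, ngram: list[str]) -> bool:
    # Center the n-gram per the recipe above: the middle token must equal
    # the "ngram" field, earlier tokens must end the "pre" snippet, and
    # later tokens must start the "post" snippet.
    mid = (len(ngram) - 1) // 2  # bigrams center on their first token
    before, center, after = ngram[:mid], ngram[mid], ngram[mid + 1:]
    pre_words = record["pre"].split()
    post_words = record["post"].split()
    if record["ngram"] != center:
        return False
    if before and pre_words[-len(before):] != before:
        return False
    return post_words[:len(after)] == after

rec = {"ngram": "variant", "pre": "surge driven by the omicron", "post": "spreads across southern africa"}
print(matches_ngram(rec, ["omicron", "variant", "spreads"]))  # True (trigram rule)

rec2 = {"ngram": "omicron", "pre": "driven by the", "post": "variant spreading fast"}
print(matches_ngram(rec2, ["omicron", "variant"]))  # True (bigram rule)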

Each ngram record links back to the URL of the underlying article, meaning you can scan the ngrams to find the URLs of all of the articles that mentioned "pneumonia" or "protests" or "climate change" or "omicron variant" or "food shortages" or any desired term. This enables applications ranging from advanced linguistic research to relevancy filtering, directing users to the most relevant news coverage for a given topic.

Coupled with the GDELT Article List, Global Entity Graph and Global Similarity Graph, the Web News NGrams 3.0 Dataset can be used to provide state-of-the-art semantic news search, directing users to the coverage most relevant to their queries while looking across countries and languages.

Processing Pipeline

The following processing pipeline is used for each article to convert it to ngrams:

The end result is an ngram table that consists of unigram words for space-segmented languages and individual characters for scriptio continua languages. For alphabetic scriptio continua languages, the use of individual characters as unigrams is inefficient on its own, since single characters do not possess independent meaning, but it makes the matching of actual morphemes highly efficient, since the characters can be used for prefix matching.
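As an illustrative Python sketch of this conversion for a single article's token stream (the four-token snippet window and record layout are illustrative choices here, not the pipeline's actual parameters):

def article_to_records(tokens: list[str], url: str, window: int = 4) -> list[dict]:
    # Emit one record per unigram with `window` tokens of snippet context.
    # For scriptio continua text, tokens would be grapheme clusters and the
    # snippets would be joined with "" rather than " ".
    records = []
    for i, tok in enumerate(tokens):
        records.append({
            "ngram": tok,
            "pre": " ".join(tokens[max(0, i - window):i]),
            "post": " ".join(tokens[i + 1:i + 1 + window]),
            "url": url,
        })
    return records

for rec in article_to_records(["omicron", "cases", "surge", "worldwide"], "https://example.com/article"):
    print(rec)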

File Format

The final output is a UTF8 JSON-NL file each minute with the following fields:
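As a minimal reading sketch in Python, assuming the per-minute files are gzip-compressed (the filename and compression are assumptions here) and touching only the fields referenced in this post:

import gzip
import json

with gzip.open("20200101000000.webngrams.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        rec = json.loads(line)           # one JSON record per line
        print(rec["ngram"], rec["url"])  # fields referenced in this post
        break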

How To Use

Searching the dataset for a specific ngram is as simple as:
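For example, a minimal Python scan of one minute's file for an exact unigram match (filename and compression assumed, as above):

import gzip
import json

QUERY = "omicron"
hits = []
with gzip.open("20200101000000.webngrams.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["ngram"] == QUERY:
            hits.append(rec)

for rec in hits:
    # Show the term in context along with the article it came from.
    print(f'{rec["pre"]} [{rec["ngram"]}] {rec["post"]}  ->  {rec["url"]}')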

For the title and other article-level metadata about each underlying article, this dataset can be joined with the GDELT Article List.
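A sketch of such a join in Python, keyed on the shared article URL; the "title" field and record layout used for the Article List here are assumptions for illustration:

def join_on_url(ngram_recs: list[dict], articles: list[dict]) -> list[dict]:
    # Index article metadata by URL, then attach it to each ngram record.
    by_url = {a["url"]: a for a in articles}
    return [{**rec, **by_url[rec["url"]]} for rec in ngram_recs if rec["url"] in by_url]

articles = [{"url": "https://example.com/article", "title": "Omicron cases surge"}]
ngrams = [{"ngram": "omicron", "url": "https://example.com/article"}]
print(join_on_url(ngrams, articles))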

Dataset

The dataset currently runs from January 1, 2020 to present. The dataset can be downloaded directly every minute as a JSON file with the following URL structure, with the date represented as "YYYYMMDDHHMMSS" in the UTC timezone. The first available file is seen below.
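As a Python sketch of constructing a per-minute download URL, with the base path and filename suffix assumed for illustration (substitute the actual URL structure):

from datetime import datetime, timezone

BASE = "http://data.gdeltproject.org/gdeltv3/webngrams/"  # assumed base path

def minute_url(ts: datetime) -> str:
    # Format the UTC timestamp as YYYYMMDDHHMMSS per the structure above.
    return BASE + ts.strftime("%Y%m%d%H%M%S") + ".webngrams.json.gz"  # suffix assumed

print(minute_url(datetime(2020, 1, 1, 0, 0, tzinfo=timezone.utc)))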

You can also query the dataset in BigQuery:
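A sketch using the google-cloud-bigquery Python client; the table name in the query is an assumption for illustration, so substitute the table linked above:

from google.cloud import bigquery

client = bigquery.Client()  # requires configured Google Cloud credentials
sql = """
SELECT url, pre, ngram, post
FROM `gdelt-bq.gdeltv2.webngrams`  -- table name assumed for illustration
WHERE ngram = 'omicron'
LIMIT 100
"""
for row in client.query(sql).result():
    print(row.url, row.ngram)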

Note that a high level of duplicate URLs will appear in this dataset at this time while we are in the process of updating it to GDELT 3.0; these can be safely filtered out. The historical backfile contains a substantially elevated number of duplicate records due to how it was computed; as with the frontfile, these can be safely filtered out.
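A minimal Python sketch of that filtering, treating records with identical URL, ngram, and snippet fields as duplicates:

def dedupe(records):
    # Keep the first record seen for each (url, ngram, pre, post) combination.
    seen = set()
    for rec in records:
        key = (rec["url"], rec["ngram"], rec.get("pre", ""), rec.get("post", ""))
        if key not in seen:
            seen.add(key)
            yield rec

records = [{"url": "u", "ngram": "protests", "pre": "", "post": ""}] * 2
print(len(list(dedupe(records))))  # 1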

We are tremendously excited about the fundamentally new kinds of linguistic research and relevancy filtering this new dataset makes possible!