Announcing The New Web News NGrams 3.0 Dataset

We are tremendously excited to announce today the launch of the new Web News NGrams 3.0 Dataset! The original Web News NGrams 2.0 dataset launched in January 2019 and made it possible to trace global-scale linguistic trends in online news coverage. Yet it lacked robust support for scriptio continua languages and, unlike the Television News NGrams 2.0 dataset, offered no way to use it as an index to locate relevant coverage. Thus, while it could tell you that an unusual new word use was emerging, there was no way for a human linguist to connect those emergent contexts back to the articles they appeared in. Similarly, it could tell you that a new term like "omicron" was suddenly surging in use, but could not tell you which articles it appeared in so that you could learn more. The new Web News NGrams 3.0 Dataset is designed to solve these issues by using a new Unicode-driven approach to language preprocessing and delineation and by adding provenance links.

The resulting Web News NGrams 3.0 Dataset consists of a realtime unigrams dataset, updated each minute and covering all 152 languages GDELT currently monitors. For space-delimited languages, articles are split into words and each individual word becomes a unigram, while for scriptio continua languages (languages that do not use spaces to separate words), articles are split into individual extended grapheme clusters (a series of code points that visually render to a single ideographic or alphabetic character, including all modifiers) and each individual character becomes a unigram. This approach allows scriptio continua languages to be examined in the same way as multi-word ngrams in space-segmented languages, by treating the entry's ngram as a prefix and extending it through the contextual snippets.

Each entry provides a brief contextual snippet showing the unigram in context. This can be used to extend the unigrams to bigrams, trigrams, quadgrams and longer ngrams. To match a bigram, search for its first word/character in the ngram field and its second word/character as the start to the "post" snippet field. To match a trigram, search for its middle word in the ngram field, its first word as the last entry of the "pre" snippet field and its third word as the first word of the "post" snippet field (to center it in the snippet fields to maximize the surrounding context). Similarly, to search for a 5-gram, search for its middle word as the ngram, its first two words as the last two words of the "pre" field and its last two words as the first two words of the "post" field, and so on. The same process can be used for character-based ngrams for scriptio continua languages.

Each ngram record links back to the URL of the underlying article, meaning you can scan the ngrams to find the URLs of all of the articles that mentioned "pneumonia" or "protests" or "climate change" or "omicron variant" or "food shortages" or any other desired term. This enables uses ranging from advanced linguistic research to relevancy filtering, directing users to the most relevant news coverage for a given topic.
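As a concrete illustration of both ideas, the minimal Python sketch below scans a single minute file for the bigram "climate change" (first word in the "ngram" field, second word at the start of the "post" field, as described above) and collects the URLs of the matching articles. The local file name and the simple punctuation stripping are illustrative assumptions rather than part of the dataset specification.

    import gzip
    import json
    import string

    def matches_bigram(record, first, second):
        """True if the record's ngram is `first` and its post snippet starts with `second`."""
        ngram = record["ngram"].strip(string.punctuation).lower()
        post_tokens = record["post"].lower().split()
        return (ngram == first and post_tokens
                and post_tokens[0].strip(string.punctuation) == second)

    urls = set()
    # One minute file, one JSON record per line (illustrative local file name).
    with gzip.open("20200101000100.webngrams.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["lang"] == "en" and matches_bigram(rec, "climate", "change"):
                urls.add(rec["url"])

    print(len(urls), "articles mentioned 'climate change' in this minute")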

Coupled with the GDELT Article List, Global Entity Graph and Global Similarity Graph, the Web News NGrams 3.0 Dataset can be used to provide state-of-the-art semantic news search, directing users to the coverage most relevant to their queries while looking across countries and languages.

Processing Pipeline

The following processing pipeline is used for each article to convert it to ngrams:

  • First, a series of sanity checks is run over the article text to exclude articles for which CLD2 reported that its language estimate was too uncertain, articles that were too long or too short, and articles that appeared to have other confounding characteristics, such as malformed Unicode.
  • Articles that pass these checks are then normalized using Unicode Normalization Form C (NFC) to standardize the expression of characters across all articles as a fully composed sequence of code points. This ensures that a given canonical character will always be represented the same way in the ngram dataset, regardless of how it was expressed in the source document.
  • All runs of spaces, carriage returns, tabs and language-specific space characters (as defined by the Unicode tables) are collapsed into single spaces.
  • Casing is left as-is to permit distinguishing meanings that differ only by case, such as "us" meaning "we/us" versus "US" as shorthand for the "United States."
  • Punctuation is left as-is, meaning ngrams will contain punctuation symbols. Quotation marks, commas, periods and other characters will appear in ngrams, both enabling research into the use of a given word within a specific punctuation context and ensuring that languages with unique punctuation usage are not damaged by a rules-based punctuation cleaner that misinterprets how that language uses punctuation. Thus, the ngram for "climate" might appear as "climate," or "climate." or "climate!" or with a leading quotation mark, requiring language-aware punctuation filtering for search use cases.
  • Languages with space-delimited tokens (words in English, syllables in Vietnamese, etc.) are then split into individual tokens. For scriptio continua languages, including both ideographic (such as Chinese and Japanese) and alphabetic languages (Dzongkha, Khmer, Laothian, Myanmar, Tibetan, Thai, etc.), the approach many ngram systems take is to use probabilistic or ML approaches to split the text into "words," such as the ICU tokenizer commonly used by fulltext search systems. Such an approach biases the results towards existing language use and harms the ability of the system to recognize emergent terms, which is especially problematic in the pandemic's era of rapid language evolution. Moreover, robust segmentation models do not exist for many scriptio continua languages. Instead, in this dataset, articles in scriptio continua languages are split into single-character tokens, using Unicode extended grapheme clusters to define character boundaries.
  • Each individual token then becomes an ngram entry, excluding tokens that consist solely of punctuation with no other characters present, such as an isolated period surrounded on both sides by spaces, or standalone punctuation characters in scriptio continua languages.

The end result is an ngram table that consists of unigram words for space-segmented languages and individual characters for scriptio continua languages. For alphabetic scriptio continua languages, individual characters rarely carry independent meaning on their own, but using them as unigrams makes matching actual morphemes highly efficient, since each entry can be used for prefix matching.
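As a rough illustration of these steps (not GDELT's actual implementation), the Python sketch below normalizes a text to NFC, collapses whitespace and emits unigrams either as space-delimited tokens or as extended grapheme clusters, using the third-party "regex" module's \X pattern for grapheme segmentation:

    import unicodedata
    import regex  # third-party module: pip install regex

    def to_unigrams(text, scriptio_continua=False):
        # Normalize to NFC so canonically equivalent characters compare equal.
        text = unicodedata.normalize("NFC", text)
        # Collapse all runs of whitespace (spaces, tabs, newlines, Unicode spaces).
        text = regex.sub(r"\s+", " ", text).strip()
        if scriptio_continua:
            # One unigram per extended grapheme cluster, dropping spaces and
            # punctuation-only clusters.
            return [g for g in regex.findall(r"\X", text)
                    if not regex.fullmatch(r"[\s\p{P}]+", g)]
        # One unigram per space-delimited token, dropping punctuation-only tokens;
        # punctuation attached to a word is kept, as described above.
        return [t for t in text.split(" ") if not regex.fullmatch(r"\p{P}+", t)]

    print(to_unigrams("Markets  fell sharply, analysts said."))
    print(to_unigrams("気候変動に関する報道。", scriptio_continua=True))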

File Format

The final output is a UTF-8 JSON-NL (newline-delimited JSON) file each minute with the following fields:

  • date. The JSON timestamp when the article was seen by GDELT.
  • ngram. The word or character unigram. This includes surrounding punctuation. For scriptio continua languages this is a single extended grapheme cluster, which can combine multiple Unicode code points.
  • lang. The ISO 639-1 or 639-2 language code of the underlying article, as estimated by CLD2.
  • type. The segmentation type used to divide this text into ngrams. A value of 1 indicates that the language uses spaces to segment words and thus ngrams are "words," while a value of 2 indicates that the language is a scriptio continua language such as Chinese or Japanese that does not use spaces between words and thus ngrams are characters. Note that Vietnamese has a value of 1 since, while spaces do not divide words, they divide the text into sufficiently small collections of characters that it can be treated as "words" for the purpose of ngram generation. The primary use case of this field is to advise how to combine the pre+ngram+post fields when performing contextual search of an ngram entry: a value of 1 means the application should insert spaces on either side of the "ngram" field, while a value of 2 means the fields should be combined without spaces (see the example after this list).
  • pos. The article decile in which the word/character was found, containing a value of 0, 10, 20, 30, 40, 50, 60, 70, 80 or 90. A value of 0 indicates the ngram appeared in the first 10% of the article's text, a value of 10 indicates it appeared in the second 10% and so on. This makes it possible to understand whether a given ngram was mentioned right at the start of an article, suggesting high relevancy, or buried at the end. It also allows a coarse-grained understanding of co-occurrence, indicating whether two ngrams appeared within the same 10% of the article text or were mentioned at opposite ends of the article. Since deciles represent a hard threshold, two side-by-side words that straddle a decile boundary can belong to separate deciles, so proximity analysis should use a margin of two deciles as a minimum window size.
  • pre. A brief contextual snippet of text preceding the appearance of the ngram allowing relevancy filtering and context determination. This is typically up to 7 words for space-segmented languages or the adjusted equivalent for scriptio continua languages that contains a similar amount of semantic information.
  • post. A brief contextual snippet of text following the appearance of the ngram allowing relevancy filtering and context determination. This is typically up to 7 words for space-segmented languages or the adjusted equivalent for scriptio continua languages that contains a similar amount of semantic information.
  • url. The URL of the underlying article from which the ngram was compiled.
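For example, the "type" field determines how an application should stitch the snippet fields back together around the ngram, as in the minimal sketch below (the sample record is fabricated for illustration):

    def snippet(record):
        """Rebuild the contextual snippet around an ngram record."""
        parts = (record["pre"], record["ngram"], record["post"])
        if record["type"] == 1:            # space-segmented language: join with spaces
            return " ".join(p for p in parts if p)
        return "".join(parts)              # scriptio continua: concatenate directly

    rec = {"ngram": "omicron", "type": 1,
           "pre": "spread of the new", "post": "variant across Europe"}
    print(snippet(rec))   # -> spread of the new omicron variant across Europe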

How To Use

Searching the dataset for a specific ngram is as simple as:

  • Unigrams: Single Words/Ideograms. To search for a single word or ideogram, just search the contents of the "ngram" field. Note that this field will contain surrounding punctuation, so for simple keyword search you will want to strip punctuation from this field (using language-specific rules for acronyms like "I.B.M." or "U.S." or titles like "Dr." or "Ph.D."). To perform relevancy filtering, scan the contents of the "pre" and "post" fields to understand the context of the mention and determine whether it is relevant. This could be as simple as searching for specific keywords in the surrounding text or as sophisticated as a full-fledged ML model designed to analyze the immediately surrounding tokens, such as a BERT model.
  • Bigrams: Two-Word/Character Phrases. To search for a two-word phrase, search for the first word in the "ngram" field and the second word as the start of the "post" field. For example, to search for "climate change" you would search for "climate" in the "ngram" field (taking care to strip punctuation from the field first) and "change" as the start of the "post" field. The same holds for character bigrams.
  • Trigrams: Three-Word/Character Phrases. To search for a three-word phrase, you could simply extend the process used for bigrams, searching for the first word in the "ngram" field and the next two words as the start of the "post" field. However, this means that the effective post-ngram context will be reduced, since you are using one of the post words to match the ngram. Instead, search for the middle word as the "ngram" field, the first word as the last word of the "pre" field and the third word as the first word of the "post" field. This will center the ngram and ensure equal amounts of context in the pre and post fields.
  • Quadgrams/Five-Grams And Beyond. Simply repeat the process used for trigrams to extend to longer phrase matches. To search for "United Nations Human Rights Council" you would search for "Human" in the "ngram" field (stripping away punctuation), "United Nations" as the end of the "pre" field and "Rights Council" as the start of the "post" field.
  • Character Grams. For both ideographic and alphabetic scriptio continua languages, the "ngram" field consists of a single character and associated punctuation. To search for a complete morpheme, simply use the same processes as above, but centered on characters rather than words. Remember that "characters" in this dataset are extended grapheme clusters and thus consist of the full sequence of Unicode code points necessary to render that character.
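The centered matching strategy above generalizes to phrases of any length. The Python sketch below illustrates it for a space-segmented language, assuming simple whitespace tokenization and punctuation stripping; the helper names and the sample record are illustrative only.

    import string

    def strip_punct(token):
        return token.strip(string.punctuation).lower()

    def matches_phrase(record, phrase):
        """True if `phrase` matches this record, centered on the phrase's middle word."""
        words = phrase.lower().split()
        mid = len(words) // 2
        if strip_punct(record["ngram"]) != words[mid]:
            return False
        pre = [strip_punct(t) for t in record["pre"].split()]
        post = [strip_punct(t) for t in record["post"].split()]
        # The words before the middle must end the "pre" snippet...
        if pre[len(pre) - mid:] != words[:mid]:
            return False
        # ...and the words after the middle must begin the "post" snippet.
        return post[:len(words) - mid - 1] == words[mid + 1:]

    rec = {"ngram": "Human", "pre": "speech before the United Nations",
           "post": "Rights Council on Tuesday drew"}
    print(matches_phrase(rec, "United Nations Human Rights Council"))   # True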

For the title and other article-level metadata about each underlying article, this dataset can be joined with the GDELT Article List.

Dataset

The dataset currently runs from January 1, 2020 to present. The dataset can be downloaded directly every minute as a JSON file with the following URL structure, with the date represented as "YYYYMMDDHHMMSS" in the UTC timezone. The first available file is shown below.

  • http://data.gdeltproject.org/gdeltv3/webngrams/20200101000100.webngrams.json.gz
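As a sketch, the per-minute file for any UTC timestamp can be fetched and parsed using nothing more than the URL pattern above:

    import gzip
    import io
    import json
    import urllib.request
    from datetime import datetime, timezone

    def minute_url(ts: datetime) -> str:
        # Build the documented URL pattern from a UTC timestamp.
        return ("http://data.gdeltproject.org/gdeltv3/webngrams/"
                f"{ts:%Y%m%d%H%M%S}.webngrams.json.gz")

    ts = datetime(2020, 1, 1, 0, 1, 0, tzinfo=timezone.utc)   # first available file
    with urllib.request.urlopen(minute_url(ts)) as resp:
        with gzip.open(io.BytesIO(resp.read()), "rt", encoding="utf-8") as f:
            records = [json.loads(line) for line in f]
    print(len(records), "ngram records in this minute")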

You can also query the dataset in BigQuery:
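As a sketch using the official google-cloud-bigquery Python client, a query like the one below collects the distinct URLs of English-language articles containing "omicron" on a given day. The table ID shown is an assumption and should be confirmed against the public gdelt-bq project; DISTINCT is used to collapse the duplicate URLs noted below.

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    query = """
        SELECT DISTINCT url
        FROM `gdelt-bq.gdeltv2.webngrams`      -- assumed table ID; confirm before use
        WHERE lang = 'en'
          AND LOWER(ngram) LIKE '%omicron%'    -- ngram values keep case and punctuation
          AND DATE(date) = '2021-11-26'
    """
    for row in client.query(query).result():
        print(row.url)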

Note that, while we are in the process of updating it to GDELT 3.0, a high level of duplicate URLs will appear in this dataset; these can be safely filtered out. The historical backfile contains a substantially elevated number of duplicate records due to how it was computed; as with the frontfile, these can be safely filtered out.

We are tremendously excited about the fundamentally new kinds of linguistic research and relevancy filtering this new dataset makes possible!