The GDELT Project

Announcing The WEB-NGRAM Character Ngram Datasets

Beyond the 142 languages in the word-based WEB-NGRAM dataset announced earlier this week, GDELT monitors 10 more languages that do not use spaces to separate their words. In NLP contexts these languages are traditionally processed with statistical or machine learning word segmentation algorithms, but the resulting word ngrams tend to reflect the quirks of the algorithm used rather than realtime linguistic shifts and evolution in usage, and robust segmentation algorithms are not available for every language.

Character ngrams have long been used in the NLP and linguistics communities to analyze such languages computationally. As part of the WEB-NGRAM dataset, we are therefore now generating character ngrams with lengths of 1, 2, 3, 4 and 5 characters for Burmese, Chinese, Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese.
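To make the idea concrete, here is a minimal Python sketch of counting character ngrams of 1 to 5 characters in a string. It is an illustration only: the function name char_ngrams is hypothetical, it treats each Unicode code point as one character, and it does not attempt to reproduce how GDELT itself handles whitespace, punctuation or combining marks.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=5):
    """Count every contiguous run of min_n..max_n characters in text.

    Assumption: one "character" is one Unicode code point, which is how
    Python 3 indexes str objects. The real pipeline may differ.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


sample = "中文文本"  # arbitrary four-character example string
counts = char_ngrams(sample)
print(counts["文"])    # the 1gram that occurs twice in the sample
print(counts["中文"])  # a 2gram that occurs once
```

Because no word boundaries are needed, the same function applies unchanged to any of the languages listed above.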

Note that some common libraries and tools have incomplete Unicode support, or are brittle when faced with the kinds of real-world characters found on the open web, and may report errors or fail to process certain lines. You may need to adjust your import workflow accordingly.

The character ngrams are written every 15 minutes to a gzipped UTF-8 text file named in the format "http://data.gdeltproject.org/gdeltv3/web/ngrams/YYYYMMDDHHMMSS.chargram.txt.gz".
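As a rough sketch of how the filename pattern above could be filled in, the Python snippet below builds the URL for the 15 minute window containing a given time. The helper name chargram_url is hypothetical, and the snippet assumes that files are stamped on quarter-hour boundaries (minutes 00, 15, 30, 45 with seconds 00); neither the alignment nor the timezone is stated here, so verify both against the actual file listing. The example date is arbitrary and does not imply that a file exists for it.

```python
from datetime import datetime

BASE_URL = "http://data.gdeltproject.org/gdeltv3/web/ngrams/"


def chargram_url(moment):
    """Return the chargram URL for the 15 minute window containing moment.

    Assumption: timestamps fall on quarter-hour boundaries with seconds 00.
    """
    floored = moment.replace(
        minute=moment.minute - moment.minute % 15,
        second=0,
        microsecond=0,
    )
    # YYYYMMDDHHMMSS, as in the filename pattern
    return f"{BASE_URL}{floored:%Y%m%d%H%M%S}.chargram.txt.gz"


# Arbitrary example time, 12:37:05 on 1 January 2020
print(chargram_url(datetime(2020, 1, 1, 12, 37, 5)))
```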

Within each file, each row represents a unique language/word/character sequence and is tab delimited with the following columns (there is no header row):

The character ngram file is found alongside the 1gram and 2gram files as part of the main WEB-NGRAM dataset.
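Since each file is gzipped, tab delimited, UTF-8 encoded and has no header row, one tolerant way to read a downloaded copy in Python is sketched below. The function name read_chargram_rows is hypothetical, and the sketch deliberately returns each row as a plain list of strings rather than assuming any particular column names. It replaces undecodable bytes instead of raising an error and splits on tabs directly rather than using a CSV parser, so stray quote characters in web text are not treated as field delimiters.

```python
import gzip


def read_chargram_rows(path):
    """Yield each row of a gzipped, tab-delimited, headerless file as a list.

    errors="replace" turns undecodable bytes into U+FFFD instead of raising,
    and newline="\n" keeps stray carriage returns from splitting a row.
    """
    with gzip.open(path, mode="rt", encoding="utf-8",
                   errors="replace", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


# Usage, assuming a file has already been downloaded to a local path
# (the path below is a placeholder, not a real file):
#
# for fields in read_chargram_rows("local_copy.chargram.txt.gz"):
#     print(fields)
```

Streaming row by row keeps memory use flat, which may matter when processing many 15 minute files in sequence.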

Below is a list of the total number of characters of text monitored from each of the languages.

Language    Total Characters
Chinese       12,823,748,014
VIETNAMESE     7,317,260,118
ChineseT       2,598,421,013
Japanese       1,025,068,769
THAI             913,383,383
KHMER            175,314,024
BURMESE          101,700,206
LAOTHIAN           2,788,334
TIBETAN            2,380,991
DZONGKHA               1,911