Announcing The WEB-NGRAM Character Ngram Datasets

In addition to the 142 languages contained in the word-based WEB-NGRAM dataset announced earlier this week, GDELT monitors an additional 10 languages that do not use spaces to separate their words. NLP pipelines traditionally process these languages with statistical or machine learning word segmentation algorithms. The resulting word ngrams, however, tend to reflect the quirks of the segmentation algorithm rather than realtime linguistic shifts and evolving usage, and robust segmentation algorithms are not available for every language.

Character ngrams have long been used in the NLP and linguistics communities to computationally analyze such languages. As part of the WEB-NGRAM dataset, we are therefore now generating character ngrams with lengths of 1, 2, 3, 4 and 5 characters for Burmese, Chinese (reported separately as Chinese and ChineseT), Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese.
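
To make the idea concrete, here is a minimal Python sketch of counting character ngrams of lengths 1 through 5 over a string. It is an illustration only, not GDELT's actual pipeline: the function name char_ngrams is hypothetical, and the choices to count by Unicode code point and to keep every character (including punctuation and whitespace) are assumptions, since the announcement does not say how those cases are handled.

```python
from collections import Counter

def char_ngrams(text, min_len=1, max_len=5):
    """Count every contiguous character sequence of length min_len..max_len.

    Assumption: a "character" is one Unicode code point, which is how
    Python iterates over str. Nothing is filtered or normalized.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[(n, text[i:i + n])] += 1
    return counts

counts = char_ngrams("東京都東京")
for (length, ngram), count in sorted(counts.items()):
    print(length, ngram, count)
```

Keying the counter on (length, ngram) mirrors the CHARLEN and NGRAM columns of the published files, so the output can be compared against them directly.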

Note that some common libraries and tools have incomplete Unicode support or are brittle when faced with the kinds of real-world characters found on the open web. They may report errors or fail to process certain lines, so you may need to adjust your import workflow accordingly.

The character ngrams are saved to a gzipped UTF-8 text file every 15 minutes, named in the format "http://data.gdeltproject.org/gdeltv3/web/ngrams/YYYYMMDDHHMMSS.chargram.txt.gz".
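
As a rough sketch, the file URL for a given moment could be built in Python as shown here. The helper name chargram_url is hypothetical, and the code assumes that files are stamped on exact quarter-hour boundaries (:00, :15, :30, :45) and that the timestamp passed in is already in the timezone the filenames use. Neither detail is stated above, so verify both against a real file listing before relying on them.

```python
from datetime import datetime

BASE_URL = "http://data.gdeltproject.org/gdeltv3/web/ngrams/"

def chargram_url(ts: datetime) -> str:
    """Return the chargram URL for the 15 minute slot containing ts.

    Assumption: slots start on exact quarter hours with seconds set to 00.
    """
    # Round down to the start of the enclosing 15 minute interval.
    floored = ts.replace(minute=ts.minute - ts.minute % 15,
                         second=0, microsecond=0)
    return BASE_URL + floored.strftime("%Y%m%d%H%M%S") + ".chargram.txt.gz"

print(chargram_url(datetime(2020, 1, 1, 12, 37, 5)))
```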

Within each file, each row represents a unique combination of language and character sequence and is tab delimited with the following columns (there is no header row):

  • DATE. The date in YYYYMMDDHHMMSS format. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
  • LANG. The human-readable language name as output by CLD2. Most language names are in all uppercase, though a few like Japanese appear in titlecase and some may have underscores.
  • CHARLEN. This is the length of the character sequence, from 1 to 5.
  • NGRAM. The character sequence.
  • COUNT. The number of times the character sequence was found in articles of that language published in that given 15 minute interval.
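
The column layout above can be read with a short Python sketch like the following. The names CharNgram and read_chargrams are hypothetical. Decoding with errors="replace" and silently skipping rows that do not have exactly five fields are assumptions chosen to tolerate the kinds of malformed lines mentioned earlier; you may prefer to log such rows instead.

```python
import gzip
import io
from typing import BinaryIO, Iterator, NamedTuple

class CharNgram(NamedTuple):
    date: str      # YYYYMMDDHHMMSS
    lang: str      # language name as output by CLD2
    charlen: int   # 1 to 5
    ngram: str     # the character sequence
    count: int     # occurrences in the 15 minute interval

def read_chargrams(stream: BinaryIO) -> Iterator[CharNgram]:
    """Yield one record per well-formed row of a gzipped chargram file."""
    # newline="\n" stops Python from treating a stray "\r" as a line break.
    with gzip.open(stream, mode="rt", encoding="utf-8",
                   errors="replace", newline="\n") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 5:
                continue  # assumption: skip rows with the wrong field count
            date, lang, charlen, ngram, count = fields
            try:
                yield CharNgram(date, lang, int(charlen), ngram, int(count))
            except ValueError:
                continue  # non-numeric CHARLEN or COUNT

# Usage with an in-memory sample; the values are made up for illustration.
sample = gzip.compress("20200101120000\tJapanese\t2\t東京\t7\n".encode("utf-8"))
for record in read_chargrams(io.BytesIO(sample)):
    print(record)
```

The function takes any binary file-like object, so the same code works on a file opened with open(path, "rb") or on bytes already downloaded into memory.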

The character ngram file is found alongside the 1gram and 2gram files as part of the main WEB-NGRAM dataset.

Below is a list of the total number of characters of text monitored from each of the languages.

Language      Total Characters
Chinese         12,823,748,014
VIETNAMESE       7,317,260,118
ChineseT         2,598,421,013
Japanese         1,025,068,769
THAI               913,383,383
KHMER              175,314,024
BURMESE            101,700,206
LAOTHIAN             2,788,334
TIBETAN              2,380,991
DZONGKHA                 1,911