Today we are incredibly excited to announce the debut of the new Web News Ngram Datasets (WEB-NGRAM), offering one-word (1gram/unigram) and two-word (2gram/bigram) ngram/shingle word histograms at 15 minute resolution for all worldwide online news coverage monitored by GDELT from January 1, 2019 through present and updated every 15 minutes. Combined with our television news ngram dataset, this massive new ngram dataset offers unparalleled opportunities to explore the linguistic and narrative underpinnings of the journalistic lens through which we see the world around us.
The inaugural release of the WEB-NGRAM dataset unveiled today covers 42 billion words of news coverage in 142 languages spanning January 1, 2019 to present at 15 minute resolution and updating every 15 minutes from here forward.
Two ngram datasets are provided, one with unigrams and one with bigrams. Unigrams are computed simply by tallying up the number of unique words in each language in each 15-minute interval and the total number of times each appeared, yielding a simple 15-minute-resolution word histogram. Bigrams are computed the same way but use a two-word moving window that resets at each punctuation symbol. Thus, the English sentence "At the White House today, the president…" would yield unigrams of "at", "the", "white", "house", "today" and "president" and bigrams "at the", "the white", "white house", "house today" and "the president" (since it resets at the comma).
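The counting scheme above can be sketched as follows. This is a minimal illustration of the tallying rules, not GDELT's actual code, and the punctuation handling here is simplified relative to the full cleaning rules described below:

```python
import re
from collections import Counter

def ngrams(sentence):
    """Tally unigrams and bigrams, resetting the bigram window at punctuation.

    Hypothetical helper illustrating the counting rules; any run of
    punctuation acts as a bigram barrier, so bigrams never span a comma
    or period.
    """
    unigrams, bigrams = Counter(), Counter()
    # Split the lowercased sentence into segments at any punctuation run
    # (anything that is not a word character, whitespace, or apostrophe).
    for segment in re.split(r"[^\w\s']+", sentence.lower()):
        words = segment.split()
        unigrams.update(words)
        bigrams.update(" ".join(pair) for pair in zip(words, words[1:]))
    return unigrams, bigrams
```

Running it on the example sentence yields "the" with a count of 2 and bigrams such as "white house" and "house today", but no "today the", since the comma resets the window.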
To maximize the number of languages represented in this dataset, it is processed separately from the core translation and processing pipeline that powers most of GDELT's datasets. Every online news article monitored by GDELT is run through Google's Chrome Language Detector 2 (CLD2) in its extended recognition mode. Articles which CLD2 flags as having a "low quality" language determination or which resolve to an "X_" language are excluded. This means that while GDELT today live translates all coverage it monitors in 65 languages, the WEB-NGRAM dataset covers 142 languages today and over time will cover even languages for which no robust machine translation solution is currently available.
Our goal with this dataset is to provide the highest quality ngrams possible. Extensive manual review found that "low quality" language matches can be sufficiently cleaned and filtered by GDELT's core processing pipeline, and that articles mixing languages can be correctly handled by GDELT's translation infrastructure. However, for the purpose of converting that content into word frequency histograms, we wanted to limit this dataset to high quality matches only.
Articles passing this stage are converted to lower case according to Unicode rules and segmented by spaces into a sequence of "words" (defined as tokens without internal spaces). Each word is scanned against the Unicode "Punctuation" class. The appearance of punctuation at the start or end of a word is treated as a bigram barrier, preventing bigrams from spanning across punctuation of any kind. The punctuation is then stripped. Words which contain an "@" symbol or a period that does not appear between two numbers ("1.233") or as an acronym like "a.m." or "u.s.a." are removed and treated as a bigram barrier. Words that contain internal punctuation other than apostrophes, periods (that match the above cases), hyphens and colons (to permit cases like "1:23am") are similarly removed and treated as a bigram barrier.
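The token-cleaning rules above can be approximated in a short sketch. The regexes for the decimal and acronym period exceptions are our own simplifications of the rules as described, and `clean_token` is a hypothetical helper, not GDELT's production implementation:

```python
import re
import unicodedata

def is_punct(ch):
    # All Unicode "Punctuation" general categories begin with "P".
    return unicodedata.category(ch).startswith("P")

# Approximations of the period exceptions: decimals like "1.233" and
# acronyms like "a.m." or "u.s.a." (single letters separated by periods).
DECIMAL = re.compile(r"^\d+(\.\d+)+$")
ACRONYM = re.compile(r"^([a-z]\.)+[a-z]?$")

def clean_token(token):
    """Return (word, barrier) for one lowercased, space-delimited token.

    word is None when the token is removed entirely; barrier is True when
    the token breaks the bigram window.
    """
    # Punctuation at the start or end is a bigram barrier, then stripped.
    core, barrier = token, False
    while core and is_punct(core[0]):
        core, barrier = core[1:], True
    while core and is_punct(core[-1]):
        core, barrier = core[:-1], True
    if not core:
        return None, barrier
    # "@" symbols and periods outside the exception cases remove the word.
    if "@" in core:
        return None, True
    if "." in core and not (DECIMAL.match(core) or ACRONYM.match(core + ".")):
        return None, True
    # Internal punctuation other than apostrophes, hyphens, colons and the
    # period cases above also removes the word.
    if any(is_punct(c) and c not in "'-.:" for c in core):
        return None, True
    return core, barrier
```

For example, `clean_token("house,")` keeps the word but flags a barrier, `clean_token("1.233")` and `clean_token("1:23am")` survive via the exception cases, and a token containing "@" is dropped.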
We would be interested in hearing feedback on this processing pipeline, especially the ramifications of removing internally-punctuated words (other than the exception cases above) for specific languages. This pipeline was developed after extensive testing over many months, but given the sheer number of languages GDELT processes, the complexity of the world's languages and the incredible creativity of the world's authors, we would appreciate your feedback.
Languages that do not segment their words with spaces pose unique challenges to ngram construction. GDELT's core pipeline translates such languages through machine learning, using a combination of statistical word segmentation algorithms and direct translation models that do not produce segmented output. Even when segmented output is available, it reflects the specific statistical properties of that particular machine learning algorithm, which may in turn be biased by its original training data. This means that word frequency tables generated from the output of such algorithms reflect the algorithm's own statistical preferences, rather than necessarily capturing the true segmentation frequencies of that language.
Because we do not want to mix "true" ngrams representing author-determined word boundaries with machine learning-based statistical segmentation, and because we currently lack word segmentation algorithms for languages like Burmese, we have elected at this time not to include ngrams for Burmese, Chinese, Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese. We are considering character-based rolling-window ngrams for these languages, but would appreciate feedback from linguists and researchers working with those languages as to the ngram constructions that would be most useful.
UPDATE (10/3/2019): We now provide 1-5 character ngrams for these languages, bringing the total to 152 languages!
The ngrams are saved every 15 minutes into two gzipped UTF-8 files, one for the unigrams and one for the bigrams, in the format "http://data.gdeltproject.org/gdeltv3/web/ngrams/YYYYMMDDHHMMSS.1gram/2gram.txt.gz".
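Based on that filename pattern, the pair of URLs for any 15-minute interval can be constructed directly from its timestamp (a small sketch, assuming the timestamp is already aligned to a 15-minute boundary):

```python
from datetime import datetime

BASE = "http://data.gdeltproject.org/gdeltv3/web/ngrams/"

def ngram_urls(ts):
    """Build the unigram and bigram file URLs for one 15-minute interval,
    following the YYYYMMDDHHMMSS filename pattern described above."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return (BASE + stamp + ".1gram.txt.gz",
            BASE + stamp + ".2gram.txt.gz")
```

For example, `ngram_urls(datetime(2019, 10, 3, 14, 30))` yields the 1gram and 2gram URLs for the 14:30 interval on October 3, 2019.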
Within each file, each row represents a unique language/word/phrase and is tab delimited with the following columns (there is no header row):
- DATE. The date in YYYYMMDDHHMMSS format. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
- LANG. The human-readable language name as output by CLD2. Most language names are in all uppercase, though a few like Korean appear in titlecase and some may have underscores.
- NGRAM. The word or phrase.
- COUNT. The number of times the word/phrase was mentioned in articles of that language published in that given 15 minute interval.
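A downloaded file can be loaded with a few lines of code. This sketch assumes the file has been saved locally and simply walks the four tab-delimited columns described above:

```python
import gzip

def parse_ngram_file(path):
    """Yield (date, lang, ngram, count) tuples from one downloaded
    .txt.gz ngram file; the files have no header row."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            date, lang, ngram, count = line.rstrip("\n").split("\t")
            yield date, lang, ngram, int(count)
```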
Every 15 minutes two ngram files are produced (one for unigrams and one for bigrams), typically around 7-10 minutes after the hour, 22-25 minutes after the hour, 37-40 minutes after the hour and 52-55 minutes after the hour. The easiest solution is to set a cronjob to run at 10, 25, 40 and 55 minutes after the hour and check the contents of "http://data.gdeltproject.org/gdeltv3/web/ngrams/LASTUPDATE.TXT" for the latest available files.
A master file containing a list of all of the files is also available at "http://data.gdeltproject.org/gdeltv3/web/ngrams/MASTERFILELIST.TXT".
Note that comparing results from this dataset against GDELT Summary may yield different numbers, since GDELT Summary returns the number of documents containing a search term, whereas this dataset records how many times the term appeared. Most importantly, the two cover slightly different sets of articles: this dataset may include articles excluded from GDELT Summary (which only includes articles that passed the core GDELT processing pipeline), while it may exclude articles that CLD2 deemed low quality but that were successfully cleaned and filtered by the core pipeline and thus appear in GDELT Summary.
You can download all of the files from:
- http://data.gdeltproject.org/gdeltv3/web/ngrams/MASTERFILELIST.TXT (master list of all files)
- http://data.gdeltproject.org/gdeltv3/web/ngrams/LASTUPDATE.TXT (updated every 15 minutes with the latest files)
The two tables are also available in Google BigQuery.
Note that when using the BigQuery versions of the tables, the BigQuery UI will by default append "_PARTITIONTIME" filters to your SQL query. You must remove those, since the tables are not date partitioned (they use a dummy partitioning in order to permit clustering by LANG to optimize querying).
This is a highly experimental "alpha" release dataset and we may make changes and enhancements as we gather user feedback. We are very interested in hearing from users on any nuances of the dataset or processing pipeline that make it hard to use, any processing decisions in how we construct the dataset that adversely affect its representation of specific languages, recommendations for representing non-space-segmented languages without the use of statistical segmentation models, etc.
We are incredibly excited to see what you are able to do with these extraordinary new datasets!