The GDELT Project

Television News Ngram 2.0 Dataset: 16.9 Billion Records

Just how large is the Television News Ngram 2.0 Dataset? It comprises 3.12 billion unigrams, 4.98 billion bigrams, 4.1 billion trigrams, 2.9 billion quadgrams and 1.8 billion 5-grams (remember that in our dataset word shingles do not span punctuation). Thus, the complete dataset of 1-5 grams comprises just under 16.9 billion records!
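To make the punctuation rule and the total concrete, here is a minimal Python sketch. It is an illustration only, not the dataset's actual tokenizer: the `shingles` helper, the punctuation set it splits on, and the lowercasing are all assumptions made for this example. The per-order counts are the rounded figures quoted above.

```python
import re

# Rounded per-order record counts quoted above, in billions.
COUNTS_BILLIONS = {1: 3.12, 2: 4.98, 3: 4.1, 4: 2.9, 5: 1.8}


def shingles(text, max_n=5):
    """Return word n-grams (1..max_n) that never cross punctuation.

    Hypothetical helper: the punctuation set and lowercasing are
    assumptions, not the dataset's documented tokenization rules.
    """
    grams = []
    # Split into segments at punctuation so no shingle can span it.
    for segment in re.split(r"[.,;:!?()\"]+", text.lower()):
        words = segment.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                grams.append(" ".join(words[i:i + n]))
    return grams


# Sum of the rounded counts, in billions of records.
total = round(sum(COUNTS_BILLIONS.values()), 2)

# The comma splits this caption into two segments, so
# "tonight stay" is never produced as a bigram.
grams = shingles("Breaking news tonight, stay tuned")
```

Because shingles stop at punctuation, higher-order ngrams can only come from longer unbroken runs of words, which is consistent with the counts falling from bigrams through 5-grams.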
