Just how large is the Television News Ngram 2.0 Dataset? It comprises 3.12 billion unigrams, 4.98 billion bigrams, 4.1 billion trigrams, 2.9 billion 4-grams and 1.8 billion 5-grams (remember that in this dataset, word shingles do not span punctuation). In all, the complete dataset of 1-5 grams comprises just under 16.9 billion records!
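As a quick sanity check on that total, the per-order counts can be tallied in a few lines of Python. This is only a sketch built from the rounded figures quoted above (entered in millions to keep the arithmetic in integers); the names `NGRAM_COUNTS_MILLIONS` and `total_billions` are made up for illustration and are not part of the dataset or its tooling. Because the inputs are rounded, the sum lands on roughly 16.9 billion rather than the exact record count.

```python
# Rounded record counts quoted in the text, expressed in millions of records.
# These are the published approximations, not exact counts.
NGRAM_COUNTS_MILLIONS = {
    "1-grams": 3120,
    "2-grams": 4980,
    "3-grams": 4100,
    "4-grams": 2900,
    "5-grams": 1800,
}


def total_billions(counts_millions):
    """Sum per-order counts (given in millions) and return the total in billions."""
    return sum(counts_millions.values()) / 1000


total = total_billions(NGRAM_COUNTS_MILLIONS)
print(f"Total: about {total:.1f} billion records")

# Show how much each n-gram order contributes to the whole.
grand_total_millions = sum(NGRAM_COUNTS_MILLIONS.values())
for order, count in NGRAM_COUNTS_MILLIONS.items():
    share = 100 * count / grand_total_millions
    print(f"{order}: {count / 1000:.2f} billion ({share:.1f}% of all records)")
```

Working in integer millions avoids floating-point drift when adding the decimal figures, and the per-order shares make it easy to see that bigrams are the largest single slice.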
Television News Ngram 2.0 Dataset: 16.9 Billion Records
