The GDELT Project

Announcing The Television News Ngram 2.0 Dataset

Last year, in collaboration with the Internet Archive's Television News Archive, we unveiled the Television News Ngram dataset: a compilation of 30-minute word frequency histogram tables of unigrams and bigrams for 14 stations, some stretching back more than a decade and totaling almost 740,000 broadcasts. Today we are tremendously excited to announce the release of the Television News Ngram 2.0 Dataset! The original 1.0 ngrams will continue to be updated, but this new 2.0 dataset includes an array of powerful new features!

Word histograms represent an ideal mechanism through which to explore large textual datasets like television news. In the case of television, converting closed captioning to 10-minute word and phrase histograms enables non-consumptive analysis of underlying topical and linguistic patterns without access to the actual captioning itself.

These non-consumptive datasets open the door to an incredible variety of exciting new kinds of analyses, from normalizing clip-based ngrams to analyzing the ebb and flow of the topics and language covered on television news over the past decade. Simply by plotting mentions of each word or phrase on a timeline, all kinds of trend analyses become possible, such as surfacing trending topics, uncovering narratives more popular on one station than another, and even identifying words whose usage over time appears to be highly correlated or anti-correlated.
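
For instance, once a word's 10-minute counts have been summed into daily totals, checking whether two words track each other takes only a few lines. Here's a minimal sketch using made-up illustrative data rather than actual dataset values:

    import numpy as np

    # Made-up daily mention counts for two words, e.g. built by summing
    # each word's 10-minute counts over each day (illustrative data only).
    word_a = np.array([120.0, 98.0, 143.0, 210.0, 187.0, 90.0, 75.0])
    word_b = np.array([110.0, 90.0, 150.0, 205.0, 190.0, 85.0, 80.0])

    # Normalize by the total words spoken per day so that busier news
    # days don't dominate the comparison.
    totals = np.array([50e3, 48e3, 52e3, 61e3, 59e3, 45e3, 44e3])
    a_pct = word_a / totals
    b_pct = word_b / totals

    # Pearson correlation of the two normalized timelines: values near +1
    # suggest correlated usage, values near -1 anti-correlated usage.
    correlation = np.corrcoef(a_pct, b_pct)[0, 1]
    print(f"correlation: {correlation:.3f}")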

The 2.0 release of this dataset makes several major changes: resolution increases from 30-minute to 10-minute windows, trigrams, quadgrams and 5-grams join the original unigrams and bigrams, files are now stored in UTF8, daily inventory files make it easy to see which shows are available and the entire dataset is also available in BigQuery.

To create these ngrams, the closed captioning of each station is lowercased and aggregated into 10-minute increments in the UTC timezone. Words that begin with punctuation (the regular expression "[!\"()*+,./:;<=>?[\\]^_`{|}~\-]") have it stripped off and are treated as bigram/trigram/quadgram/5-gram dividers (the word is considered the start of a new bigram/trigram/quadgram/5-gram). Words that end with punctuation (the same symbols as above) have it stripped off (except for words containing periods, like "a.m." or "ph.d."); the word itself is considered part of the current bigram/trigram/quadgram/5-gram, but the next word starts a new one.
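
GDELT hasn't published the tokenizer itself, so purely as an illustration, here is one way those rules could look in Python. The segment_words helper below is hypothetical; it returns runs of words, and the ngram windows reset at the boundaries between runs:

    PUNCT = "!\"()*+,./:;<=>?[\\]^_`{|}~-"  # the punctuation class quoted above

    def segment_words(line):
        """Split a lowercased caption line into runs of words; ngram
        windows reset at run boundaries. A sketch of the rules described
        above, not GDELT's actual implementation."""
        segments, current = [], []
        for raw in line.lower().split():
            word = raw.lstrip(PUNCT)
            # Leading punctuation: strip it and start a new window here.
            if word != raw and current:
                segments.append(current)
                current = []
            break_after = False
            if word and word[-1] in PUNCT:
                # Trailing punctuation: strip it and end the window after
                # this word, keeping the period of abbreviations ("a.m.").
                break_after = True
                bare = word.rstrip(PUNCT)
                word = bare + "." if (word.endswith(".") and "." in bare) else bare
            if word:
                current.append(word)
            if break_after and current:
                segments.append(current)
                current = []
        if current:
            segments.append(current)
        return segments

    print(segment_words("At the White House today, the president..."))
    # [['at', 'the', 'white', 'house', 'today'], ['the', 'president']]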

Unigrams are computed simply by tallying up the number of times each unique word is spoken in that show's closed captioning transcript in each given 10-minute increment.
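
In code, that tally is just a histogram over the word runs produced by the segment_words sketch above (again hypothetical, not GDELT's actual pipeline):

    from collections import Counter

    def unigram_counts(segments):
        """Histogram of word counts across the word runs in one 10-minute
        window (segments as produced by the segment_words sketch above)."""
        counts = Counter()
        for seg in segments:
            counts.update(seg)
        return counts

    segments = [["at", "the", "white", "house", "today"], ["the", "president"]]
    print(unigram_counts(segments))
    # Counter({'the': 2, 'at': 1, 'white': 1, 'house': 1, 'today': 1, 'president': 1})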

Bigrams, trigrams, quadgrams and 5-grams are computed the same way but use a moving window that resets at each punctuation symbol. Thus, a captioning line "At the White House today, the president…" would yield unigrams of "at", "the", "white", "house", "today" and "president" and bigrams "at the", "the white", "white house", "house today" and "the president" (since it resets at the comma), trigrams "at the white", "the white house", "white house today" (the part after the comma with "the president" does not yield a trigram since it is only two words long), quadgrams of "at the white house" and "the white house today" and a solitary 5-gram of "at the white house today".
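
The same structure makes the higher-order ngrams a short sliding-window exercise. This sketch, reusing the made-up segments from above, reproduces exactly the ngrams listed in that example; the key detail is that the window never crosses a punctuation reset:

    from collections import Counter

    def ngram_counts(segments, n):
        """Count n-grams with a window that slides within each word run
        but never crosses a run boundary, i.e. it resets at punctuation."""
        counts = Counter()
        for seg in segments:
            for i in range(len(seg) - n + 1):
                counts[" ".join(seg[i:i + n])] += 1
        return counts

    segments = [["at", "the", "white", "house", "today"], ["the", "president"]]
    for n in (2, 3, 4, 5):
        print(n, sorted(ngram_counts(segments, n)))
    # 2 ['at the', 'house today', 'the president', 'the white', 'white house']
    # 3 ['at the white', 'the white house', 'white house today']
    # 4 ['at the white house', 'the white house today']
    # 5 ['at the white house today']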

A separate set of five ngram files (unigram, bigram, trigram, quadgram, 5-gram) is saved for each show in the following format, with each row representing a unique ngram in that given 10-minute period:

Note that the files are stored in UTF8 to account for the fact that some stations' captioning contains non-ASCII characters.
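
Reading one of these files in Python just requires opening it as gzipped UTF8 text. In this sketch the filename is a placeholder, and the fields should be interpreted per the format described above:

    import gzip

    # Hypothetical filename for illustration; real filenames come from
    # the daily inventory files described below.
    path = "SOMESHOW.1gram.txt.gz"

    # The files are gzipped TSV stored in UTF8, so decode accordingly.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Interpret fields per the column layout documented above.
            print(fields)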

To determine which shows are available, a daily inventory file, named by date in the format YYYYMMDD, is available from July 2, 2009 through the present and lists all of the ngram files for that day. For example, for June 1, 2020 the file would be:

The inventory files for the most recent 72 hours are updated every 30 minutes as shows complete processing.
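
That means a downloader only needs to re-fetch the last few days' inventories periodically. The sketch below assumes a placeholder inventory URL pattern; only the YYYYMMDD naming comes from this post:

    from datetime import datetime, timezone, timedelta
    from urllib.request import urlopen

    # Hypothetical URL pattern -- substitute the actual inventory path
    # from this announcement; only the YYYYMMDD naming is taken from it.
    INVENTORY_URL = "http://data.gdeltproject.org/PATH_TO_INVENTORY/{day}.txt"

    def ngram_files_for(day):
        """Fetch a daily inventory file: one ngram file listing per line."""
        url = INVENTORY_URL.format(day=day.strftime("%Y%m%d"))
        with urlopen(url) as resp:
            return resp.read().decode("utf-8").splitlines()

    # The most recent ~72 hours are refreshed every 30 minutes, so
    # re-fetch the last three days periodically to pick up new shows.
    today = datetime.now(timezone.utc).date()
    recent_days = [today - timedelta(days=i) for i in range(3)]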

As with the TV Explorer itself, there is a 24-48 hour delay after each broadcast before it is indexed into the ngram files. Thus, the previous three days are typically incomplete and are updated every 30 minutes as new broadcasts finish processing, in lockstep with the TV Explorer. You can check the inventory files for the last few days to see how complete coverage is at any given moment.

Note that the results you receive from these ngrams may differ slightly from the Television Explorer. One reason is that the Explorer automatically drops "'s" when indexing words and uses ElasticSearch's default parsing, while this ngrams dataset indexes words exactly as they stand. This means that analyzing mentions of "trump" using the ngrams dataset requires searching for both "trump" and "trump's." A bigger factor is that the Explorer breaks broadcasts into 15-second blocks and reports the number of matching blocks. Even if a word is mentioned multiple times within a single 15-second block (such as a campaign rally chant of "trump! trump! trump! trump!"), it will be counted as only a single matching block by the Explorer, whereas the ngrams dataset will count it as four mentions of "trump" since it operates at the word level.
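
Concretely, capturing all mentions of "trump" from the unigram histograms means summing both entries, as in this sketch with illustrative counts:

    from collections import Counter

    # Illustrative unigram histogram for one 10-minute window.
    counts = Counter({"trump": 4, "trump's": 2, "president": 3})

    # The ngrams index words exactly as spoken, so the possessive is a
    # separate entry; sum both variants to capture all mentions.
    trump_mentions = counts["trump"] + counts["trump's"]
    print(trump_mentions)  # 6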

The following stations are processed at present:

All ngram files can be downloaded as gzipped TSV files per the inventory files above and are also available in a trio of BigQuery tables:
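
As a rough sketch of querying from Python with the google-cloud-bigquery client (the table id and column names below are placeholders rather than the documented schema):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder table id and column names -- use one of the actual
    # tables and its documented schema; this only shows the query shape.
    query = """
    SELECT STATION, SUM(`COUNT`) AS mentions
    FROM `gdelt-bq.SOME_DATASET.SOME_NGRAM_TABLE`
    WHERE NGRAM = 'white house'
    GROUP BY STATION
    ORDER BY mentions DESC
    """

    for row in client.query(query).result():
        print(row.STATION, row.mentions)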

We're incredibly excited to see what you're able to do with this immense new dataset!