Announcing The Television News Ngram 2.0 Dataset

Last year, in collaboration with the Internet Archive's Television News Archive, we unveiled the Television News Ngram dataset: a compilation of 30-minute word frequency histogram tables of unigrams and bigrams for 14 stations, some stretching back more than a decade and totaling almost 740,000 broadcasts. Today we are tremendously excited to announce the release of the Television News Ngram 2.0 Dataset! The original 1.0 ngrams will continue to be updated, but this new 2.0 dataset adds an array of powerful new features!

Word histograms represent an ideal mechanism through which to explore large textual datasets like television news. In the case of television, converting closed captioning to half hour word and phrase histograms enables non-consumptive analysis of underlying topical and linguistic patterns without access to the actual captioning itself.

These non-consumptive datasets open the door to an incredible variety of exciting new analyses, from normalizing clip-based ngrams to charting the ebb and flow of the topics and language covered on television news over the past decade. Simply by plotting mentions of each word or phrase on a timeline, all kinds of trend analyses become possible: surfacing trending topics, uncovering narratives more popular on one station than another, and even identifying words whose usage over time appears to be highly correlated or anti-correlated.

The 2.0 release of this dataset has the following major changes:

  • Ngrams are now at 10 minute resolution rather than 30 minute resolution. For a 30 minute evening news broadcast, this makes it possible to analyze language shifts across the beginning, middle and end of the broadcast to look for things like emotional trajectories (do broadcasts open with sad news and end on a happy note, for example?). This increased temporal resolution should make it possible to examine fine-grained shifts in language around breaking events and the overall morning-to-evening shifts in emotional tenor on each channel.
  • The file format is identical, with the addition of a new column at the end that contains the show name. This makes it possible to compare the language used in a particular show against that of another show, whether comparing personality-driven and news shows on a given station or different personalities within or across stations. In addition, in the 1.0 files, if one show ran a few minutes into the next show's time slot, those words were counted towards that next time slot, making it difficult to segment them using the show inventory files. In the 2.0 format, since each show is its own file, if a few minutes of a show overlap into the next 10 minute segment, those words are recorded in that show's file as expected.
  • Rather than a single daily file containing all ngrams for the day, each individual show now has its own ngram file. This decoupling makes it possible for us to make ngram files available the moment a show finishes processing and allows you to download only the minimum amount of data you wish to analyze, as well as properly handling overlapping show start/end periods.
  • In addition to unigram and bigram files, we are also now generating trigram, quadgram and 5-gram files to make it possible to analyze full word context. Thus, for each show there will now be five ngram files generated, one each for its unigram, bigram, trigram, quadgram and 5-gram files.
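The 10 minute aggregation described above amounts to flooring each caption line's UTC timestamp to its containing block. The helper below is an illustrative sketch of that bucketing step, not the production pipeline:

```python
from datetime import datetime

def block_start(ts):
    """Floor a UTC caption timestamp to the start of its 10 minute
    block, the aggregation resolution of the 2.0 ngrams (a sketch,
    not GDELT's actual code)."""
    return ts.replace(minute=ts.minute - ts.minute % 10,
                      second=0, microsecond=0)
```

For example, a caption spoken at 14:37:22 UTC would be tallied into the 14:30 block.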

To create these ngrams, the closed captioning of each station is lowercased and aggregated into 10 minute blocks in the UTC timezone. Shows that span multiple blocks are split, with each line of captioning assigned to its respective block. Words and acronyms that contain internal periods (such as "U.S.A.") are stripped of those periods. Words that begin with punctuation (matching the regular expression "[!\"()*+,./:;<=>?[\\]^_`{|}~\-]") have it stripped off and are treated as bigram/trigram/quadgram/5-gram dividers (the word is considered the start of a new bigram/trigram/quadgram/5-gram). Words that end with punctuation (the same symbols as above) have it stripped off (except for words containing periods, like "a.m." or "ph.d."), with the current word considered part of the current bigram/trigram/quadgram/5-gram but the next word starting a new one.
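The word-cleaning rules above can be sketched in a few lines. The `PUNCT` pattern below is the post's own character class escaped for Python; the `normalize` helper itself is an illustrative assumption, not GDELT's actual code:

```python
import re

# The punctuation character class quoted above, escaped for Python.
PUNCT = r'[!"()*+,./:;<=>?\[\\\]^_`{|}~\-]'

def normalize(token):
    """Apply the stated word-cleaning rules: lowercase, strip leading
    and trailing punctuation, and drop internal periods from acronyms
    like "U.S.A.". A simplified sketch of the rules described above."""
    w = token.lower()
    w = re.sub('^' + PUNCT + '+', '', w)  # strip leading punctuation
    w = re.sub(PUNCT + '+$', '', w)       # strip trailing punctuation
    return w.replace('.', '')             # "u.s.a." -> "usa"
```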

Unigrams are computed simply by tallying the number of times each word is spoken in that show's closed captioning transcript in each given 10 minute increment.

Bigrams, trigrams, quadgrams and 5-grams are computed the same way but use a moving window that resets at each punctuation symbol. Thus, a captioning line "At the White House today, the president…" would yield unigrams of "at", "the", "white", "house", "today" and "president" and bigrams "at the", "the white", "white house", "house today" and "the president" (since it resets at the comma), trigrams "at the white", "the white house", "white house today" (the part after the comma with "the president" does not yield a trigram since it is only two words long), quadgrams of "at the white house" and "the white house today" and a solitary 5-gram of "at the white house today".
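The reset-at-punctuation windowing above can be sketched as follows. The `segment` and `ngrams` helpers are illustrative names for a simplified version of the stated rules, not the production tokenizer:

```python
import re
from collections import Counter

# The punctuation class quoted earlier, escaped for Python.
PUNCT = re.compile(r'[!"()*+,./:;<=>?\[\\\]^_`{|}~\-]')

def segment(line):
    """Split a lowercased caption line into runs of words that
    n-grams may span. Trailing punctuation on a word closes the
    current run; leading punctuation opens a new one."""
    runs, current = [], []
    for raw in line.lower().split():
        opens = bool(PUNCT.match(raw))        # starts with punctuation
        closes = bool(PUNCT.match(raw[-1]))   # ends with punctuation
        word = PUNCT.sub('', raw)             # strip punctuation
        if opens and current:
            runs.append(current)
            current = []
        if word:
            current.append(word)
        if closes and current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

def ngrams(line, n):
    """Count n-grams with a moving window that resets at punctuation."""
    counts = Counter()
    for run in segment(line):
        for i in range(len(run) - n + 1):
            counts[' '.join(run[i:i + n])] += 1
    return counts
```

Running this on the example line reproduces the windows described above: the comma after "today" closes one run, so no bigram spans "today the".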

A separate set of five ngram files (unigram, bigram, trigram, quadgram, 5-gram) is saved for each show in the following format, with each row representing a unique word in that given 10 minute period:

  • DATE. The date in YYYYMMDD format. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
  • STATION. The station identifier used by the Internet Archive for that station. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
  • HOUR. The hour in 24-hour format (from "0" indicating midnight to "23" indicating 11PM).
  • WORD. The word or phrase.
  • COUNT. The number of times it was mentioned in that 10 minute period on that station during that day.
  • SHOW. The name of the show as recorded by the Internet Archive.

Note that the files are stored in UTF-8 to account for the fact that some stations' captioning contains non-ASCII characters.
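Given those column definitions, a row can be parsed like this. The sample row below is hypothetical, shown only to illustrate the TSV layout:

```python
import csv
from io import StringIO

COLUMNS = ["DATE", "STATION", "HOUR", "WORD", "COUNT", "SHOW"]

# A hypothetical sample row for illustration; real files are gzipped TSV.
SAMPLE = "20200601\tCNN\t14\twhite house\t3\tCNN Newsroom\n"

def parse_ngrams(text):
    """Parse ngram TSV rows into dicts keyed by the columns above."""
    for row in csv.reader(StringIO(text), delimiter="\t"):
        rec = dict(zip(COLUMNS, row))
        rec["COUNT"] = int(rec["COUNT"])
        yield rec
```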

To determine which shows are available, a daily inventory file, named in YYYYMMDD format, is available from July 2, 2009 through present and lists all of the ngram files for that day. For example, the file for June 1, 2020 would be:

  • http://data.gdeltproject.org/gdeltv3/iatv/ngramsv2/20200601.txt

The inventory files for the most recent 72 hours are updated every 30 minutes as shows complete processing.
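The inventory URL for any day can be constructed from the YYYYMMDD pattern above; `inventory_url` is an illustrative helper, not part of the dataset's own tooling:

```python
from datetime import date

BASE = "http://data.gdeltproject.org/gdeltv3/iatv/ngramsv2/"

def inventory_url(day):
    """Return the daily inventory URL for the given date,
    following the YYYYMMDD naming pattern described above."""
    return f"{BASE}{day:%Y%m%d}.txt"
```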

As with the TV Explorer itself, there is a 24-48 hour delay after each broadcast before it is indexed into the ngram files. Thus, the previous three days are typically incomplete and are updated every 30 minutes as new broadcasts finish processing, in lockstep with the TV Explorer. You can check the available shows for the last few days to gauge how complete coverage is at any given moment.

Note that the results you receive from these ngrams may differ slightly from the Television Explorer. One reason is that the Explorer automatically drops "'s" when indexing words and uses ElasticSearch's default parsing, while this ngrams dataset indexes words exactly as they stand. This means that analyzing mentions of "trump" using the ngrams dataset requires searching for both "trump" and "trump's." A bigger factor is that the Explorer breaks broadcasts into 15 second blocks and reports the number of matching blocks. Even if a word is mentioned multiple times within a single 15 second block (such as a campaign rally chant of "trump! trump! trump! trump!"), it will be counted as only a single matching 15 second block by the Explorer, whereas the ngrams dataset will count it as four mentions of "trump" since it operates at the word level.
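In practice, this means summing the exact-match variants yourself. The `total_mentions` helper and the counts below are hypothetical, shown only to illustrate combining "trump" and "trump's":

```python
from collections import Counter

def total_mentions(counts, word):
    """Sum mentions of a word and its possessive form, since the
    ngrams index words exactly as spoken (e.g. "trump" vs. "trump's")."""
    return counts[word] + counts[word + "'s"]

# Hypothetical per-period counts for illustration only.
counts = Counter({"trump": 10, "trump's": 4})
```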

The following stations are processed at present:

  • ABC (San Francisco affiliate KGO) (7/15/2010 – present)
  • Al Jazeera (11/11/2017 – present)
  • BBC News (1/1/2017 – present)
  • Bloomberg (12/5/2013 – present)
  • CBS (San Francisco affiliate KPIX) (7/15/2010 – present)
  • CNBC (7/2/2009 – present)
  • CNN (7/2/2009 – present)
  • CSPAN (6/4/2009 – present)
  • CSPAN2 (6/4/2009 – present)
  • CSPAN3 (1/26/2012 – present)
  • DeutscheWelle (9/19/2017 – present)
  • FOX (San Francisco affiliate KTVU) (7/15/2010 – present)
  • Fox Business (8/20/2012 – present)
  • Fox News (7/2/2009 – present)
  • LinkTV (2/20/2012 – present)
  • MyNetworkTV (KRON) (7/14/2010 – present)
  • NBC (San Francisco affiliate KNTV) (7/16/2010 – present)
  • MSNBC (7/2/2009 – present)
  • PBS (San Francisco affiliate KQED) (7/14/2010 – present)
  • San Francisco Government TV (SFGTV) (7/15/2010 – present)
  • Russia Today (7/15/2010 – present)
  • Telemundo (San Francisco affiliate KSTS) (7/16/2010 – present)
  • Univision (San Francisco affiliate KDTV) (7/2/2009 – present)

All ngram files can be downloaded as gzipped TSV files per the inventory files above and are also available in a trio of BigQuery tables.

We're incredibly excited to see what you're able to do with this immense new dataset!