Announcing The Television News Ngram Datasets (TV-NGRAM)

Update (6/7/2020): This dataset has been superseded by the TV Ngram 2.0 Dataset. Users should migrate their workflows to this new, far more powerful dataset.


Today we are excited to announce the debut of the new Television News Ngram Datasets, offering one-word (1gram/unigram) and two-word (2gram/bigram) ngram/shingle word histograms at hourly resolution for television news coverage on ABC, Al Jazeera, BBC News, CBS, CNN, DeutscheWelle, FOX, Fox News, MSNBC, NBC, PBS, Russia Today, Telemundo and Univision, using data from the Internet Archive’s Television News Archive. Some of these ngram datasets stretch back more than a decade to 2009, while others begin more recently.

Word histograms are an ideal mechanism for exploring large textual datasets like television news. In the case of television, converting closed captioning to hourly word and phrase histograms enables non-consumptive analysis of underlying topical and linguistic patterns without access to the actual captioning itself.

These new non-consumptive datasets open the door to an incredible variety of exciting new analyses, from normalizing clip-based ngrams to charting the ebb and flow of the topics and language covered on television news over the past decade. Simply by plotting mentions of each word or phrase on a timeline, all kinds of trend analyses become possible, such as surfacing trending topics, uncovering narratives more popular on one station than another, and even identifying words whose usage over time is highly correlated or anti-correlated.

To create these ngrams, the closed captioning of each station is lowercased and aggregated by hour in the UTC timezone. Shows that span multiple hour blocks are split, with each line of captioning assigned to its respective hour. Words and acronyms that contain internal periods (such as “U.S.A.”) are stripped of those periods. Words that begin with punctuation (the regular expression “[!\"()*+,./:;<=>?[\\]^_`{|}~\-]”) have it stripped off and are treated as bigram dividers (the word is considered the start of a new bigram). Words that end with punctuation (the same symbols as above) have it stripped off (except for words containing periods like “a.m.” or “ph.d.”); the current word remains part of the current bigram, but the next word starts a new bigram.
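To make the tokenization rules concrete, here is a minimal Python sketch of this step, assuming whitespace-delimited caption lines. It is an approximation rather than a faithful reimplementation: the special handling of period-containing words like “u.s.a.”, “a.m.” and “ph.d.” described above is omitted for brevity.

    # Sketch of the caption tokenization described above. The flag returned
    # with each word marks where the bigram window resets.
    PUNCT = "!\"()*+,./:;<=>?[\\]^_`{|}~-"  # the punctuation class from the regex above

    def tokenize(line):
        """Yield (word, starts_new_bigram) pairs for one caption line."""
        reset = True  # the first word of a line starts a new bigram
        for raw in line.lower().split():
            word, starts_new, reset = raw, reset, False
            if word[0] in PUNCT:           # leading punctuation: strip it and
                word = word.lstrip(PUNCT)  # treat the word as a bigram divider
                starts_new = True
            if word and word[-1] in PUNCT:  # trailing punctuation: strip it;
                word = word.rstrip(PUNCT)   # the NEXT word starts a new bigram
                reset = True
            if not word:  # punctuation-only token: still divides bigrams
                reset = True
                continue
            yield word, starts_new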

Unigrams are computed simply by tallying the unique words on each station in each hour and the total number of times each appeared in the captioning, yielding a simple hourly word histogram.

Bigrams are computed the same way, but use a two-word moving window that resets at each punctuation symbol. Thus, a captioning line “At the White House today, the president…” would yield the unigrams “at”, “the”, “white”, “house”, “today” and “president” and the bigrams “at the”, “the white”, “white house”, “house today” and “the president” (since the window resets at the comma).
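Continuing the sketch above, the counting itself reduces to sliding that two-word window over the tokenizer’s output. Running the example line from the previous paragraph through it reproduces exactly the unigrams and bigrams listed there:

    from collections import Counter

    unigrams, bigrams = Counter(), Counter()

    def add_line(line):
        """Update the unigram and bigram tallies for one caption line."""
        prev = None
        for word, starts_new in tokenize(line):
            unigrams[word] += 1
            if starts_new:
                prev = None  # the punctuation reset: no bigram spans it
            if prev is not None:
                bigrams[prev + " " + word] += 1
            prev = word

    add_line("At the White House today, the president")
    # unigrams: at=1, the=2, white=1, house=1, today=1, president=1
    # bigrams:  "at the", "the white", "white house", "house today", "the president"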

The ngram files are saved as two gzipped daily files per station, one for the 1grams/unigrams (words) and one for the 2grams/bigrams (two-word phrases), in the format YYYYMMDD.STATION.1gram/2gram.txt.gz. Thus, the files for October 1, 2012 for CNN are “20121001.CNN.1gram.txt.gz” and “20121001.CNN.2gram.txt.gz”.
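Building those filenames programmatically is straightforward; for example, in Python:

    from datetime import date

    def ngram_filename(day, station, n):
        """Daily ngram filename in the YYYYMMDD.STATION.Ngram.txt.gz format."""
        return f"{day:%Y%m%d}.{station}.{n}gram.txt.gz"

    print(ngram_filename(date(2012, 10, 1), "CNN", 1))  # 20121001.CNN.1gram.txt.gz
    print(ngram_filename(date(2012, 10, 1), "CNN", 2))  # 20121001.CNN.2gram.txt.gz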

Within each file, each row represents a unique word/phrase and is tab-delimited with the following columns (there is no header row):

  • DATE. The date in YYYYMMDD format. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
  • STATION. The station identifier used by the Internet Archive for that station. This is included in the file to make it easier to load the ngrams as-is into a database for analysis.
  • HOUR. The hour of day in 24-hour UTC format (from “0” indicating midnight to “23” indicating 11PM).
  • WORD. The word or phrase.
  • COUNT. The number of times the word or phrase was mentioned in that hour on that station during that day.

Note that the files are stored in UTF-8, since some stations’ captioning contains non-ASCII characters.
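As a worked example, the following sketch loads one of these daily files into a pandas DataFrame, using the CNN filename from the example above; pandas handles the gzip compression and the tab-delimited, headerless layout directly:

    import pandas as pd

    COLUMNS = ["DATE", "STATION", "HOUR", "WORD", "COUNT"]

    df = pd.read_csv(
        "20121001.CNN.1gram.txt.gz",  # pandas decompresses .gz transparently
        sep="\t",
        names=COLUMNS,     # the files have no header row
        encoding="utf-8",  # the files are stored in UTF-8
        quoting=3,         # csv.QUOTE_NONE: captioning may contain quote characters
    )

    # Example: the hourly mentions of a single word across the day.
    print(df.loc[df["WORD"] == "president", ["HOUR", "COUNT"]])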

As with the TV Explorer itself, there is a 24-72 hour delay after each broadcast before it is indexed into the ngram files. Thus, the current day will not have a file, and the files for the previous three days are typically incomplete; they are updated every 30 minutes throughout the day as new broadcasts finish processing, in lockstep with the TV Explorer.

Analysts interested in the most recent news should redownload the ngram files for the previous three days every 30 minutes to get the latest results. Files for days more than three days in the past can be considered complete and do not need to be redownloaded.
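A minimal polling loop along those lines might look like the following; note that BASE_URL here is a hypothetical placeholder, since the actual file locations come from the FILELIST links below:

    import time
    import urllib.request
    from datetime import date, timedelta

    BASE_URL = "http://example.org/tv-ngrams/"  # hypothetical placeholder; see the FILELISTs below

    def refresh_recent(station, n=1):
        """Redownload the (typically still incomplete) files for the previous three days."""
        for days_back in range(1, 4):
            day = date.today() - timedelta(days=days_back)
            name = f"{day:%Y%m%d}.{station}.{n}gram.txt.gz"
            urllib.request.urlretrieve(BASE_URL + name, name)

    while True:
        refresh_recent("CNN")
        time.sleep(30 * 60)  # the files are updated every 30 minutes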

To track the completion status of each half hour segment to determine when processing for a given day has completed, please consult the new Television News Completion File.

Note that the results you receive from these ngrams may differ slightly from those of the Television Explorer. One reason is that the Explorer automatically drops “’s” when indexing words and uses Elasticsearch’s default parsing, while this ngrams dataset indexes words exactly as they stand. This means that analyzing mentions of “trump” using the ngrams dataset requires searching for both “trump” and “trump’s.” A bigger factor is that the Explorer breaks broadcasts into 15 second blocks and reports the number of matching blocks. Even if a word is mentioned multiple times within a single 15 second block (such as a campaign rally chant of “trump! trump! trump! trump!”), the Explorer counts it as only a single matching block, whereas the ngrams dataset counts it as four mentions of “trump” since it operates at the word level. Finally, a small number of shows that had historically been missing from the Television News Archive were recently added; these are reflected in the ngrams dataset and will be backfilled into the TV Explorer shortly.
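With the DataFrame loaded earlier, tallying both forms together is a one-liner; note the assumption that the captioning uses a straight apostrophe, which is worth verifying against the actual data:

    # Sum the bare and possessive forms; adjust the apostrophe character
    # to match what actually appears in the captioning.
    mentions = df.loc[df["WORD"].isin(["trump", "trump's"]), "COUNT"].sum()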

To make it easier to download all of the ngram files, a list of the URLs of all of the 1gram and 2gram ngram files is provided below. These lists are updated each day. Note that some stations have holes in their coverage in the Television News Archive and may be missing a day or two here and there, or have days where some shows are missing. There are also a few longer gaps, of up to several months in some cases, due to missing data in the Television News Archive.


  • ABC (San Francisco affiliate KGO) (7/15/2010 – present) [FILELIST]
  • Al Jazeera (11/11/2017 – present) [FILELIST]
  • BBC News (1/1/2017 – present) [FILELIST]
  • CBS (San Francisco affiliate KPIX) (7/15/2010 – present) [FILELIST]
  • CNN (7/2/2009 – present) [FILELIST]
  • DeutscheWelle (9/19/2017 – present) [FILELIST]
  • FOX (San Francisco affiliate KTVU) (7/15/2010 – present) [FILELIST]
  • Fox News (7/2/2009 – present) [FILELIST]
  • MSNBC (7/2/2009 – present) [FILELIST]
  • NBC (San Francisco affiliate KNTV) (7/16/2010 – present) [FILELIST]
  • PBS (San Francisco affiliate KQED) (7/14/2010 – present) [FILELIST]
  • Russia Today (7/15/2010 – present) [FILELIST]
  • Telemundo (San Francisco affiliate KSTS) (7/16/2010 – present) [FILELIST]
  • Univision (San Francisco affiliate KDTV) (7/2/2009 – present) [FILELIST]


The complete dataset is also available in Google's BigQuery.


Note that this is an "alpha" release dataset, meaning we may make a few changes and enhancements over time as we gather user feedback, so check back here periodically.

We are incredibly excited to see what you are able to do with these extraordinary new datasets!