New Television News Inventory Files

We're excited to announce the new Television News inventory files, which record the complete list of public shows archived by the Internet Archive's Television News Archive and processed as part of the Television Explorer and Television News Ngrams Dataset.

There are two new daily inventory files: one records the list of all shows monitored that day, and the other records the overall statistics and dominant show for each 30-minute slot on each station that day (making it ideal for analyzing the ngrams dataset).

SHOWLIST

For each day there is a "showlist" file that contains a list of every show monitored that day, even if it was just a few minutes long (and thus may not appear in the "timelist" inventory file for that day).

The files are stored at "http://data.gdeltproject.org/gdeltv3/iatv/inventory/YYYYMMDD.showlist.txt" where YYYYMMDD should be replaced with the day of interest, from the start date of 20090604 through 24 hours from present. Files are updated every 30 minutes as new shows are processed. Thus the showlist inventory file for June 1, 2019 is "http://data.gdeltproject.org/gdeltv3/iatv/inventory/20190601.showlist.txt".
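As a minimal sketch of the URL pattern above, the following Python helper builds the inventory URL for a given day. The function name `inventory_url` and its `kind` parameter are illustrative assumptions, not part of any official client; only the base URL, the YYYYMMDD pattern and the 20090604 start date come from the text.

```python
from datetime import date

BASE_URL = "http://data.gdeltproject.org/gdeltv3/iatv/inventory"
FIRST_DAY = date(2009, 6, 4)  # start date stated in the text


def inventory_url(day: date, kind: str = "showlist") -> str:
    """Return the inventory file URL for `day`.

    `kind` is "showlist" or "timelist". This helper is hypothetical;
    it only fills in the documented URL pattern.
    """
    if kind not in ("showlist", "timelist"):
        raise ValueError(f"unknown inventory kind: {kind!r}")
    if day < FIRST_DAY:
        raise ValueError(f"inventory files start at {FIRST_DAY:%Y%m%d}")
    return f"{BASE_URL}/{day:%Y%m%d}.{kind}.txt"


print(inventory_url(date(2019, 6, 1)))
```

Since files are updated every 30 minutes, a consumer would likely want to re-fetch recent days rather than cache them indefinitely.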

Each row represents one show monitored that day. Note that shows that span two days (such as a show that begins at 11PM one day and stretches one minute into the following day) will be recorded on both days, with their respective counts reflecting how much of the show's programming occurred on each day.
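To illustrate how a show that crosses midnight is divided between two days, here is a rough Python sketch. It assumes the split happens at UTC midnight and that airtime is counted in seconds; the text does not state the unit, so treat both as assumptions. The function name `airtime_by_day` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S"  # YYYYMMDDHHMMSS timestamp layout


def airtime_by_day(start: str, end: str) -> dict[str, int]:
    """Split a broadcast's duration (assumed seconds) across the UTC days it touches."""
    cur = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    out: dict[str, int] = {}
    while cur < stop:
        # Midnight that closes the current UTC day.
        next_midnight = datetime.combine(
            cur.date() + timedelta(days=1), datetime.min.time()
        )
        chunk_end = min(stop, next_midnight)
        out[cur.strftime("%Y%m%d")] = int((chunk_end - cur).total_seconds())
        cur = chunk_end
    return out


# A show that begins at 11PM and runs one minute into the next day.
print(airtime_by_day("20190601230000", "20190602000100"))
# {'20190601': 3600, '20190602': 60}
```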

Each row has the following fields (there is no header row):

  • STATION. The station identifier used by the Internet Archive for that station.
  • DATE. The date in YYYYMMDD format. This will always be the same as the filename date, but is included in the file to make it easier for database loading.
  • SHOWID. This is the Internet Archive's unique identifier for this broadcast.
  • STARTTIME. This is the precise start time of the broadcast, to the second, in YYYYMMDDHHMMSS format in the UTC timezone. Note that for shows that span two days, this reflects the start time of the broadcast as a whole.
  • ENDTIME. This is the precise end time of the broadcast, to the second, in YYYYMMDDHHMMSS format in the UTC timezone. Note that for shows that span two days, this reflects the end time of the broadcast as a whole.
  • TOTALAIRTIME. The total airtime of the broadcast that appeared on this day (including all portions of the broadcast that lacked captioning). For shows that spanned two days, this reflects the airtime of the broadcast on the current day.
  • CAPTIONEDAIRTIME. The total captioned airtime of the broadcast that appeared on this day. Uncaptioned portions of the broadcast (such as most commercials) are not included in this count. For shows that spanned two days, this reflects the captioned airtime of the broadcast on the current day. Note that since closed captioning records only the start time of each captioning line and not its end time, this field is calculated by subtracting each line's start time from the following line's start time and excluding lines where there is more than 15 seconds until the next line (this typically indicates a commercial or other break). Thus, this is an imperfect estimate but is fairly accurate.
  • TOTALWORDS. This is the total number of words in the closed captioning of this show. This count follows the same punctuation rules as used in the ngram dataset. For shows that spanned two days, this reflects the number of words on the current day.
  • UNIQUEWORDS. This is the total number of unique words in the closed captioning of this show. This count follows the same punctuation rules as used in the ngram dataset. For shows that spanned two days, this reflects the number of unique words on the current day.
  • PREVIEWURL. This is the URL to preview the broadcast on the Television News Archive's website.
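The CAPTIONEDAIRTIME estimate described in the field list can be sketched roughly as follows. This is an illustration of the stated rule (difference between consecutive line start times, dropping gaps of more than 15 seconds), not the actual production code; the function name and the use of plain second offsets as input are assumptions.

```python
def captioned_airtime(line_starts: list[float], max_gap: float = 15.0) -> float:
    """Estimate captioned airtime from caption line start times (in seconds).

    Each line's duration is taken as the time until the next line starts.
    Lines followed by a gap longer than `max_gap` are excluded, since such
    a gap typically indicates a commercial or other break. The final line
    has no successor, so it contributes nothing in this sketch.
    """
    total = 0.0
    for cur, nxt in zip(line_starts, line_starts[1:]):
        gap = nxt - cur
        if gap <= max_gap:
            total += gap
    return total


# Four caption lines, a long break, then two more lines.
starts = [0, 4, 9, 12, 200, 203]
print(captioned_airtime(starts))  # 15.0
```

The 188-second gap between the lines at 12 and 200 is dropped, which is why the estimate is imperfect: a long pause within captioned programming would be excluded as well.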

The dataset is also available in Google's BigQuery:

TIMELIST

For each day there is also a "timelist" file that breaks the day into 30 minute blocks for each station and records the aggregate statistics for each block, along with the primary show airing on that station in that block on that day.

The files are stored at "http://data.gdeltproject.org/gdeltv3/iatv/inventory/YYYYMMDD.timelist.txt" where YYYYMMDD should be replaced with the day of interest, from the start date of 20090604 through 24 hours from present. Files are updated every 30 minutes as new shows are processed. Thus the timelist inventory file for June 1, 2019 is "http://data.gdeltproject.org/gdeltv3/iatv/inventory/20190601.timelist.txt".

Each row represents a station/day/30-minute slot combination. In other words, for a given day there will be up to 48 slots for each station (24 hours at 30-minute resolution). Note that shows that span two days (such as a show that begins at 11PM one day and stretches one minute into the following day) will be recorded on both days across their respective slots. The semantics of the timelist inventory file are identical to those used for the ngrams dataset.
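As a small illustration of the 48-slot layout, the Python sketch below lists the slot labels for one day ("0000" through "2330") and maps a UTC timestamp to the slot it falls in. The function names are hypothetical, and the mapping simply rounds down to the half hour, which is an assumption about how slots are assigned.

```python
from datetime import datetime


def slot_labels() -> list[str]:
    """All 48 half-hour slot labels for one day, "0000" through "2330"."""
    return [f"{h:02d}{m:02d}" for h in range(24) for m in (0, 30)]


def slot_for(timestamp: str) -> str:
    """Map a YYYYMMDDHHMMSS UTC timestamp to its half-hour slot label."""
    t = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    minute = 30 if t.minute >= 30 else 0  # round down to the half hour
    return f"{t.hour:02d}{minute:02d}"


labels = slot_labels()
print(len(labels), labels[0], labels[-1])  # 48 0000 2330
print(slot_for("20190601140215"))          # 1400
```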

Each row has the following fields (there is no header row). Note that the ordering of some fields differs from that of the showlist inventory files:

  • DATE. The date in YYYYMMDD format. This will always be the same as the filename date, but is included in the file to make it easier for database loading.
  • STATION. The station identifier used by the Internet Archive for that station.
  • HOUR. The start of the 30-minute slot in the UTC timezone, in zero-padded HHMM format, from "0000" meaning midnight to "2330" meaning 11:30PM.
  • PRIMARYSHOWID. This is the Internet Archive's unique identifier of the longest broadcast airing during this 30-minute slot. Thus, if the 1-2PM broadcast runs until 2:02PM and the 2:30PM broadcast begins at 2:29PM, there might be three different broadcasts contributing captioning words to the 2:00-2:30PM slot, but this field will reflect the dominant show that occupied the majority of the slot, from 2:02PM to 2:29PM. This field can be used to understand which ngram slot reflects which show, for example to limit ngram analysis to the 30-minute evening news broadcast on ABC/CBS/NBC.
  • TOTALAIRTIME. The total airtime of all broadcasts that aired during this slot (including all portions of the broadcasts that lacked captioning). If multiple broadcasts started/ended during this slot their total airtime will be combined.
  • CAPTIONEDAIRTIME. The total captioned airtime of all broadcasts that aired during this slot. Uncaptioned portions of the broadcasts (such as most commercials) are not included in this count. If multiple broadcasts started/ended during this slot their total captioned airtime will be combined. Note that since closed captioning records only the start time of each captioning line and not its end time, this field is calculated by subtracting each line's start time from the following line's start time and excluding lines where there is more than 15 seconds until the next line (this typically indicates a commercial or other break). Thus, this is an imperfect estimate but is fairly accurate.
  • UNIQUEWORDS. This is the total number of unique words in the closed captioning of this slot. This count follows the same punctuation rules as used in the ngram dataset. For shows that spanned two days, this reflects the number of unique words on the current day.
  • TOTALWORDS. This is the total number of words in the closed captioning of this slot. This count follows the same punctuation rules as used in the ngram dataset. For shows that spanned two days, this reflects the number of words on the current day.
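As a rough sketch of loading a timelist file using the field order above, the Python snippet below parses rows and finds the slots dominated by a given show. It assumes the file is tab-delimited and that the four numeric fields are integers; neither is stated in the text. The class and function names, the station name and the show identifiers in the sample are all made up for illustration.

```python
from typing import NamedTuple


class TimelistRow(NamedTuple):
    date: str
    station: str
    hour: str
    primary_show_id: str
    total_airtime: int
    captioned_airtime: int
    unique_words: int  # note: UNIQUEWORDS precedes TOTALWORDS in this file
    total_words: int


def parse_timelist(text: str, delimiter: str = "\t") -> list[TimelistRow]:
    """Parse headerless timelist text (delimiter is an assumption)."""
    rows = []
    for line in text.splitlines():
        rec = line.split(delimiter)
        if len(rec) != 8:
            continue  # skip blank or malformed lines
        rows.append(TimelistRow(rec[0], rec[1], rec[2], rec[3], *map(int, rec[4:])))
    return rows


def slots_for_show(rows: list[TimelistRow], show_id: str) -> list[tuple[str, str]]:
    """Return (station, slot) pairs whose primary show is `show_id`."""
    return [(r.station, r.hour) for r in rows if r.primary_show_id == show_id]


# Invented sample rows, not real inventory data.
sample = (
    "20190601\tSTATION_X\t1400\tSHOW_A\t1800\t1200\t900\t3500\n"
    "20190601\tSTATION_X\t1430\tSHOW_B\t1800\t1100\t850\t3300\n"
)
rows = parse_timelist(sample)
print(slots_for_show(rows, "SHOW_A"))  # [('STATION_X', '1400')]
```

The resulting (station, slot) pairs could then be used to select the matching ngram slots for a single show.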

The dataset is also available in Google's BigQuery: