New Television News Completion File

As the Television Explorer is increasingly used to understand the reaction to breaking news and evolving narratives, the unpredictable timing of its updates can make it difficult to estimate the precise cutoff for each station, meaning the point in time before which all aired content has been processed and is now searchable. Ideally, the Internet Archive's Television News Archive maintains a rolling 24-hour cutoff, with content becoming searchable approximately 24 hours after it airs. In practice, factors such as server load and video resolution mean shows can sometimes take up to 72 hours to finish processing, especially longer multi-hour shows. While these cases are rare, they make it difficult to determine whether a given station is fully "caught up" to 24 hours ago.

To make it easier for users to know the status of the Television Explorer and other TV-related services, we are debuting today the new Television News Completion File. It is a simple ASCII file that tracks, in 30 minute increments over the last four days, the processing status of each station the Television News Archive currently monitors, making it trivial to see which stations still have shows being processed and which are updated to precisely 24 hours ago.

The file format is very simple, containing status entries for each half hour block for each currently monitored station over the last four days. It is a standard ASCII file with four columns (note there is no header row in the actual file):

    • DATE. The date in YYYYMMDD format, from the previous day through four days ago.
    • STATION. The station identifier used by the Internet Archive's Television News Archive.
    • HOUR. The zero-padded start time of the slot in HHMM form at 30 minute resolution, from "0000" for midnight to "2330" for 11:30PM. All times are in the UTC timezone.
    • STATUS. This contains one of three values:
      • DONE. Means there was content in that 30 minute slot that was successfully processed by the Television News Archive. Note that this does not necessarily mean there was valid captioning for that slot – a show that does not contain closed captioning will still be recorded as "DONE" here.
      • DONE-NOCONTENT. Not all stations have 24-hour broadcast schedules. Some go off the air during certain parts of the evening and early morning, carry non-news programming such as paid infomercials during that period, or are not captured by the Archive during that period. Each half hour slot is compared to the same slot on the same day of the week in each of the previous five weeks. If at least three of the five weeks have no content in this slot, it is assumed to be a "dead" slot. Dead slots that are more than 24 hours old are marked with this status to indicate that, while they do not contain any content, they should be treated as "complete" from the standpoint of processing, since no content is expected to appear in that slot.
      • EMPTY FIELD. If the status field is blank, it means that no content has been seen for that slot yet and that in at least 3 of the last 5 weeks there was content on that station in that slot on that day of the week. These are slots that have not yet been processed but are expected to contain content. Isolated spans of empty slots typically indicate a show that has not yet been processed. Empty slots during the previous day will typically fill in, while empty slots from two to three days ago could theoretically fill in, though they more commonly indicate shows that for some reason were not processed by the Archive and are typically "lost." Users who have specific analytic requirements can wait up to 96 hours to see if the show eventually completes processing, but typically shows that have not been processed by 48 hours can be considered missing.
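
For readers who want to load the file programmatically, the four-column layout above could be parsed along the following lines. This is only a minimal sketch: the column delimiter is not specified above, so the tab delimiter here is an assumption, and the names Slot and parse_completion_file, along with the sample station identifier, are purely illustrative.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    date: str     # YYYYMMDD
    station: str  # Television News Archive station identifier
    hour: str     # "0000" through "2330", UTC
    status: str   # "DONE", "DONE-NOCONTENT", or "" (not yet processed)


def parse_completion_file(text: str, delimiter: str = "\t") -> list[Slot]:
    """Parse the header-less four-column file into Slot records.

    ASSUMPTION: columns are separated by `delimiter` (tab by default);
    adjust this if the actual file uses a different separator.
    """
    slots = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        # A blank STATUS may show up as an empty or an absent fourth field.
        fields += [""] * (4 - len(fields))
        date, station, hour, status = (f.strip() for f in fields[:4])
        slots.append(Slot(date, station, hour, status))
    return slots


# Illustrative input only; these rows are not real data.
sample = "20240101\tCNN\t0000\tDONE\n20240101\tCNN\t0030\t\n"
for slot in parse_completion_file(sample):
    print(slot)
```

Keeping the blank status as an empty string rather than dropping the row preserves the distinction between "expected but not yet processed" and the two completed states.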

The file is updated every 30 minutes around the clock and uses the HTTP header "Cache-Control: private" to prevent caching so that downloads always yield the latest copy. Users should download it at 15 and 45 minutes after each hour to get the latest version.
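
A script that polls the file could compute its next download time from that schedule. The following is a small sketch of the arithmetic only; the function name next_download_time is hypothetical, and the actual download step is left out since it depends on the client being used.

```python
from datetime import datetime, timedelta, timezone


def next_download_time(now: datetime) -> datetime:
    """Return the next :15 or :45 mark strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    # 75 minutes past the top of this hour is :15 of the next hour.
    for offset in (15, 45, 75):
        candidate = top_of_hour + timedelta(minutes=offset)
        if candidate > now:
            return candidate
    raise AssertionError("unreachable: :15 of the next hour is always later")


now = datetime.now(timezone.utc)
wait = next_download_time(now) - now
print(f"Next download in {wait.total_seconds():.0f} seconds")
```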

For most use cases, users should scan the file backwards for each station of interest, starting at the slot that is 24 hours prior and scanning backwards until at least 6 hours' worth of slots (12 consecutive slots) contain either "DONE" or "DONE-NOCONTENT" with no empty slots between them. If there are a few empty earlier slots for that station, users can decide whether to work backwards until there are no empty slots or simply accept a missing slot here or there in order to analyze events as close to 24 hours ago as possible.
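
The backwards scan described above might be sketched as follows for a single station. This is one possible reading of the procedure rather than a reference implementation: the function names are hypothetical, the input is assumed to be (DATE, HOUR, STATUS) tuples already filtered to one station, and "the slot that is 24 hours prior" is interpreted here as the half hour slot containing the moment exactly 24 hours before now.

```python
from datetime import datetime, timedelta, timezone

COMPLETE = {"DONE", "DONE-NOCONTENT"}
HALF_HOUR = timedelta(minutes=30)


def slot_time(date: str, hour: str) -> datetime:
    """Combine the DATE and HOUR columns into a UTC datetime."""
    return datetime.strptime(date + hour, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)


def find_cutoff(rows, now, run_slots=12):
    """Return the most recent slot, at or before 24 hours ago, that begins
    an unbroken backwards run of `run_slots` completed slots, or None.

    rows: iterable of (date, hour, status) tuples for ONE station.
    now:  timezone-aware UTC datetime.
    """
    statuses = {slot_time(d, h): s for d, h, s in rows}
    if not statuses:
        return None
    target = now - timedelta(hours=24)
    # ASSUMPTION: start at the slot containing the moment 24 hours ago.
    slot = target.replace(minute=30 if target.minute >= 30 else 0,
                          second=0, microsecond=0)
    oldest = min(statuses)
    run_end, run_len = None, 0
    while slot >= oldest:
        if statuses.get(slot, "") in COMPLETE:
            if run_len == 0:
                run_end = slot  # latest slot of the current run
            run_len += 1
            if run_len >= run_slots:
                return run_end
        else:
            run_len = 0  # an empty slot breaks the run
        slot -= HALF_HOUR
    return None
```

Lowering run_slots trades confidence for recency, which corresponds to the choice above between working backwards past every empty slot and accepting an occasional missing slot.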