Since August 2017, the Internet Archive's Television News Archive has compiled a daily chyron ("lower third") dataset called "Third Eye," created by extracting the lower third of each broadcast second by second and OCR'ing it to extract the text. Most researchers and journalists today access this dataset through its cleaned "tweets" summary edition, which extracts a small, succinct, human-friendly snippet each minute, but in doing so can remove considerable detail that changes the context and meaning of what is onscreen. Today we are excited to announce a new research chyron dataset designed specifically for researchers and journalists. Building on the Archive's incredible work, it reprocesses the underlying raw Third Eye stream with a new workflow designed specifically for research use, providing the full onscreen chyron text present each minute on CNN, MSNBC, Fox News and BBC News from August 25, 2017 through present, updated daily.
The Television News Archive's daily chyron dataset offers an incredibly rich editorialized narrative of the day's news events. While the closed captioning searched by the Television Explorer captures what was said on the air, chyrons reflect a parallel summarization and editorialization process in which news outlets produce their own one-sentence summaries of what is happening at the moment. For major live events, the closed captioning across CNN, MSNBC and Fox News is likely to be identical, but their chyrons will often reflect very different "takes" and interpretations of what is being said.
Today most researchers and journalists use the cleaned "tweets" summary edition produced by the Archive, which was originally developed to power a Twitter bot that tweeted headlines of the moment throughout the day. This dataset was designed to extract the single most succinct line from each minute of chyron text to post to Twitter, rather than record a maximal-fidelity version of the totality of the OCR'd chyrons for that minute. This means it often discards a substantial amount of text that can fundamentally change the context of a passage, removing critical detail or even entirely omitting a running narrative, making it appear as if the station did not cover a topic at all during a particular period. While this is ideal behavior for the Twitter bot the "tweets" dataset was designed for, it is more problematic for research and journalistic use, in which it is important to understand the full context of a statement or know whether a station mentioned a particular topic that day. In short, the "tweets" dataset was designed to extract a human-friendly summary for Twitter, rather than surface the totality of available text for deep analysis.
For example, at 9:52AM UTC on November 20, 2019, the "tweets" chyron dataset for Fox News contains only a single sentence "MOINES REGISTER/CNN/MEDIACOM, IOWA POLL OF 5OO LIKELY 2O2O DEMOCRATIC. CAUCUS-GOERS, NOV 8-13, MOE +/- 4.4%". There is no mention of Buttigieg, impeachment, the debate stage or Robin Biro. The new research chyron dataset contains "IMPEACHMEN INQUIRY EA REHGS ' BUTTIGIEG TAKES DEBATE STAGE AS IA FRONT-RUNNER I – BUTTIGIEG TAKES DEBATE STAGE AS IA FRONT- RUNNER | A Ilcnv Eula FEW anon n | BUTTIGIEG TAKES DEBATE STAGE AS IA FRONT- RUNNER | II I II R I Has 'LI/EAVRIEMIC mnm' anon 'n T ' m ,- MOINES REGISTER/CNN/MEDIACOM, IOWA POLL OF 500 LIKELY 2020 DEMOCRATIC CAUCUS-GOERS, NOV 8-13, MOE +/- 4 4% IMPEAVCHMEN ' It- ROBIN BIRO | FMR OBAMA-BIDEN CAMPAIGN OFFICIAL INQUIRY v as BUTTIGIEG TAKES DEBATE STAGE AS |A FRONT-RUNNER l- "3 9 ROBIN BIRO | FMR OBAMA- BIDEN CAMPAIGN OFFICIAL 'IMPEACHMEN ' INQUIRY BUITIGIEG TAKES DEBATE STAGE AS IA FRONT- RUNNER I BUTTIGIEG TAKES DEBATE STAGE AS IA FRONT- RUNNER | – HEARINGS Lllcnvuculcm'mon'n '4 ,7".
While it contains a high level of OCR error, it also captures that Fox News mentioned Buttigieg and impeachment repeatedly.
Similarly, at 6:40AM UTC on November 20, 2019, the "tweets" chyron dataset for Fox News contains only the clinical, matter-of-fact statement "MEDIA SPECULATES ABOUT PRESIDENT TRUMP'S HEALTH\nJOE CONCHA. MEDIA REPORTER FOR THE HILL". The new research chyron dataset for the same time slot reports the broader context of "I MEDIA SPECULATES ABOUT PRESIDENT TRUMP' 5 HEALTH WW WITH ANOTHER HOAX COLLAPSING, MEDIA SHIFTS BACK TO CONSPIRACY THEORIES' ABOUT TRUMP' S HEALTH My JOE CONCHA MEDIA REPORTER FOR THE HILL". Here the statement is framed in terms of "conspiracy theories" and a "hoax," dramatically changing its meaning from a mere clinical statement.
This new research chyron dataset is designed to capture the complete onscreen chyron text of each television news minute by reprocessing the underlying raw Third Eye data using a new research-oriented workflow. This means it will contain a vastly larger amount of OCR error, but ensures that researchers and journalists are able to access as close as possible to the complete onscreen chyron text each minute produced by Third Eye. Even in cases where the entire minute of chyron text is nearly indecipherable, with perhaps just a few words or a short passage of understandable text, this research chyron dataset will include it if the OCR error was stable, rather than discard it as the "tweets" dataset does, to allow researchers to discern any meaningful patterns in the text.
In short, this dataset will contain a very high amount of OCR error, but in return it provides access to the complete context of each minute of chyron text, rather than only a single succinct sentence.
The new research chyron dataset is available in two versions: HTML web viewer and TSV, available each morning by 1:30AM UTC. The TSV file contains five columns:
- DateTime. The date and time in UTC. A value of "2019-11-19 00:23" indicates 23 minutes after midnight UTC time on November 19, 2019 and reflects the chyrons onscreen from 00:23:00 to 00:23:59, inclusive.
- Station. The station name (currently one of "CNN", "MSNBC", "Fox News" or "BBC News").
- Show Name. The name of the broadcast, allowing for filtering to specific programs and personalities.
- Clip URL. The URL of the Internet Archive Television News Archive page for the specific one-minute clip, allowing you to see the actual chyrons (especially useful to verify/correct OCR error). Note that clips within the most recent 24 hours will not be available.
- Text. This is the actual combined chyron text for that minute.
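The five-column layout above can be loaded with nothing more than Python's standard `csv` module. The sketch below is illustrative only: it parses an inline sample in the described layout (assuming the TSV files begin with a header row naming the five columns; the clip URL and chyron text shown are hypothetical placeholders, not real records).

```python
import csv
import io

# Hypothetical inline sample in the five-column TSV layout described above.
# Real files would be downloaded from the Archive; this is for illustration.
sample_tsv = (
    "DateTime\tStation\tShow Name\tClip URL\tText\n"
    "2019-11-19 00:23\tCNN\tExample Show\t"
    "https://archive.org/example-clip\tIMPEACHMENT HEARINGS CONTINUE\n"
)

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Filter to a single station and print the chyron text for each minute.
for row in rows:
    if row["Station"] == "CNN":
        print(row["DateTime"], "-", row["Text"])
```

Because the file is plain TSV, the same filtering by `Station`, `Show Name`, or `DateTime` works equally well in pandas, a spreadsheet, or command-line tools like `awk`.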
The web viewer is a simple static HTML web page with five columns (the time in UTC and the four stations) and a row for each minute of chyron text. The chyrons for all four stations are shown across in a single row for each minute, making it easy to compare what the four stations were saying at a given moment. Beneath each chyron is a link to the Internet Archive Television News Archive page for the specific one-minute clip, allowing you to see the actual chyrons (especially useful to verify/correct OCR error). Note that clips within the most recent 24 hours will not be available.
Both files are available in the format below (replace "YYYY-MM-DD" with a date between "2017-08-25" and one day ago):
For example, the files for November 19, 2019 during Lt. Col. Vindman's testimony would be:
We're incredibly excited to see what you're able to do with this new dataset and hope it greatly expands access to the Archive's tremendous Third Eye dataset! Stay tuned for a user-friendly viewing interface and search system coming shortly!
The dataset is compiled by taking the underlying raw 1fps OCR results and clustering them by minute using edit distance. The full text of each per-second snapshot is compared against every other snapshot for that station in that minute and collapsed into clusters based on total edit distance, with a target intra-cluster similarity of 84% (snapshots with a greater edit distance are broken into their own clusters). This ensures that the majority of variants of a given sentence are grouped together, making the process as invariant as possible to routine random and systematic OCR error. Passages that appeared onscreen for 2 seconds or less in any form (across all clustered variants) are discarded. This ensures that true random OCR error is discarded, while stable OCR error is preserved. All variants within a cluster are then scored using CLD2 in "best guess" mode and are ranked by the raw unnormalized CLD2 score of their highest language match (since short chyron passages, especially proper names, will frequently yield a best guess of a language other than English). The highest-scored passage is then selected as the most representative of that cluster. The use of CLD2 to score passages means that the version with the fewest OCR errors is typically selected. Frequently there may be no single "best" version of a passage (such as "politicians were eledted" versus "politicans were elected"), in which case CLD2 will select the version it views as "overall" better.
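The cluster-then-select pipeline above can be sketched in a few lines. This is a simplified illustration, not the actual workflow: `difflib.SequenceMatcher.ratio()` stands in for the true normalized edit distance, and a crude character-level heuristic substitutes for CLD2 scoring (the sample snapshot strings are invented for demonstration).

```python
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.84   # target intra-cluster similarity from the post
MIN_SECONDS = 3        # passages onscreen for 2 seconds or less are discarded

def cluster_snapshots(snapshots):
    """Greedy clustering of per-second OCR snapshots for one minute.

    SequenceMatcher.ratio() stands in here for the normalized edit
    distance used in the real pipeline (an assumption of this sketch).
    """
    clusters = []  # each cluster is a list of variant strings
    for snap in snapshots:
        for cluster in clusters:
            if SequenceMatcher(None, snap, cluster[0]).ratio() >= SIM_THRESHOLD:
                cluster.append(snap)
                break
        else:
            clusters.append([snap])
    # Drop clusters seen for 2 seconds or less across all variants,
    # discarding one-off random OCR glitches while keeping stable error.
    return [c for c in clusters if len(c) >= MIN_SECONDS]

def best_variant(cluster):
    """Pick the most representative variant of a cluster.

    The real pipeline ranks variants by raw CLD2 language score; this
    stand-in simply favors the variant with the most alphabetic content.
    """
    return max(cluster, key=lambda s: sum(ch.isalpha() or ch == " " for ch in s))

# Simulated per-second snapshots: a stable chyron with one OCR variant
# ("5TAGE") plus a single second of pure noise.
seconds = (
    ["BUTTIGIEG TAKES DEBATE STAGE AS IA FRONT-RUNNER"] * 4
    + ["BUTTIGIEG TAKES DEBATE 5TAGE AS IA FRONT-RUNNER"] * 2
    + ["QX#@ RANDOM GLITCH"]
)
stable = [best_variant(c) for c in cluster_snapshots(seconds)]
print(stable)  # the noise second is dropped; the cleanest variant survives
```

The two thresholds mirror the two filters described above: the similarity cutoff groups OCR variants of the same sentence, and the minimum-duration cutoff separates transient random error from stable, repeated text.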
This combination of edit distance clustering and CLD2 score ranking yields a highly reasonable reproduction of the onscreen chyron text for a given minute. While this text may be riddled with OCR error and in some cases may contain no recognizable English words, it represents the most stable and "best" version of the text for that minute.
After extensive experimentation with automated OCR correction and more aggressive discarding, we've decided for the time being not to perform further refinement of the resulting passages. One challenge to automated correction is that systematic OCR error from the font selection of the four stations and the heavy focus on proper names and "turns of phrase" favored by chyron editors frequently runs counter to the correction biases of many traditional OCR correction algorithms, which are designed for the more common fonts found in traditional printed matter and tuned for general English use.
In particular, names like "Buttigieg" and "Vindman" are frequently incorrectly flagged as OCR error, while non-European-descent names are also frequently flagged as error and "corrected" to similar English words. The short snippets and lack of context make it hard for correction algorithms to understand the words as proper names. While daily histograms from closed captioning could be used to build realtime language models that contain many of the most common names from the day, the names and turns of phrase present in the chyrons are not always well represented in closed captioning. That said, we are exploring a number of approaches to refining the chyron text, so stay tuned!