The GDELT Project

Announcing Per-Second Television News Research Chyrons Dataset

Last month we announced a new research-grade chyron / "lower third" dataset for BBC News, CNN, MSNBC, and Fox News covering August 25, 2017 through the present. It was created by reprocessing the daily "Third Eye" chyron dataset from the Internet Archive's Television News Archive with a new workflow that uses language models and sophisticated clustering to reanalyze the Archive's OCR output. The resulting research-grade dataset includes all of the onscreen chyron text, with each cluster filtered to its most legible version, rather than filtered down to excerpts that remove much of the text.

The end result is a minute-by-minute chronology of the onscreen chyron text of the four stations over the past three years.

Minute-level resolution is ideal for manual comparisons of agenda setting and framing, but certain high-precision applications like chyron search require second-level resolution that records the exact number of seconds each chyron was onscreen.

Today we are excited to announce a new per-second television news research chyrons dataset that does exactly this!

The new files follow a similar format to the per-minute TSV files, but instead of concatenating together all of the unique text blocks onscreen during a given minute, they break each minute into individual clusters, each representing a distinct grouping of text. Thus, if a given minute alternates every 10 seconds between "Congress voting on Trump impeachment" and "Breaking: Trump impeached," that minute is represented as two rows, one for each of the two clusters, with each listed as being onscreen for 30 seconds during that minute.
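To make that row structure concrete, here is a minimal Python sketch, assuming one OCR'd text sample per second of a minute. It simply counts distinct texts, whereas the actual workflow applies language models and sophisticated clustering to noisy OCR output; the function name and data layout below are illustrative, not part of the dataset.

```python
from collections import Counter

def cluster_rows_for_minute(samples):
    """Given one chyron text sample per second of a minute, return
    (text, seconds_onscreen) rows, one per distinct text cluster.

    Illustrative sketch only: exact-match counting stands in for the
    real language-model clustering over noisy OCR output.
    """
    counts = Counter(samples)   # seconds each distinct text was onscreen
    # One row per cluster, most-seen first, mirroring the per-second files.
    return counts.most_common()

# A minute alternating between two chyrons every 10 seconds:
minute = (["Congress voting on Trump impeachment"] * 10 +
          ["Breaking: Trump impeached"] * 10) * 3

rows = cluster_rows_for_minute(minute)
# Two rows, each listed as onscreen for 30 of the minute's 60 seconds.
```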

The final files are named "lowerthirdclusters.tsv" and have the following fields:

For example, the file for November 19, 2019 during Lt. Col. Vindman's testimony would be:

The dataset is also available as a table in BigQuery:
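As one illustration of how such a table might be queried, the sketch below builds a BigQuery Standard SQL query string that totals onscreen seconds per chyron cluster for a single day. The table path and column names (`station`, `text`, `seconds_onscreen`, `date`) are hypothetical placeholders, not the dataset's actual schema; consult the table documentation for the real field names.

```python
# Placeholder, NOT the real table path:
TABLE = "your-project.your_dataset.lowerthirdclusters"

def top_clusters_query(table, date, limit=10):
    """Build a Standard SQL query string (not executed here) that ranks
    chyron clusters by total seconds onscreen on a given day.
    Column names are assumed for illustration."""
    return f"""
        SELECT station, text, SUM(seconds_onscreen) AS total_seconds
        FROM `{table}`
        WHERE date = '{date}'
        GROUP BY station, text
        ORDER BY total_seconds DESC
        LIMIT {limit}
    """

query = top_clusters_query(TABLE, "2019-11-19")
```

The resulting string could then be passed to a BigQuery client or pasted into the BigQuery console.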

For more details on the underlying reprocessing workflow, see the original announcement last month. We're tremendously excited to see what you're able to accomplish with this new dataset!