Television News Readability Scores Dataset 2009-2020

UPDATE: See this deep dive on the influence of captioner shifts and the difference between transcripts and captioning in results.

Earlier this week we unveiled a massive new dataset of readability scores for worldwide English language online news coverage monitored by GDELT in 2020. Today we're tremendously excited to unveil its counterpart for television news, computed from the closed captioning transcripts of the Internet Archive's Television News Archive!

Identical to our web readability scores, we used the GNU "style" utility to compute readability statistics and provide its output as-is. There are significant open methodological questions in applying readability metrics designed for written text to transcripts of spoken-word content, especially surrounding how human transcriptionists divide a series of utterances into sentences. One channel's transcriptionists might render its spoken words as lengthy sentences of many words, while another channel's might break its captioning into many short sentences. Since many readability metrics revolve around sentence-based counts, this difference can overwhelm all other considerations: the readability scores here are enormously influenced by each channel's decisions about how to divide its daily narration and dialog into discrete sentence units. There are also unresolved questions regarding better approaches to subdividing continuous speech into discrete temporal units, which divide sentences at arbitrary boundaries. Ideally, commercials would be entirely separated from news content, whereas here captioned commercials influence the results. These issues aside, we hope this dataset inspires a broader discussion around how to understand the comprehensibility of television news and its ability to communicate complex topics to the public, especially around issues of major public concern such as the Covid-19 pandemic.

Three sets of readability scores are computed for each broadcast (all three are included in each file, with the "res" field in the dataset used to distinguish each):

  • Broadcast. The entire captioning of a broadcast is concatenated together and "style" is run on it to compute broadcast-level scores.
  • 10 Minute. The broadcast is divided into 10 minute segments with boundaries identical to those used for the Television News NGrams 2.0 dataset, allowing analyses that use the ngrams dataset to identify relevant 10 minute segments and then use this dataset to access their readability scores. Some channels use extremely long sentences that can span lengthy portions of two bordering segments, so instead of aggregating by sentence, we aggregate by captioning line. This means that a sentence split across multiple captioning lines spanning two 10 minute segments will be divided between those segments. If the text in a given 10 minute segment does not end with a punctuation character, one is added so that style counts the last "sentence" in its results. These segments typically yield around 700-800 words per textual passage analyzed by style.
  • 1 Minute. The broadcast is divided into 1 minute segments, allowing fine-grained temporal analysis of a broadcast. For example, the TV 2.0 API can be used to identify all of the minutes mentioning a given topic such as Covid-19, and the average readability scores of those minutes can be compared with the baseline of the surrounding minutes for that day. The handling of captioning line boundaries is identical to that used for 10 minute segments. Note that this can yield as few as 100-200 words per textual passage analyzed by style, producing extreme variations from minute to minute throughout a broadcast.
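As a rough sketch of the segmentation logic described above (the actual production pipeline is internal; this is only an illustration under the stated assumptions), captioning lines can be grouped into fixed-width time buckets, aggregating by captioning line rather than by sentence, with a terminal punctuation mark appended so that style still counts the trailing partial sentence:

```python
from collections import defaultdict

def bucket_captions(lines, width_secs=600):
    """Group (offset_seconds, caption_text) pairs into fixed-width
    time buckets (600 seconds = the 10 minute resolution; pass 60
    for the 1 minute resolution), aggregating by captioning line."""
    buckets = defaultdict(list)
    for offset, text in lines:
        buckets[offset // width_secs].append(text.strip())
    passages = {}
    for idx, texts in sorted(buckets.items()):
        passage = " ".join(texts)
        # If the bucket's text doesn't end with punctuation, append a
        # period so "style" counts the final partial sentence too.
        if passage and passage[-1] not in ".!?":
            passage += "."
        passages[idx] = passage
    return passages
```

A sentence whose captioning lines straddle a bucket boundary is simply split between the two buckets, mirroring the line-level aggregation described above.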

Each broadcast's readability scores are in their own file in JSON-NL format with the following fields:

  • date. The precise time to the minute in UTC of the start of the relevant block. For "show" this is the start of the show itself. For "10min" and "1min" blocks, this is the start of that block of time.
  • res. The resolution of the given readability scores. Either "show" for the broadcast-level scores, "10min" for the 10 minute scores or "1min" for the 1 minute scores.
  • iaId. The Internet Archive Television News Archive identifier that uniquely identifies the given broadcast.
  • station. The specific channel the broadcast appeared on (such as "CNN").
  • show. The name of the broadcast's show (such as "Headline News" for CNN), enabling show-level readability comparisons.
  • readabilityScores. A JSON-escaped string containing the complete raw output of GNU style (see the web readability scores for its format). This is the literal output of style as-is.
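Since records at all three resolutions are interleaved in each file, a consumer will typically filter on the "res" field. A minimal sketch of reading one of these JSON-NL files (the helper name is our own, not part of the dataset):

```python
import json

def load_scores(path, res="1min"):
    """Yield records of the requested resolution ("show", "10min",
    or "1min") from a per-broadcast JSON-NL file, one JSON object
    per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("res") == res:
                yield rec
```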

For example, here is the entry for one minute of CNN at 4:26PM UTC on November 20, 2020:

{
"date": "2020-11-20 16:26:00 UTC",
"res": "1min",
"iaId": "CNNW_20201120_160000_CNN_Newsroom",
"station": "CNN",
"show": "CNN Newsroom",
"readabilityScores": "readability grades:\n Kincaid: 5.1\n ARI: 6.2\n Coleman-Liau: 9.5\n Flesch Index: 76.3/100\n Fog Index: 7.4\n Lix: 35.4 = school year 5\n SMOG-Grading: 8.0\nsentence info:\n 625 characters\n 130 words, average length 4.81 characters = 1.42 syllables\n 13 sentences, average length 10.0 words\n 38% (5) short sentences (at most 5 words)\n 15% (2) long sentences (at least 20 words)\n 1 paragraphs, average length 13.0 sentences\n 7% (1) questions\n 15% (2) passive sentences\n longest sent 20 wds at sent 4; shortest sent 2 wds at sent 2\nword usage:\n verb types:\n to be (2) auxiliary (4) \n types as % of total:\n conjunctions 10% (13) pronouns 10% (13) prepositions 8% (11)\n nominalizations 1% (1)\nsentence beginnings:\n pronoun (0) interrogative pronoun (1) article (0)\n subordinating conjunction (1) conjunction (1) preposition (0)\n"
}

Here is the human-readable version of the readabilityScores field:

readability grades:
     Kincaid: 5.1
     ARI: 6.2
     Coleman-Liau: 9.5
     Flesch Index: 76.3/100
     Fog Index: 7.4
     Lix: 35.4 = school year 5
     SMOG-Grading: 8.0
sentence info:
     625 characters
     130 words, average length 4.81 characters = 1.42 syllables
     13 sentences, average length 10.0 words
     38% (5) short sentences (at most 5 words)
     15% (2) long sentences (at least 20 words)
     1 paragraphs, average length 13.0 sentences
     7% (1) questions
     15% (2) passive sentences
     longest sent 20 wds at sent 4; shortest sent 2 wds at sent 2
word usage:
     verb types:
     to be (2) auxiliary (4) 
     types as % of total:
     conjunctions 10% (13) pronouns 10% (13) prepositions 8% (11)
     nominalizations 1% (1)
sentence beginnings:
     pronoun (0) interrogative pronoun (1) article (0)
     subordinating conjunction (1) conjunction (1) preposition (0)
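Because readabilityScores is the literal text output of style, downstream analysis usually needs to parse it back into numbers. A rough sketch, with regex patterns inferred from the sample output above (not a formal grammar of style's output format):

```python
import re

# The named readability grades that appear in style's output.
GRADE_NAMES = ("Kincaid", "ARI", "Coleman-Liau", "Flesch Index",
               "Fog Index", "Lix", "SMOG-Grading")

def parse_grades(raw):
    """Extract the readability grades from the raw GNU style output
    stored in the readabilityScores field, as a name -> float dict."""
    grades = {}
    for name in GRADE_NAMES:
        # Match "Name: 5.1", ignoring trailing text like "/100" or
        # "= school year 5".
        m = re.search(re.escape(name) + r":\s*([\d.]+)", raw)
        if m:
            grades[name] = float(m.group(1))
    return grades
```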

We compute readability scores for all of the same channels we do for the Television News NGrams 2.0 dataset:

  • ABC (San Francisco affiliate KGO) (7/15/2010 – present)
  • Al Jazeera (11/11/2017 – present)
  • BBC News (1/1/2017 – present)
  • Bloomberg (12/5/2013 – present)
  • CBS (San Francisco affiliate KPIX) (7/15/2010 – present)
  • CNBC (7/2/2009 – present)
  • CNN (7/2/2009 – present)
  • CSPAN (6/4/2009 – present)
  • CSPAN2 (6/4/2009 – present)
  • CSPAN3 (1/26/2012 – present)
  • DeutscheWelle (9/19/2017 – present)
  • FOX (San Francisco affiliate KTVU) (7/15/2010 – present)
  • Fox Business (8/20/2012 – present)
  • Fox News (7/2/2009 – present)
  • LinkTV (2/20/2012 – present)
  • MyNetworkTV (KRON) (7/14/2010 – present)
  • NBC (San Francisco affiliate KNTV) (7/16/2010 – present)
  • MSNBC (7/2/2009 – present)
  • PBS (San Francisco affiliate KQED) (7/14/2010 – present)
  • San Francisco Government TV (SFGTV) (7/15/2010 – present)
  • Russia Today (7/15/2010 – present)
  • Telemundo (San Francisco affiliate KSTS) (7/16/2010 – present)
  • Univision (San Francisco affiliate KDTV) (7/2/2009 – present)

Note that scores have currently been computed through November 25, 2020 and are not being updated as we gauge interest in these kinds of readability scores and learn from all of you which kinds of scores you're most interested in seeing. Note also that some of the channels above broadcast some or most of their content in a language other than English. The GNU style readability scores are designed for English, meaning the scores for non-English content are not methodologically valid and should be ignored; we did not perform per-broadcast language detection to filter out those broadcasts.

The entire dataset can be downloaded as UTF8 JSON-NL files, one per broadcast. To determine which broadcasts are available, a daily inventory file named in the format YYYYMMDD.txt is available from July 2, 2009 through present, listing all of the readability score files for that day. For example, for June 1, 2020 the file would be:

  • http://data.gdeltproject.org/gdeltv3/iatv/readability/20200601.txt
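To enumerate a date range, the daily inventory URLs can be generated mechanically from the pattern above. A quick sketch:

```python
from datetime import date, timedelta

BASE = "http://data.gdeltproject.org/gdeltv3/iatv/readability/"

def inventory_urls(start, end):
    """Yield one daily inventory URL per day in [start, end],
    following the YYYYMMDD.txt naming pattern."""
    d = start
    while d <= end:
        yield BASE + d.strftime("%Y%m%d") + ".txt"
        d += timedelta(days=1)
```

Each inventory file then lists the per-broadcast JSON-NL files to fetch for that day.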

The entire dataset is also available in BigQuery.

We're incredibly excited to see what kinds of powerful new research questions become possible with this new dataset!