The GDELT Project

Television News Readability Scores Dataset 2009-2020

UPDATE: See this deep dive on the influence of captioner shifts and the difference between transcripts and captioning in results.

Earlier this week we unveiled a massive new dataset of readability scores for worldwide English language online news coverage monitored by GDELT in 2020. Today we're tremendously excited to unveil its counterpart for television news, computed from the closed captioning transcripts of the Internet Archive's Television News Archive!

As with our web readability scores, we used the GNU style utility to compute readability statistics and provide its output as-is. There are significant open methodological questions in applying readability metrics designed for written text to transcripts of spoken word content, especially surrounding how the human transcriptionists divided each series of utterances into sentences. In some cases, one channel's transcriptionists might divide its spoken words into lengthy sentences of many words, while another channel might break its captioning into many short sentences. Because many readability metrics revolve around sentence-based counts, such differences can overwhelm all other considerations. Thus, the readability scores here are enormously influenced by the decisions made by each channel on how to divide its daily narration and dialog into discrete sentence units.

There are also unresolved questions regarding how best to subdivide continuous speech into discrete temporal units, since fixed time windows divide sentences at arbitrary boundaries. Ideally, commercials would be entirely separated from news content, but here captioned commercials influence the results. These issues aside, we hope this dataset inspires a broader discussion around how to understand the comprehensibility of television news and its ability to communicate complex topics to the public, especially around issues of major public concern such as the Covid-19 pandemic.
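To make the sentence-division concern concrete, here is a minimal sketch using the standard Flesch-Kincaid grade formula. The word count and syllables-per-word figures are taken from the sample CNN record shown later in this post; the alternative sentence counts (26 and 5) are hypothetical, chosen only to illustrate how the same words could score under different captioning styles. GNU style's own syllable counting may differ in detail, so treat the numbers as illustrative.

```python
def kincaid_grade(words: int, sentences: int, syllables_per_word: float) -> float:
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (words / sentences) + 11.8 * syllables_per_word - 15.59


# Figures from the sample record in this post: 130 words, 1.42 syllables per word.
WORDS = 130
SYLLABLES_PER_WORD = 1.42

# Same words, three ways a captioner might punctuate them.
# 13 is the count in the sample record; 26 and 5 are hypothetical.
for sentences in (26, 13, 5):
    grade = kincaid_grade(WORDS, sentences, SYLLABLES_PER_WORD)
    print(f"{sentences:>2} sentences -> Kincaid grade {grade:.1f}")
```

With 13 sentences the formula reproduces the 5.1 reported in the sample record. Halving or lengthening the sentences moves the grade by several school years without changing a single word.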

Three sets of readability scores are computed for each broadcast (all three are included in each file, with the "res" field in the dataset used to distinguish each):

Each broadcast's readability scores are stored in their own JSON-NL file, with each record in the following format:

For example, here is the entry for one minute of CNN at 4:26PM UTC on November 20, 2020:

{
"date": "2020-11-20 16:26:00 UTC",
"res": "1min",
"iaId": "CNNW_20201120_160000_CNN_Newsroom",
"station": "CNN",
"show": "CNN Newsroom",
"readabilityScores": "readability grades:\n Kincaid: 5.1\n ARI: 6.2\n Coleman-Liau: 9.5\n Flesch Index: 76.3/100\n Fog Index: 7.4\n Lix: 35.4 = school year 5\n SMOG-Grading: 8.0\nsentence info:\n 625 characters\n 130 words, average length 4.81 characters = 1.42 syllables\n 13 sentences, average length 10.0 words\n 38% (5) short sentences (at most 5 words)\n 15% (2) long sentences (at least 20 words)\n 1 paragraphs, average length 13.0 sentences\n 7% (1) questions\n 15% (2) passive sentences\n longest sent 20 wds at sent 4; shortest sent 2 wds at sent 2\nword usage:\n verb types:\n to be (2) auxiliary (4) \n types as % of total:\n conjunctions 10% (13) pronouns 10% (13) prepositions 8% (11)\n nominalizations 1% (1)\nsentence beginnings:\n pronoun (0) interrogative pronoun (1) article (0)\n subordinating conjunction (1) conjunction (1) preposition (0)\n"
}
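Records like the one above can be read with any JSON library, one line at a time. The following is a minimal sketch, not an official loader: the helper name `read_records` is our own, the sample line is an abbreviated copy of the record above, and "1min" is the only "res" value assumed here because it is the only one shown in this post.

```python
import json


def read_records(lines):
    """Yield one dict per non-empty JSON-NL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for the contents of one broadcast file (readabilityScores abbreviated).
# A real file could be opened with open(path, encoding="utf-8") and passed in
# the same way.
sample_lines = [
    '{"date": "2020-11-20 16:26:00 UTC", "res": "1min", '
    '"iaId": "CNNW_20201120_160000_CNN_Newsroom", "station": "CNN", '
    '"show": "CNN Newsroom", '
    '"readabilityScores": "readability grades:\\n Kincaid: 5.1\\n"}',
    "",
]

records = list(read_records(sample_lines))

# Keep only the records at one resolution, using the "res" field.
one_minute = [r for r in records if r["res"] == "1min"]
print(len(one_minute), one_minute[0]["station"], one_minute[0]["date"])
```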

Here is the human-readable version of the readabilityScores field:

readability grades:
     Kincaid: 5.1
     ARI: 6.2
     Coleman-Liau: 9.5
     Flesch Index: 76.3/100
     Fog Index: 7.4
     Lix: 35.4 = school year 5
     SMOG-Grading: 8.0
sentence info:
     625 characters
     130 words, average length 4.81 characters = 1.42 syllables
     13 sentences, average length 10.0 words
     38% (5) short sentences (at most 5 words)
     15% (2) long sentences (at least 20 words)
     1 paragraphs, average length 13.0 sentences
     7% (1) questions
     15% (2) passive sentences
     longest sent 20 wds at sent 4; shortest sent 2 wds at sent 2
word usage:
     verb types:
     to be (2) auxiliary (4) 
     types as % of total:
     conjunctions 10% (13) pronouns 10% (13) prepositions 8% (11)
     nominalizations 1% (1)
sentence beginnings:
     pronoun (0) interrogative pronoun (1) article (0)
     subordinating conjunction (1) conjunction (1) preposition (0)
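Because the readabilityScores field is the raw text output of GNU style rather than structured data, most analyses will need to pull the numbers out of it first. Here is a minimal sketch that extracts the "readability grades" section into a dictionary. The function name `parse_grades` and the regular expression are our own assumptions based on the sample output above, not part of the dataset or of GNU style.

```python
import re

# Matches lines such as " Kincaid: 5.1" or " Flesch Index: 76.3/100".
# Only the first number after the colon is captured.
GRADE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z -]*?):\s*(-?\d+(?:\.\d+)?)")


def parse_grades(report: str) -> dict:
    """Extract the numeric values from the 'readability grades' section."""
    grades = {}
    in_grades = False
    for line in report.splitlines():
        if line.strip() == "readability grades:":
            in_grades = True
            continue
        if in_grades:
            # The section ends at the next unindented header, e.g. "sentence info:".
            if not line.startswith(" "):
                break
            match = GRADE_LINE.match(line)
            if match:
                grades[match.group(1)] = float(match.group(2))
    return grades


# First part of the readabilityScores value from the sample record.
sample = (
    "readability grades:\n Kincaid: 5.1\n ARI: 6.2\n Coleman-Liau: 9.5\n"
    " Flesch Index: 76.3/100\n Fog Index: 7.4\n Lix: 35.4 = school year 5\n"
    " SMOG-Grading: 8.0\nsentence info:\n 625 characters\n"
)
print(parse_grades(sample))
```

The same approach could be extended to the "sentence info" section, which matters here because sentence counts drive so many of the grades.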

We compute readability scores for the same channels covered by the Television News NGrams 2.0 dataset:

Note that at this time scores were computed through November 25, 2020 and are not currently being updated as we gauge interest in these kinds of readability scores and learn from all of you which kinds of scores you're most interested in seeing. Note also that some of the channels above broadcast some or most of their content in a language other than English. The GNU style readability scores are designed for English, meaning the scores for non-English content are not methodologically valid and should be ignored. We did not perform per-broadcast language detection to filter out those broadcasts.

The entire dataset can be downloaded as UTF-8 JSON-NL files, one per broadcast. To determine which shows are available, consult the daily inventory files, which are named in the format YYYYMMDD and are available from July 2, 2009 through present. Each one lists all of the readability score files for that day. For example, for June 1, 2020 the file would be:

The entire dataset is also available in BigQuery:

We're incredibly excited to see what kinds of powerful new research questions become possible with this new dataset!