The Methodological Challenges Of Interpreting Readability Scores For Spoken Word Television News Transcripts

Yesterday we announced the new Television News Readability Scores Dataset, built from the Internet Archive's Television News Archive. One of the methodological challenges we noted in creating this dataset is the mismatch between stream-of-consciousness spoken word transcripts, in which punctuation is artificially inserted by human transcriptionists, and the strong dependence of many readability statistics on precise and accurate sentence boundaries. If one transcriptionist breaks a broadcast into many short sentences while another strings the same words together into long comma-delimited sentences, the identical broadcast can yield two very different readability scores. Breaking broadcasts into artificial temporal blocks of one and ten minutes imposes yet more methodological complications in interpreting the results.
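To make the punctuation sensitivity concrete, here is a minimal sketch (not the dataset's actual scoring pipeline, and using a deliberately naive syllable counter) showing how the same words, punctuated two different ways, produce different Flesch Reading Ease scores:

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# One transcriptionist uses short sentences; another runs the same
# words together with commas. Only the sentence count differs.
short_sentences = "The storm hit the coast. Thousands lost power. Crews are working now."
run_together = "The storm hit the coast, thousands lost power, crews are working now."

print(flesch_reading_ease(short_sentences))  # higher score: 3 short sentences
print(flesch_reading_ease(run_together))     # lower score: 1 long sentence
```

Because the words-per-sentence term is the only thing that changes, the transcriptionist's punctuation choices alone shift the score, with no change to what was actually said on air.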

One symptom of this can be seen in the way that some readability scores violate their traditional min/max boundaries. For example, the Flesch Index should have a maximum value of 100, yet some entries in the television news dataset exceed it.

The following BigQuery query will identify such records among the 10-minute resolution scores:

SELECT score, * FROM (
  SELECT
    CAST(REGEXP_EXTRACT(readabilityScores, r'Flesch Index: (.*?)/') AS FLOAT64) AS score,
    *
  FROM `gdelt-bq.gdeltv2.readability_iatv`
  WHERE station = 'CNN'
    AND DATE(date) = '2009-07-02'
    AND res = '10min'
)
WHERE score > 100

This yields entries like the following, for a 10 minute segment beginning at 19:10 UTC on July 2, 2009 on CNN:

{
"score": "111.4",
"score_1": "111.4",
"date": "2009-07-02 19:10:00 UTC",
"res": "10min",
"iaId": "CNN_20090702_190000_CNN_Newsroom",
"station": "CNN",
"show": "CNN Newsroom",
"readabilityScores": "readability grades:\n Kincaid: 0.4\n ARI: 3.5\n Coleman-Liau: 5.9\n Flesch Index: 111.4/100\n Fog Index: 4.3\n Lix: 26.3 = below school year 5\n SMOG-Grading: 3.0\nsentence info:\n 399 characters\n 96 words, average length 4.16 characters = 1.00 syllables\n 9 sentences, average length 10.7 words\n 22% (2) short sentences (at most 6 words)\n 0% (0) long sentences (at least 21 words)\n 1 paragraphs, average length 9.0 sentences\n 0% (0) questions\n 77% (7) passive sentences\n longest sent 17 wds at sent 9; shortest sent 6 wds at sent 4\nword usage:\n verb types:\n to be (9) auxiliary (0) \n types as % of total:\n conjunctions 2% (2) pronouns 23% (22) prepositions 12% (12)\n nominalizations 2% (2)\nsentence beginnings:\n pronoun (6) interrogative pronoun (0) article (0)\n subordinating conjunction (0) conjunction (0) preposition (0)\n"
}
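The readabilityScores field arrives as a single escaped string. A minimal sketch of pulling the numeric grades out of it into a dictionary (field layout per the record above; the parsing helper here is illustrative, not part of the dataset tooling):

```python
import re

def parse_readability_grades(raw):
    """Extract name/value pairs such as 'Kincaid: 0.4' from the grades section."""
    # The grades section runs from "readability grades:" to "sentence info:".
    section = raw.split("readability grades:")[1].split("sentence info:")[0]
    grades = {}
    for name, value in re.findall(r"([A-Za-z-]+(?: [A-Za-z-]+)*): ([\d.]+)", section):
        grades[name] = float(value)
    return grades

raw = ("readability grades:\n Kincaid: 0.4\n ARI: 3.5\n Coleman-Liau: 5.9\n"
       " Flesch Index: 111.4/100\n Fog Index: 4.3\n"
       " Lix: 26.3 = below school year 5\n SMOG-Grading: 3.0\nsentence info:\n")

print(parse_readability_grades(raw)["Flesch Index"])  # → 111.4
```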

Converting the readabilityScores field to human-friendly formatting yields the following:

readability grades:
     Kincaid: 0.4
     ARI: 3.5
     Coleman-Liau: 5.9
     Flesch Index: 111.4/100
     Fog Index: 4.3
     Lix: 26.3 = below school year 5
     SMOG-Grading: 3.0
sentence info:
     399 characters
     96 words, average length 4.16 characters = 1.00 syllables
     9 sentences, average length 10.7 words
     22% (2) short sentences (at most 6 words)
     0% (0) long sentences (at least 21 words)
     1 paragraphs, average length 9.0 sentences
     0% (0) questions
     77% (7) passive sentences
     longest sent 17 wds at sent 9; shortest sent 6 wds at sent 4
word usage:
     verb types:
     to be (9) auxiliary (0) 
     types as % of total:
     conjunctions 2% (2) pronouns 23% (22) prepositions 12% (12)
     nominalizations 2% (2)
sentence beginnings:
     pronoun (6) interrogative pronoun (0) article (0)
     subordinating conjunction (0) conjunction (0) preposition (0)

Note how the Flesch Index here is 111.4, above its nominal maximum value of 100.
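The overflow is arithmetic rather than a data error: the Flesch Reading Ease formula, 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/word), has no built-in ceiling, and the short, monosyllabic sentences typical of spoken news push it past 100. Plugging in this segment's own reported statistics reproduces the score:

```python
# Flesch Reading Ease recomputed from the segment's reported statistics:
# 96 words, 9 sentences, average 1.00 syllables per word.
words, sentences, syllables_per_word = 96, 9, 1.00

score = 206.835 - 1.015 * (words / sentences) - 84.6 * syllables_per_word
print(round(score, 1))  # → 111.4
```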

One of the reasons for creating this dataset is to enable further research into how readability statistics can be adapted to spoken word transcripts like television news, especially as we move toward greater reliance on automated transcripts, in which punctuation is assigned by language models that can have an outsized influence on sentence-based scores. We're excited to see what kinds of research this enables!