Comparing Television News Readability Between Transcripts And Closed Captioning

When it comes to translating television news into searchable text, both realtime closed captioning and cleaned post-broadcast rush transcripts are typed by humans, but they differ considerably in quality. Closed captioning is typically all uppercase or all lowercase, has limited punctuation and is often marred by gaps and typos. Cleaned rush transcripts have proper capitalization, correct speaker attribution, the complete spoken text and punctuation that matches speaker cadence. How do these differences affect readability scores?

Below is the output of GNU style applied to CNN's rush transcript for CNN Newsroom December 1, 2020 from 9-10AM PST:

readability grades:
     Kincaid: 6.3
     ARI: 7.0
     Coleman-Liau: 8.5
     Flesch Index: 76.4/100
     Fog Index: 9.7
     Lix: 33.8 = below school year 5
     SMOG-Grading: 9.5
sentence info:
     17629 characters
     3948 words, average length 4.47 characters = 1.36 syllables
     266 sentences, average length 14.8 words
     46% (124) short sentences (at most 10 words)
     18% (50) long sentences (at least 25 words)
     77 paragraphs, average length 3.5 sentences
     4% (13) questions
     39% (104) passive sentences
     longest sent 66 wds at sent 258; shortest sent 1 wds at sent 18
word usage:
     verb types:
     to be (166) auxiliary (56)
     types as % of total:
     conjunctions 5% (206) pronouns 13% (497) prepositions 11% (449)
     nominalizations 1% (44)
sentence beginnings:
     pronoun (69) interrogative pronoun (7) article (10)
     subordinating conjunction (3) conjunction (44) preposition (5)

And here are the results for the closed captioning of the same broadcast. The captioning covers the full hour-long broadcast while the rush transcript covers only the first 30 minutes, so the captioning was truncated to end at the same point as the rush transcript:

readability grades:
     Kincaid: 1.8
     ARI: 5.9
     Coleman-Liau: 7.9
     Flesch Index: 105.9/100
     Fog Index: 5.6
     Lix: 32.6 = below school year 5
     SMOG-Grading: 4.9
sentence info:
     18244 characters
     4140 words, average length 4.41 characters = 1.04 syllables
     314 sentences, average length 13.2 words
     41% (130) short sentences (at most 8 words)
     16% (53) long sentences (at least 23 words)
     1 paragraphs, average length 314.0 sentences
     3% (11) questions
     38% (122) passive sentences
     longest sent 61 wds at sent 307; shortest sent 1 wds at sent 32
word usage:
     verb types:
     to be (160) auxiliary (66)
     types as % of total:
     conjunctions 4% (171) pronouns 12% (485) prepositions 11% (470)
     nominalizations 1% (54)
sentence beginnings:
     pronoun (90) interrogative pronoun (13) article (22)
     subordinating conjunction (8) conjunction (18) preposition (12)

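The basic counts of characters, words and sentences are broadly similar across the two versions, yet the readability scores differ dramatically: the Kincaid grade falls from 6.3 to 1.8 and the Flesch Index rises from 76.4 to 105.9, beyond its nominal 100-point ceiling. The measures that depend on syllable counts (Kincaid, Flesch, Fog and SMOG) move the most, while the character-based ARI, Coleman-Liau and Lix scores shift only modestly, and the estimated syllables per word drops from 1.36 to 1.04. One likely explanation is that the all-caps captioning is skewing the internal algorithm used by GNU style, which may depend on proper capitalization when it divides sentences and counts syllables.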
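To see how much of the gap the syllable estimate alone could explain, the sketch below recomputes three of the scores from the summary statistics printed above. It assumes GNU style uses the standard published Flesch Reading Ease, Flesch-Kincaid and ARI formulas; the function and dictionary names are illustrative and not part of any tool.

```python
# Recompute readability scores from the summary statistics reported above.
# Assumption: the standard published formulas apply; names are illustrative.

def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid grade level (depends on syllable counts)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def ari(words_per_sentence, chars_per_word):
    """Automated Readability Index (depends on characters, not syllables)."""
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Figures copied from the two "sentence info" sections above.
REPORTED = {
    "transcript": {"words": 3948, "sentences": 266,
                   "chars_per_word": 4.47, "syllables_per_word": 1.36},
    "captioning": {"words": 4140, "sentences": 314,
                   "chars_per_word": 4.41, "syllables_per_word": 1.04},
}


def recompute(stats):
    """Return the three scores for one set of summary statistics."""
    wps = stats["words"] / stats["sentences"]
    return {
        "flesch": flesch_reading_ease(wps, stats["syllables_per_word"]),
        "kincaid": kincaid_grade(wps, stats["syllables_per_word"]),
        "ari": ari(wps, stats["chars_per_word"]),
    }


scores = {name: recompute(stats) for name, stats in REPORTED.items()}

if __name__ == "__main__":
    for name, result in scores.items():
        print(name, {key: round(value, 1) for key, value in result.items()})
```

Under these assumptions the recomputed values land within about half a point of the scores reported above. That is consistent with the lower syllables-per-word estimate driving most of the Kincaid and Flesch gap, while the character-based ARI changes far less.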

The GDELT Project
