When it comes to translating television news into searchable text, both realtime closed captioning and cleaned post-broadcast rush transcripts are typed by humans. Closed captioning, however, is typically all uppercase or all lowercase with limited punctuation and is often marred by gaps or typos, while cleaned rush transcripts have proper capitalization, correct speaker attribution, all of the spoken text, and punctuation that matches speaker cadence. How does this affect readability scores?
Below is the output of GNU style applied to CNN's rush transcript for CNN Newsroom on December 1, 2020, from 9-10AM PST:
readability grades:
        Kincaid: 6.3
        ARI: 7.0
        Coleman-Liau: 8.5
        Flesch Index: 76.4/100
        Fog Index: 9.7
        Lix: 33.8 = below school year 5
        SMOG-Grading: 9.5
sentence info:
        17629 characters
        3948 words, average length 4.47 characters = 1.36 syllables
        266 sentences, average length 14.8 words
        46% (124) short sentences (at most 10 words)
        18% (50) long sentences (at least 25 words)
        77 paragraphs, average length 3.5 sentences
        4% (13) questions
        39% (104) passive sentences
        longest sent 66 wds at sent 258; shortest sent 1 wds at sent 18
word usage:
        verb types:
        to be (166) auxiliary (56)
        types as % of total:
        conjunctions 5% (206) pronouns 13% (497) prepositions 11% (449)
        nominalizations 1% (44)
sentence beginnings:
        pronoun (69) interrogative pronoun (7) article (10)
        subordinating conjunction (3) conjunction (44) preposition (5)
And here are the results for the closed captioning of the same broadcast. The captioning covers the full hour-long broadcast while the rush transcript covers only the first 30 minutes, so the captioning was truncated to end at the same point as the rush transcript:
readability grades:
        Kincaid: 1.8
        ARI: 5.9
        Coleman-Liau: 7.9
        Flesch Index: 105.9/100
        Fog Index: 5.6
        Lix: 32.6 = below school year 5
        SMOG-Grading: 4.9
sentence info:
        18244 characters
        4140 words, average length 4.41 characters = 1.04 syllables
        314 sentences, average length 13.2 words
        41% (130) short sentences (at most 8 words)
        16% (53) long sentences (at least 23 words)
        1 paragraphs, average length 314.0 sentences
        3% (11) questions
        38% (122) passive sentences
        longest sent 61 wds at sent 307; shortest sent 1 wds at sent 32
word usage:
        verb types:
        to be (160) auxiliary (66)
        types as % of total:
        conjunctions 4% (171) pronouns 12% (485) prepositions 11% (470)
        nominalizations 1% (54)
sentence beginnings:
        pronoun (90) interrogative pronoun (13) article (22)
        subordinating conjunction (8) conjunction (18) preposition (12)
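Immediately clear is that while the basic counts of characters, words and sentences are relatively similar, several of the readability scores are dramatically different: Kincaid falls from 6.3 to 1.8, Fog from 9.7 to 5.6 and SMOG from 9.5 to 4.9, while the Flesch Index rises from 76.4 to an out-of-range 105.9/100. The character-based ARI, Coleman-Liau and Lix scores move far less. One likely explanation is that the all-caps captioning is skewing the internal algorithm used by GNU style, which may depend on proper capitalization when it divides sentences and counts syllables. The reported average syllables per word, 1.36 for the rush transcript versus 1.04 for the captioning, is consistent with that possibility.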
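As a rough arithmetic check, the sketch below recomputes two of the scores from the sentence statistics reported above, using the standard published Flesch-Kincaid grade and Flesch reading ease formulas. That GNU style uses exactly these coefficients is an assumption, and the function names are made up for illustration. The inputs are the rounded figures from the output, so small discrepancies against the reported scores are expected.

```python
# Recompute Kincaid and Flesch from the statistics GNU style reported above.
# Assumption: the standard published coefficients; GNU style's internals may differ.


def kincaid(words, sentences, syllables_per_word):
    """Flesch-Kincaid grade level."""
    return 0.39 * (words / sentences) + 11.8 * syllables_per_word - 15.59


def flesch(words, sentences, syllables_per_word):
    """Flesch reading ease."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * syllables_per_word


# (words, sentences, average syllables per word) copied from the two outputs.
REPORTED = {
    "rush transcript": (3948, 266, 1.36),
    "closed captioning": (4140, 314, 1.04),
}


def kincaid_gap_breakdown(a, b):
    """Split the Kincaid difference (a minus b) into the part due to
    syllables per word and the part due to words per sentence."""
    (w_a, s_a, syl_a), (w_b, s_b, syl_b) = a, b
    from_syllables = 11.8 * (syl_a - syl_b)
    from_sentence_length = 0.39 * (w_a / s_a - w_b / s_b)
    return from_syllables, from_sentence_length


if __name__ == "__main__":
    for name, stats in REPORTED.items():
        print(f"{name}: Kincaid {kincaid(*stats):.1f}, Flesch {flesch(*stats):.1f}")
    syl, length = kincaid_gap_breakdown(
        REPORTED["rush transcript"], REPORTED["closed captioning"]
    )
    print(f"Kincaid gap from syllables: {syl:.2f}, from sentence length: {length:.2f}")
```

Under these assumptions the recomputed values land within a few tenths of the reported ones. Most of the Kincaid gap comes from the syllables-per-word term rather than from sentence length, which fits the capitalization hypothesis but does not prove it.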