Using Automated Speech Recognition To Precision Align Human Closed Captioning

Human-generated closed captioning remains the gold standard for live television news broadcasts, but relying on human transcriptionists who listen to the broadcast and type what they hear in real time means there is necessarily a delay between the words spoken on the television and the words appearing in the closed captioning stream. Delays of several seconds are largely inconsequential for keyword captioning search, but they create unique challenges for hybrid visual-audio search, in which the spoken word must be aligned to what was onscreen at the moment it was spoken.

Take NBC Nightly News from the evening of March 7, 2011. Just after the conclusion of the broadcast, there is a brief overlap with the local KNTV news programming in which the phrase "scandal involving…" is spoken. In the closed captioning record for this broadcast, the word "scandal" is listed as being spoken 1,781 seconds after the start of the broadcast. Listen to the actual clip beginning at that moment and the word is missing, indicating it must have been spoken prior to that timecode offset.

In contrast, the automated transcripts generated by the Google Cloud Video API's speech transcription "video" model list the word "scandal" as being spoken across offsets 1,775 to 1,776 seconds, a full 6 seconds earlier. Listen to that clip and the word appears exactly where the model says it will.
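
The post references the Google Cloud Video API's "video" speech model; as a hedged sketch, the same kind of word-level timecodes can be retrieved with the Cloud Speech-to-Text Python client's "video" model and word time offsets enabled. The client choice, bucket URI, and field access below are assumptions about the toolchain, not necessarily the exact pipeline used here.

```python
# Hedged sketch: retrieving word-level timecodes with the Cloud Speech-to-Text
# Python client and its "video" model. The bucket URI is hypothetical.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",
    model="video",                  # the "video" speech model referenced above
    enable_word_time_offsets=True,  # request per-word start/end times
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/nbc-nightly-news-20110307.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

# Flatten the response into (word, start_seconds, end_seconds) tuples.
# In google-cloud-speech >= 2.0 the offsets are datetime.timedelta objects;
# older releases expose .seconds and .nanos instead.
machine_words = []
for result in response.results:
    for w in result.alternatives[0].words:
        machine_words.append(
            (w.word, w.start_time.total_seconds(), w.end_time.total_seconds())
        )
```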

The fact that the station-provided closed captioning can place a word a full 6 seconds after it was actually spoken makes it difficult to align such transcripts with the visual landscape that underlies them. Compounding the problem, captioning often lumps several seconds of spoken words together under a single start position, making it hard to estimate when words that don't begin a captioning line were actually spoken (in this case "scandal" begins a captioning line, so we have its precise caption-provided start time).
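
To make that lumping concrete, the two timing granularities might be sketched as follows; the field names and the subsecond values after "scandal" are purely illustrative, not the actual caption or ASR file formats.

```python
# Illustrative shapes only; field names and subsecond values are hypothetical.

# Station closed captioning: one start offset shared by an entire caption line.
caption_line = {
    "start": 1781,                    # seconds from the start of the broadcast
    "text": "SCANDAL INVOLVING ...",  # every word on the line shares that offset
}

# Machine transcript: one start/end offset per word, at subsecond precision.
machine_words = [
    ("scandal", 1775.0, 1776.0),
    ("involving", 1776.0, 1776.7),    # illustrative values
    # ...
]
```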

Thus, we have a subsecond-precise machine transcript that is largely correct aside from a few errors and misspelled names, and a human-provided transcript that is traditionally held as the gold standard (though it tends to exclude complex introductions, guest titles, and affiliations, and typically excludes commercials) but carries a varying timecode delay.

What if we merged these two transcripts, using the precisely timecoded machine transcript to precision align the human transcript? This has the added benefit of allowing us to add timecode information to offline-corrected transcripts delivered outside of the timecoded closed captioning stream.

While there are many approaches in the literature, it turns out there is an incredibly simple way of achieving this: the age-old "diff" process. We compute the diff alignment of the human transcript against the machine transcript, copy the timecodes from the machine transcript onto the matching words of the human transcript, and interpolate timecodes for the human words that deviate. Given that the machine transcript is in all cases more complete (though more error-prone) than the human transcript, this allows us to apply precision subsecond timecodes to the human transcript with nothing more than a simple "diff". Best of all, even if the machine transcript has an unusually high level of error, a sufficient number of words will still be recovered to allow precise alignment, and fuzzy word matching can close that precision gap even further.
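
Here is a minimal sketch of that diff idea in Python, using the standard library's difflib to align a crudely normalized human transcript against the machine transcript, copying machine timecodes onto matching words and linearly interpolating start times for the words that deviate. The function names, normalization, and interpolation scheme are illustrative choices, not the exact production implementation.

```python
import difflib

def normalize(word):
    """Crude normalization so the diff compares words, not case or punctuation."""
    return "".join(c for c in word.lower() if c.isalnum())

def align_transcripts(human_words, machine_words):
    """Copy machine timecodes onto matching human words via a diff alignment.

    human_words:   list of words from the closed captioning transcript.
    machine_words: list of (word, start_sec, end_sec) tuples from the ASR output.
    Returns a list of (human_word, start_sec) pairs, with interpolated start
    times for human words the diff could not match.
    """
    human_norm = [normalize(w) for w in human_words]
    machine_norm = [normalize(w) for w, _, _ in machine_words]

    sm = difflib.SequenceMatcher(a=human_norm, b=machine_norm, autojunk=False)

    # Pass 1: copy machine start times onto the human words the diff matches.
    starts = [None] * len(human_words)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            starts[block.a + k] = machine_words[block.b + k][1]

    # Pass 2: linearly interpolate start times for unmatched human words
    # between the nearest matched words on either side.
    anchors = [i for i, s in enumerate(starts) if s is not None]
    for i in range(len(starts)):
        if starts[i] is not None:
            continue
        prev_a = max((a for a in anchors if a < i), default=None)
        next_a = min((a for a in anchors if a > i), default=None)
        if prev_a is None and next_a is None:
            starts[i] = 0.0                      # no matches at all
        elif prev_a is None:
            starts[i] = starts[next_a]           # before the first match
        elif next_a is None:
            starts[i] = starts[prev_a]           # after the last match
        else:
            frac = (i - prev_a) / (next_a - prev_a)
            starts[i] = starts[prev_a] + frac * (starts[next_a] - starts[prev_a])

    return list(zip(human_words, starts))
```

Fuzzy word matching could be layered onto this sketch by, for example, mapping near-identical tokens (such as ASR misspellings of proper names) to a common form before running the diff, so they count as matches rather than gaps to interpolate over.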