Speaker Identification By Merging Captioning With Rush Transcripts

Human-generated closed captioning and machine-generated automatic speech recognition (ASR) both offer transcripts of television news programming, but while they capture what was said, they don't record who said it. In an exchange between a reporter and an interviewee or during a panel discussion, for example, there is usually no way of telling who said what.

In the case of a single speaker, chyrons may be useful in identifying the speaker, using the name and affiliation displayed in the "lower third" section of the screen, but this doesn't help distinguish between multiple speakers or individuals who are routinely mentioned on television news outside of their own remarks (such as a president or Dr. Fauci).

Instead, one option is to take the so-called "rush transcripts" that some television channels make available for selections of their programming (in the case of CNN, provided by LexisNexis and Factiva) and merge them with the closed captioning or ASR transcript, using the "diff" utility to align the two and then copying the speaker identification from the rush transcript over to the captioning/ASR transcript.

For example, take the January 2, 2021 7AM PST broadcast of CNN Newsroom With Victor Blackwell and Christi Paul. During one of the exchanges, we see in the underlying closed captioning:

[000:03:55;685] >> THAT TWEET WAS A VERY BROAD
[000:03:58;254] CRITIQUE OF THE GOP, BUT I KNOW
[000:03:59;956] HE'S ACTUALLY TAKING AIM AT
[000:04:02;558] SEVERAL SPECIFIC GOP SENATORS,
[000:04:05;795] YES?
[000:04:06;062] >> Reporter: THAT'S RIGHT, AND
[000:04:06;796] ONE OF THEM IS ONE OF THE
[000:04:07;997] HIGHEST RANKING REPUBLICANS IN
[000:04:08;865] THE SENATE.
[000:04:10;800] THAT'S MAJORITY WHIP JOHN THUNE.

In this particular case, the captioning includes a special notation that the second turn was spoken by a "Reporter", but in most cases such delineations are not used, meaning there would have been no way of telling that these two turns were spoken by two different people. Even in this case, there is no way of telling from the captioning alone who each of the two speakers actually is.
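As a rough illustration of how little structure the captioning carries, the excerpt above can be grouped into turns with a few lines of Python. This is only a sketch: the function name `parse_captions`, the assumption that every line looks like `[timestamp] text`, and the rule that an inline label is a mixed-case word followed by a colon are all inferred from this one excerpt rather than from any captioning specification.

```python
import re

# Assumed layout, inferred from the excerpt: "[timestamp] text".
CAPTION_LINE = re.compile(r"^\[([0-9:;]+)\]\s*(.*)$")
# Assumed label style, inferred from ">> Reporter:": mixed-case words, then a colon.
INLINE_LABEL = re.compile(r"^([A-Z][a-z]+(?: [A-Z][a-z]+)*):\s*(.*)$")

def parse_captions(lines):
    """Group caption lines into turns, starting a new turn at each '>>' marker."""
    turns = []
    for line in lines:
        match = CAPTION_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        timestamp, text = match.groups()
        if text.startswith(">>"):
            text = text[2:].strip()
            label = None
            label_match = INLINE_LABEL.match(text)
            if label_match:
                label, text = label_match.groups()
            turns.append({"start": timestamp, "label": label, "text": text})
        elif turns:
            # Continuation line: append to the turn that is currently open.
            turns[-1]["text"] += " " + text
    return turns

excerpt = [
    "[000:03:55;685] >> THAT TWEET WAS A VERY BROAD",
    "[000:03:58;254] CRITIQUE OF THE GOP, BUT I KNOW",
    "[000:03:59;956] HE'S ACTUALLY TAKING AIM AT",
    "[000:04:02;558] SEVERAL SPECIFIC GOP SENATORS,",
    "[000:04:05;795] YES?",
    "[000:04:06;062] >> Reporter: THAT'S RIGHT, AND",
    "[000:04:06;796] ONE OF THEM IS ONE OF THE",
    "[000:04:07;997] HIGHEST RANKING REPUBLICANS IN",
    "[000:04:08;865] THE SENATE.",
    "[000:04:10;800] THAT'S MAJORITY WHIP JOHN THUNE.",
]
turns = parse_captions(excerpt)
```

Run on the excerpt, this yields two turns, the first with no label at all and the second labeled only "Reporter", which is as much as the captioning can say about who is speaking.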

If we instead look at the CNN rush transcript for that program, we see the following for this exchange:

PAUL: Yes, Sarah, that tweet was a very broad critique of the GOP, but I know he's actually taking aim at several specific GOP senators, yes?
WESTWOOD: That's right, Christi. And one of them is one of the highest-ranking Republicans in the Senate. That's Majority Whip John Thune.

Since this exchange is not the first time the two have spoken in this broadcast, they are identified only by their last names, but scrolling up in the rush transcript we see their last names connected to their full names and affiliations in a machine-friendly format:

CHRISTI PAUL, CNN ANCHOR
SARAH WESTWOOD, CNN WHITE HOUSE REPORTER
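Because these introductions follow a regular pattern, a small lookup from last name to full name and affiliation is easy to build. The sketch below assumes the introductions appear exactly as in the two lines above (an all-caps name, a comma, then the affiliation) and that last names are unique within a broadcast; `build_speaker_index` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Assumed layout, inferred from the two lines above: "FULL NAME, AFFILIATION".
# The name part may only contain capitals, spaces, periods, apostrophes and hyphens,
# so ordinary dialogue lines such as "PAUL: Yes, Sarah, ..." are not matched.
SPEAKER_HEADER = re.compile(r"^([A-Z][A-Z .'-]+),\s+(.+)$")

def build_speaker_index(rush_lines):
    """Map each last name to the full name and affiliation from its first introduction."""
    index = {}
    for line in rush_lines:
        match = SPEAKER_HEADER.match(line.strip())
        if not match:
            continue
        full_name, affiliation = match.groups()
        last_name = full_name.split()[-1]
        # Keep the first introduction if a name shows up more than once.
        index.setdefault(last_name, {"name": full_name, "affiliation": affiliation})
    return index

index = build_speaker_index([
    "CHRISTI PAUL, CNN ANCHOR",
    "SARAH WESTWOOD, CNN WHITE HOUSE REPORTER",
    "PAUL: Yes, Sarah, that tweet was a very broad critique of the GOP",
])
```

With such an index, a later label like "WESTWOOD" can be expanded back to "SARAH WESTWOOD, CNN WHITE HOUSE REPORTER".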

Thus, with relatively little work it would be possible to enrich closed captioning with speaker identification information simply by merging it with these rush transcripts.
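The merge itself might look something like the following. Note that this sketch swaps in Python's standard `difflib.SequenceMatcher` as a stand-in for the "diff" utility, compares the two transcripts word by word after lowercasing and stripping punctuation, and copies the rush transcript's speaker onto every caption word that aligns. The helper names `tokenize` and `label_caption_words` are hypothetical, and the word-level normalization is an assumption rather than a description of any production pipeline.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase and keep only word characters so casing and punctuation do not block matches."""
    return re.findall(r"[a-z0-9']+", text.lower())

def label_caption_words(caption_text, rush_turns):
    """Return one speaker label (or None if unaligned) per caption word.

    rush_turns is a list of (speaker, text) pairs taken from the rush transcript.
    """
    caption_words = tokenize(caption_text)
    rush_words, rush_speakers = [], []
    for speaker, text in rush_turns:
        for word in tokenize(text):
            rush_words.append(word)
            rush_speakers.append(speaker)
    labels = [None] * len(caption_words)
    matcher = SequenceMatcher(None, caption_words, rush_words, autojunk=False)
    # Each matching block is a run of identical words in both transcripts.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            labels[block.a + offset] = rush_speakers[block.b + offset]
    return labels

caption_text = (
    "THAT TWEET WAS A VERY BROAD CRITIQUE OF THE GOP, BUT I KNOW HE'S ACTUALLY "
    "TAKING AIM AT SEVERAL SPECIFIC GOP SENATORS, YES? THAT'S RIGHT, AND ONE OF "
    "THEM IS ONE OF THE HIGHEST RANKING REPUBLICANS IN THE SENATE. "
    "THAT'S MAJORITY WHIP JOHN THUNE."
)
rush_turns = [
    ("PAUL", "Yes, Sarah, that tweet was a very broad critique of the GOP, but I know "
             "he's actually taking aim at several specific GOP senators, yes?"),
    ("WESTWOOD", "That's right, Christi. And one of them is one of the highest-ranking "
                 "Republicans in the Senate. That's Majority Whip John Thune."),
]
labels = label_caption_words(caption_text, rush_turns)
```

Words that appear only in the rush transcript, such as the "Sarah" and "Christi" that the captioning dropped, simply fall outside the matching runs, while caption words with no counterpart in the rush transcript keep a label of `None` and could be filled in from their neighbors.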