Comparing Google Cloud Video API's Speech Recognition To Human Captioning Over TV News 2010-2020

Earlier this year we compared the Google Cloud Video API's automatic speech recognition (ASR) over the past decade against the human captioning provided by each network, finding greater than 90% agreement. How do those results compare with 24/7 news channels like BBC News, CNN, MSNBC and Fox News?

In collaboration with the Media-Data Research Consortium (M-DRC), we've processed BBC News, CNN, MSNBC and Fox News 24/7 since the start of this year, along with selections of CNN, MSNBC and Fox News over the past decade during a set of major disease outbreaks. While our 2009-2019 comparison is based on just small selections of the channels over that period, it nonetheless allows us to examine macro-level trends in how human and machine captioning align.

Note that the results for BBC News London will be slightly different from the other channels. For BBC News, the onscreen captioning text is OCR'd and converted to ASCII text, whereas for the other channels, the closed captioning stream is read directly as ASCII text. This OCR process introduces additional error.

As before, we used the UNIX "diff" tool to compare the human and machine transcripts. Because closed captioning typically excludes commercials, we ignored blocks of text that appeared only in the machine transcript. The final results can be seen in the bar chart below, with one score reporting the average over 2020 and the other reporting the results from 2009-2020. The Y axis reports the percent agreement between the human and machine transcripts.
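To make the comparison above concrete, here is a minimal sketch in Python. It is not the pipeline used for these results: it substitutes Python's standard difflib module for the UNIX "diff" tool, compares word by word rather than line by line, and assumes a scoring formula (matched words divided by matched plus differing words) that this post does not specify. The function names tokenize and percent_agreement are hypothetical.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def percent_agreement(human_text, machine_text):
    """Percent of compared words on which the two transcripts agree.

    Assumption: agreement = matched / (matched + differing). Blocks that
    appear only in the machine transcript (difflib's "insert" opcode) are
    skipped, mirroring the rule of ignoring machine-only text such as
    uncaptioned commercials.
    """
    human = tokenize(human_text)
    machine = tokenize(machine_text)
    # autojunk=False disables difflib's popular-token heuristic so that
    # common words like "the" are still aligned in long transcripts.
    matcher = difflib.SequenceMatcher(None, human, machine, autojunk=False)

    matched = 0
    differing = 0
    for tag, h1, h2, m1, m2 in matcher.get_opcodes():
        if tag == "equal":
            matched += h2 - h1
        elif tag == "insert":
            # Text present only in the machine transcript: ignored.
            continue
        else:
            # "replace" or "delete": count the longer side of the mismatch.
            differing += max(h2 - h1, m2 - m1)

    total = matched + differing
    if total == 0:
        raise ValueError("no words to compare")
    return 100.0 * matched / total


if __name__ == "__main__":
    human = "Cases rose sharply in March."
    machine = "cases rose sharply in marsh"
    print(percent_agreement(human, machine))  # 80.0
```

One consequence of skipping every machine-only block is that words the ASR wrongly inserts are never penalized, so a score computed this way should be read as agreement on the text the human captioner actually produced.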

All seven channels have nearly identical scores, except for a slightly lower score for BBC News, likely due to the OCR error introduced by transcribing its captioning from onscreen text rather than reading it in born-digital format.

The timeline below plots all of these scores over time (all ASR was performed in 2020, so the X axis reflects the age of the program material, not when ASR was performed). Remember that BBC News has been analyzed only since the start of this year and that CNN, MSNBC and Fox News include only selections covering major disease outbreaks from 2009-2019. Overall, little change can be seen over this period, suggesting that ASR agreement has remained remarkably consistent across myriad newsreaders, hosts, guest interviewees, onsite reporters, press conference speakers and technical changes over the past decade.

Put together, these graphs remind us that ASR has reached the point where it can achieve near-human levels of accuracy on broadcast television news. Remember that many of the differences reflected in these graphs are actually errors in the human captioning rather than in the machine transcripts, suggesting that ASR is now highly competitive with real-time human captioning.