Comparing Google Cloud Video API's Speech Recognition To Human Captioning Over TV News 2010-2020: Deep Dive

For those interested in deep diving into the comparison we released earlier today between Google Cloud Video API's speech recognition and human closed captioning, it is worth exploring which shows have the greatest deviation between human and machine transcripts and which have the least. Notably, manual review of a number of shows with substantial deviations found that the machine actually outperforms the human in most of those cases, with the human captioner dropping individual words and sometimes omitting entire substantive passages that the machine transcribes in full.

Take CNN's January 5, 2020 4PM PST "The Impeachment of Donald J. Trump" broadcast. At the 4:04PM PST mark the following passage appears in the human-created captioning:

"the white house blocked witnesses we know that have been discovered by journalists who had direct knowledge"

In contrast, the machine captioning is twice as long and even captures the speaker uttering the word "from" twice in sequence:

"the white house blocked from the house portion this witnesses who we know from from emails that have been discovered by journalists as opposed to by the house inquiry in the last couple of weeks you know had direct knowledge"

If you listen to the passage carefully you'll note that the machine transcription is flawless, whereas the human transcript omits substantial detail. This suggests that machine transcription may have particular utility for fast-paced speakers and situations where a lot of fine detail is being uttered in rapid succession that can quickly overwhelm a human.
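To make the notion of deviation concrete, below is a minimal sketch of one way to quantify how far a human caption strays from the corresponding machine transcript, using the two passages above. The original comparison does not specify its exact similarity metric, so the word-level alignment via Python's difflib here is purely illustrative.

# Illustrative sketch only: word-level comparison of a human caption and a
# machine transcript. The actual metric used in the GDELT comparison is not
# specified; difflib.SequenceMatcher is used here as an assumption.
import difflib

human = ("the white house blocked witnesses we know that have been "
         "discovered by journalists who had direct knowledge")

machine = ("the white house blocked from the house portion this witnesses "
           "who we know from from emails that have been discovered by "
           "journalists as opposed to by the house inquiry in the last "
           "couple of weeks you know had direct knowledge")

human_words = human.split()
machine_words = machine.split()

# Similarity ratio over word sequences (0.0 = no overlap, 1.0 = identical).
matcher = difflib.SequenceMatcher(None, human_words, machine_words)
print(f"Word-level similarity: {matcher.ratio():.3f}")
print(f"Human word count:   {len(human_words)}")
print(f"Machine word count: {len(machine_words)}")

# Surface the material present in the machine transcript but absent from the
# human caption, i.e. what the captioner skipped.
for op, h1, h2, m1, m2 in matcher.get_opcodes():
    if op in ("insert", "replace"):
        print("Missing from human caption:", " ".join(machine_words[m1:m2]))

Run against the passages above, this reports the machine transcript at more than twice the word count of the human caption, with the omitted material surfaced as gaps in the alignment.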

You can see for yourself how closely the human and machine transcripts matched for each of the examined broadcasts by downloading the tab-delimited spreadsheet below, with one row per video reflecting the statistics of that particular show:

The file is tab delimited with the following fields: