How do the automated television news transcripts generated by Google's Cloud Video API compare with the human-produced captioning viewers see? While we'll be exploring these differences in more detail over the coming weeks, a closer look at NBC Nightly News from the evening of March 7, 2011 offers a few hints about the kinds of systematic differences we see at scale.
Common sources of small-scale differences scattered throughout transcripts include:
- Numbers. Google's ASR typically writes short numbers out as words, while human transcripts use numerals. Thus, examples abound of "five dollars" versus "$5". Matters get slightly more complicated when a newsreader says "three fifty-two a gallon", which Google transcribes as "352 a gallon" while the human captioner wrote "$3.52 a gallon". Because the announcer never said "dollars" and never announced a decimal point, the ASR transcribed the number as it was spoken, whereas the human captioner knew from context that the number referred to a dollar amount.
- Transliterations. There are often multiple widely used transliterations of names originating in non-Latin scripts. In the case of Libya's former leader, Google's ASR writes "Gaddafi" (used in Wikipedia's English transliteration) while the human transcript used "Gadhafi." Such transliteration differences are a common source of divergence.
- Compound Words. Compound words are a common source of divergence, with ASR choosing "counter-attacks" while the human wrote "counterattacks", the ASR choosing "sky-high" while the human chose "sky high" and so on.
- Plurals And Tense. On occasion the ASR misses the plural form of a word, but more commonly the ASR correctly writes the plural while the human transcript contains the singular. For example, "warplanes" (ASR) versus "warplane" (human). In other cases the machine gets things wrong when words blend together, as in "concerned that" (ASR) versus "concern that" (human) and "you seen" (ASR) versus "you're seeing" (human).
- Similar-Sounding Words. Counterintuitively, it is often the human rather than the ASR that gets these wrong, since the ASR relies on surrounding context to disambiguate the word. For example, the ASR correctly transcribed "price swings" versus the human's "price wings".
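When comparing the two transcripts programmatically, many of these small-scale differences can be neutralized before diffing so that only substantive divergences remain. The following is a minimal sketch, not the method used for this analysis: the `TRANSLITERATIONS` table and the function names `normalize` and `differences` are hypothetical, and it handles only name variants and compound-word spacing. Spelled-out numbers such as "five dollars" versus "$5" would need a separate number-parsing step.

```python
from __future__ import annotations

import difflib
import re

# Hypothetical variant table for illustration; a real one would be far larger.
TRANSLITERATIONS = {"gadhafi": "gaddafi"}


def normalize(text: str) -> list[str]:
    """Lowercase, split hyphenated compounds, and unify known name variants."""
    text = text.lower().replace("-", " ")
    tokens = re.findall(r"[a-z0-9']+", text)
    return [TRANSLITERATIONS.get(tok, tok) for tok in tokens]


def differences(asr: str, human: str) -> list[tuple[str, str]]:
    """Return (asr_span, human_span) pairs that still differ after normalization."""
    a, h = normalize(asr), normalize(human)
    matcher = difflib.SequenceMatcher(None, a, h, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left, right = a[i1:i2], h[j1:j2]
        # "counter attacks" and "counterattacks" differ only in spacing.
        if "".join(left) == "".join(right):
            continue
        result.append((" ".join(left), " ".join(right)))
    return result


if __name__ == "__main__":
    print(differences("the price swings of warplanes", "the price wings of warplane"))
```

With this normalization, "Gaddafi" versus "Gadhafi" and "sky-high" versus "sky high" no longer register as differences, while genuine divergences such as "swings" versus "wings" are still reported.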
Perhaps the greatest sources of difference between the machine and human transcripts are the following two classes:
- Commercials. A portion of commercials are not captioned, so they appear as blank periods in the human transcripts, while the machine transcribes them, creating large chunks of airtime where the two deviate.
- Shortcuts. Human transcriptionists are under immense time pressure and frequently exclude titles, affiliations and other lengthy introductions, and often skip connecting words. Examples include "supplies were meant" (ASR) versus "supplies meant" (human), "taxi riders in this city are now being" (ASR) versus "taxi riders are being" (human), "us ambassador to nato says they" (ASR) versus "us ambassador said they" (human), "but then he seemed to downplay the" (ASR) versus "but downplayed the" (human) and so on. The ASR also faithfully transcribes repeated words and the exact words the speaker said, as in "pretty frustrating to have to spend like sixty sixty-five dollars" (ASR) versus "really frustrated i've just spent like $65" (human).
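One rough way to surface shortcuts like these is to align the two transcripts word by word and list the runs of words that appear only in the ASR output. The sketch below is illustrative only and assumes simple whitespace-and-punctuation tokenization; the function name `dropped_words` is hypothetical and is not part of any Google API.

```python
from __future__ import annotations

import difflib
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def dropped_words(asr: str, human: str) -> list[str]:
    """Return runs of words present in the ASR text but absent from the human text."""
    a, h = tokenize(asr), tokenize(human)
    matcher = difflib.SequenceMatcher(None, a, h, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" marks words the human captioner left out entirely.
        if tag == "delete":
            dropped.append(" ".join(a[i1:i2]))
    return dropped


if __name__ == "__main__":
    print(dropped_words(
        "taxi riders in this city are now being",
        "taxi riders are being",
    ))
```

Only pure omissions are reported here. Rewordings such as "says" versus "said" show up as replacements in the alignment rather than deletions and would need to be examined separately.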