Earlier this week we explored how OpenAI's Whisper ASR translated a Russian-language television news broadcast into English. Towards the middle of that post we compared the English translations of the first five minutes of the broadcast produced by Whisper's Tiny, Small, Medium and Large models.
One specific difference between the models' results illustrates the challenges of combined automated ASR and translation. While the Tiny model's output was too noisy to capture much of the broadcast's meaning, the Small model captured its general gist, including this translation:
[03:30.000 --> 03:33.000] Yes, this is the answer to the NATO threat to Putin.
[03:33.000 --> 03:50.000] This is the answer to the NATO threat to Putin.
Yet, the Medium model translates that same statement as:
[03:30.000 --> 03:33.000] Yes, this is the response of NATO to the threats of Putin.
[03:33.000 --> 03:34.000] Putin's threats.
While the Large model collapses it into a single segment:
[03:30.000 --> 03:52.000] Yes, this is the answer of NATO to Putin's threats.
Look closely: the Small model describes NATO as threatening Putin, while the Medium and Large models describe Putin as threatening NATO, with the Medium model echoing the final phrase in a second segment.
Therein lies one of the challenges of combined machine transcription and translation: with smaller models, the meaning of a statement can completely invert, so special caution is required when trading accuracy for reduced inference time and resource consumption.
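For those who wish to replicate this kind of side-by-side comparison, the sketch below shows one way it might be run using the open source openai-whisper Python package. The filename broadcast.mp3 and the ts() helper are hypothetical stand-ins, and language="ru" assumes the source audio is Russian (omitting it lets Whisper autodetect the language).

```python
# A minimal sketch of running the same clip through each Whisper model size
# with the "translate" task. "broadcast.mp3" is a hypothetical filename
# standing in for the saved broadcast audio.
import whisper

def ts(seconds: float) -> str:
    # Format seconds as MM:SS.mmm to match the timestamps quoted above.
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"

for size in ["tiny", "small", "medium", "large"]:
    model = whisper.load_model(size)  # weights download on first use
    result = model.transcribe(
        "broadcast.mp3",
        task="translate",   # decode Russian speech directly into English
        language="ru",      # assumes a Russian-language source
    )
    print(f"--- {size} ---")
    for seg in result["segments"]:
        print(f"[{ts(seg['start'])} --> {ts(seg['end'])}] {seg['text'].strip()}")
```

Running all four sizes over the same clip makes divergences like the one above easy to spot, since each model's segments print in the same timestamped format.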