Across our experiments with OpenAI's Whisper ASR this week, its unprecedented fluency has been offset by a high degree of non-determinism: results vary substantially from run to run and are frequently marred by dropouts, repetition, and hallucination.
Whisper's creators acknowledge these issues and in a previous Q&A noted that non-determinism "happens when the model is unsure about the output (according to the compression_ratio_threshold and logprob_threshold settings). The most common failure mode is that it falls into a repeat loop, where it likely triggers the compression_ratio_threshold. The default setting tries temperatures 0, 0.2, 0.4, 0.6, 0.8, 1.0 until it gives up, at which point it is less likely to be in a repeat loop but is also less likely to be correct."
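For readers using the Python API rather than the CLI, here is a minimal sketch of where those thresholds live, assuming the open-source openai-whisper package; the model size and file name are placeholders, and the threshold values shown are simply the package's documented defaults:

```python
import whisper

# Placeholder model size and audio file; swap in any of the sizes tested below.
model = whisper.load_model("small")

result = model.transcribe(
    "audio.mp3",
    # The fallback schedule: decoding is retried at each successive
    # temperature whenever one of the quality checks below fails.
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    # Retry if the gzip compression ratio of the decoded text exceeds
    # this value (a proxy for the repeat loops described above).
    compression_ratio_threshold=2.4,
    # Retry if the average log probability of the tokens falls below this.
    logprob_threshold=-1.0,
)
print(result["text"])
```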
Their recommended solution was to add `--temperature_increment_on_fallback None` as a CLI parameter to stabilize the output. To test its efficacy, we ran all four models on both the transcription and translation tasks, with the results below (a Python-API equivalent of the recipe appears after the list). While the flag did eliminate the non-determinism, it came at the cost of eliminating more than half of the broadcast, replacing it with repeated text.
- Tiny: Transcribe: Test 1
- Tiny: Transcribe: Test 2
- Tiny: Translate: Test 1
- Tiny: Translate: Test 2
- Small: Transcribe: Test 1
- Small: Transcribe: Test 2
- Small: Translate: Test 1
- Small: Translate: Test 2
- Medium: Transcribe: Test 1
- Medium: Transcribe: Test 2
- Medium: Translate: Test 1
- Medium: Translate: Test 2
- Large: Transcribe: Test 1
- Large: Transcribe: Test 2
- Large: Translate: Test 1
- Large: Translate: Test 2
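As noted above, here is a sketch of the Python-API equivalent of that CLI recipe, again assuming the openai-whisper package; `broadcast.mp3` is a placeholder for our test file, and passing a single temperature of 0 is the API-side analogue of `--temperature_increment_on_fallback None`, since it disables the fallback schedule so decoding is greedy and repeatable:

```python
import whisper

# Run all four model sizes on both tasks, as in the tests listed above.
# With a single temperature of 0 there is no fallback schedule, so each
# run should produce identical output (at the risk of repeat loops).
for size in ("tiny", "small", "medium", "large"):
    model = whisper.load_model(size)
    for task in ("transcribe", "translate"):
        result = model.transcribe(
            "broadcast.mp3",   # placeholder input file
            task=task,         # "transcribe" or "translate"
            temperature=0.0,   # no fallback: deterministic output
        )
        print(size, task, result["text"][:80])
```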