A Deep Dive Exploration Applying OpenAI's Whisper ASR To A French-Language Télé Congo TV News Broadcast

Thus far this week we have conducted deep dives applying OpenAI's Whisper ASR to Russian- and English-language television news broadcasts. Today we apply it to a French-language broadcast from the Republic of Congo's national television channel Télé Congo, aired January 1, 2012 at 9:01 PM WAT. This 30-minute news broadcast offers a range of speakers, the majority of whom speak in outdoor environments with substantial background noise, captured by several different microphones and recording devices, making it an ideal test of Whisper's ability to process audio outside the controlled confines of the studio. Given that French has Whisper's 9th-lowest word error rate (WER), this broadcast also pits Whisper's traditionally strong accuracy on French content against the challenges of the field-captured audio that defines this broadcast.

Key findings are that Whisper's non-deterministic output and unpredictable artifacts can yield highly dissimilar outputs between runs, with outputs from some runs being so topically different that they almost seem to be from a different broadcast. It is not just translation that differs sharply between runs – transcription can also suffer from existential artifacting, with one Small transcription run devolving into repetition for five and a half minutes of the 30-minute broadcast. While the Large model produced more fluent output, it also suffered from several areas of strong topical translation mismatch, in which the same airtime was rendered as very different passages bearing little resemblance to one another.

The hardware and software environment used to run Whisper here is identical to that of our previous deep dives. The only difference in this case is that instead of first converting the broadcast to a single-channel FLAC file, we ran Whisper directly on the MPEG-4 file to simplify the workflow:

time whisper ./TELECONGO_20120101_200100.mp4 --model small > TELECONGO_20120101_200100.small.transcribe.txt
time whisper ./TELECONGO_20120101_200100.mp4 --model small --task translate > TELECONGO_20120101_200100.small.translate.txt

In our first test of Whisper on the broadcast, using the Small model with the transcription task, Whisper entered a repetition state around 30 seconds into the broadcast and continued to repeat the same statements for the next five and a half minutes. This is an important finding: even transcription of one of Whisper's most accurate languages can suffer from unpredictable and existential repetition artifacts.

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: French
[00:00.000 --> 00:05.360]  Vraiment dans la joie de voir parce que quand on partait dans les trains, il y avait des
[00:05.360 --> 00:06.360]  tracas serrés.
[00:06.360 --> 00:11.440]  Quand on a beaucoup de voies, les gens peuvent surclé vraiment dans la tranquillité.
[00:11.440 --> 00:15.200]  Si tu as un moyen, tu peux prendre un moyen à Brasaville, tu arrives, c'est une bonne
[00:15.200 --> 00:16.200]  chose.
[00:16.200 --> 00:21.560]  Vraiment, c'est parce que dans les villages un peu enclavés, les gens ont vraiment des
[00:21.560 --> 00:27.200]  difficultés de venir vendre ici à Pointe-Nord, mais maintenant quand il y a les routes,
[00:27.200 --> 00:32.080]  il y a la facilité, il y a la facilité de rentrer la marchandise soit à Pointe-Nord,
[00:32.080 --> 00:34.960]  soit à Brasaville, c'est une bonne chose pour les routes.
[00:34.960 --> 00:37.200]  On a besoin de routes, qu'on voit c'est ça.
[00:37.200 --> 00:57.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[00:57.200 --> 01:11.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[01:11.200 --> 01:28.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[01:28.200 --> 01:52.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[01:52.200 --> 02:20.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[02:20.200 --> 02:49.200]  On a besoin de routes, on a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[02:49.200 --> 03:09.200]  On a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[03:09.200 --> 03:18.200]  On a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[03:18.200 --> 03:47.200]  On a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[03:47.200 --> 04:16.200]  On a besoin de routes à Pointe-Nord et à Brasaville.
[04:16.200 --> 04:45.200]  On a besoin de routes, on a besoin de routes à Pointe-Nord et à Brasaville.
[04:45.200 --> 05:04.200]  On a besoin de routes à Pointe-Nord et à Brasaville.
[05:04.200 --> 05:28.200]  On a besoin de routes à Pointe-Nord et à Brasaville.
[05:28.200 --> 05:57.200]  On a besoin de routes à Pointe-Nord et à Brasaville.
[05:57.200 --> 06:16.200]  Le plus grand obstacle, ce fut une grosse pierre informe qui taillait pour les besoins de la cause, se recomposait immédiatement au matin du jour suivant, obligeant les travailleurs à reprendre plusieurs fois et plusieurs jours la même opération, comme Vénélope, qui sont s'atoile ou scisif, et ce qu'à la danse à Pointe.
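Repetition loops like the one above are straightforward to flag automatically by scanning Whisper's timestamped output for runs of consecutive identical segment texts. The sketch below is illustrative only – the function name and the three-repeat threshold are our own assumptions, not part of the workflow used for this article:

```python
import re

# Matches Whisper's default "[MM:SS.mmm --> MM:SS.mmm]  text" output lines.
SEG = re.compile(r"\[(\d+:\d+\.\d+) --> (\d+:\d+\.\d+)\]\s*(.*)")

def repetition_runs(lines, min_repeats=3):
    """Return (start_ts, end_ts, text, count) for each run of consecutive
    segments whose text is exactly identical at least min_repeats times."""
    runs = []
    prev_text, start_ts, end_ts, count = None, None, None, 0
    for line in lines:
        m = SEG.match(line.strip())
        if not m:
            continue  # skip non-segment lines (headers, blanks)
        ts0, ts1, text = m.groups()
        if text == prev_text:
            count += 1
            end_ts = ts1
        else:
            if count >= min_repeats:
                runs.append((start_ts, end_ts, prev_text, count))
            prev_text, start_ts, end_ts, count = text, ts0, ts1, 1
    if count >= min_repeats:
        runs.append((start_ts, end_ts, prev_text, count))
    return runs
```

Run against the Small transcription output above, a detector of this kind would surface the "On a besoin de routes" loop as a single multi-minute run.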

As with our previous experiments, Whisper exhibits strongly non-deterministic output across all model sizes and tasks. Even the Large model can yield completely unrelated outputs between passes. Below is the same 20-second time period as translated by Whisper's Large model in two different runs. In all, 1,036 of the translation's 4,786 words (21.65%) changed between the two runs. In this example, the two translations bear no resemblance to one another, sharing nothing beyond a single mention of Pointe-en-Arc. Topically, they would appear to be from entirely different broadcasts. Yet they were generated simply by running Whisper on the same file twice in a row.

First run:

[00:52.000 --> 01:01.000]  Last year at the hospital, I took his sugar cane and sold it on the way.
[01:01.000 --> 01:06.000]  If I have a вообще, I would at least walk with him to Pointe-en-Arc.
[01:09.000 --> 01:15.000]  If the other day, he stopped after he had taken my case.

Second run:

[00:52.000 --> 00:58.000]  I think that in Ndjila, there is peace.
[00:58.000 --> 01:01.000]  In the villages, there are a lot of trucks.
[01:01.000 --> 01:06.000]  If you have a means, you can take it in Pointe-en-Arc.
[01:09.000 --> 01:12.000]  In Ndjila, there is peace.
[01:12.000 --> 01:15.000]  There is a lot of peace.
[01:15.000 --> 01:18.000]  In Ndjila, there is peace.
[01:18.000 --> 01:23.000]  I don't want to go back to Nanzami, I don't want to go back to Nanzami.
[01:24.000 --> 01:28.000]  I don't want to go back to Nanzami, I don't want to go back to Nanzami.
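Word-change percentages like the one cited above can be computed with a word-level diff. Below is a minimal sketch using Python's difflib; we do not claim this is the exact tooling used for the figures in this article, and the timestamp-stripping step is an assumption about the input format:

```python
import difflib
import re

def word_change_pct(text_a, text_b):
    """Percent of words in run A that have no match in run B,
    per a word-level longest-matching-blocks diff.

    Bracketed timestamps are stripped first so only transcript
    text is compared."""
    strip = lambda t: re.sub(r"\[[^\]]*\]", " ", t).split()
    a, b = strip(text_a), strip(text_b)
    matcher = difflib.SequenceMatcher(None, a, b)
    same = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * (len(a) - same) / len(a)
```

Applied to the full pair of Large translation outputs, a measure of this kind yields the run-to-run change rate reported above.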

To help assess the impacts of non-determinism and the prevalence of various kinds of transcription and translation artifacts, the broadcast was run through both transcription and translation tasks for each of the four models four times in a row. You can view each of the outputs and the pairwise "diff" comparisons below:
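The run matrix just described – each of the four models run four times in a row on both tasks – is easy to script. The sketch below simply builds the corresponding whisper command lines (the file name is taken from the commands above; the output naming convention is our own illustrative assumption), leaving execution to the reader:

```python
from itertools import product

SOURCE = "TELECONGO_20120101_200100.mp4"
MODELS = ["tiny", "small", "medium", "large"]
TASKS = ["transcribe", "translate"]
RUNS = 4

def build_commands():
    """Return (argv, output_filename) pairs covering every
    model/task/run combination: 4 models x 2 tasks x 4 runs = 32."""
    commands = []
    for model, task, run in product(MODELS, TASKS, range(1, RUNS + 1)):
        out = f"TELECONGO_20120101_200100.{model}.{task}.run{run}.txt"
        argv = ["whisper", f"./{SOURCE}", "--model", model, "--task", task]
        commands.append((argv, out))
    return commands

# Each (argv, out) pair could then be executed with, e.g.:
#   subprocess.run(argv, stdout=open(out, "w"), check=True)
# and the four per-configuration outputs compared pairwise with diff.
```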

Tiny Transcribe

Tiny Translate

Small Transcribe

Small Translate

Medium Transcribe

Medium Translate

Large Transcribe

Large Translate