How Google's Speech To Text API Hears Television News From Around The World

With the release of the new EMEA Television News Archive on Monday, we wanted to explore how well state-of-the-art speech transcription (ASR) works on the rich diversity of languages, dialects and accents found in television news from around the world. Today we're releasing fully automated transcripts generated by Google's Speech to Text API for the two sample broadcasts for each EMEA channel from Monday's blog post: 18 broadcasts from 9 channels.

The Speech to Text (STT) API does not currently support extracting audio directly from video files, so we took the MP4 files that we created yesterday for use with the Google Video AI API and extracted their audio into standalone FLAC files. Because the STT API requires monaural audio, we extracted only the left channel to avoid correlation effects:

time find *.mp4 | parallel --eta 'ffmpeg -nostdin -hide_banner -loglevel panic -i ./{} -filter_complex "[0:a]channelsplit=channel_layout=stereo:channels=FL[left]" -map "[left]" -f flac ./{.}.flac'

We then submitted each audio file to the STT API using the following query:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H "x-goog-user-project: [YOURPROJECTID]" https://speech.googleapis.com/v1/speech:longrunningrecognize --data "{
  'config': {
    'encoding': 'FLAC',
    'languageCode': '[LANGCODE]',
    'enableWordTimeOffsets': true,
    'enableWordConfidence': true,
    'enableAutomaticPunctuation': true,
    'maxAlternatives': 30,
    'model': 'latest_long'
  },
  'audio': {
    'uri':'gs://[BUCKET]/[SHOW].flac'
  },
  'output_config': {
     'gcs_uri':'gs://[BUCKET]/[SHOW].stt.latest_long.json'
  }
}"
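For batch submission across many broadcasts, the same request body can be assembled programmatically. The following is a minimal sketch rather than the tooling actually used here: the helper name `build_stt_request` and its parameters are hypothetical, and only the field names and values are taken from the query above.

```python
import json


def build_stt_request(bucket: str, show: str, language_code: str) -> dict:
    """Assemble the longrunningrecognize request body shown in the curl query.

    Hypothetical helper: the bucket/show naming simply mirrors the
    gs://[BUCKET]/[SHOW].flac placeholders used above.
    """
    return {
        "config": {
            "encoding": "FLAC",
            "languageCode": language_code,
            "enableWordTimeOffsets": True,
            "enableWordConfidence": True,
            "enableAutomaticPunctuation": True,
            "maxAlternatives": 30,
            "model": "latest_long",
        },
        "audio": {"uri": f"gs://{bucket}/{show}.flac"},
        "output_config": {
            "gcs_uri": f"gs://{bucket}/{show}.stt.latest_long.json"
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    body = build_stt_request("example-bucket", "EXAMPLE_SHOW", "fr-FR")
    print(json.dumps(body, indent=2))
```

The resulting JSON can be written to a file and passed to curl with `--data @file.json`, which also sidesteps shell quoting issues.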

We use the "latest_long" model, which selects the most recent available model tailored for long-form spoken-word content and is roughly equivalent to what was previously called the "video" model. Where the API identifies multiple possible transcriptions for a given utterance, we request up to 30 alternatives, ordered by confidence. To avoid receiving a single massive blob of text, we enable automatic punctuation, which splits the text into sentences. We also ask the API to return the precise timestamp of each recognized word and its confidence in that word.
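To make the per-word timestamps and confidences concrete, here is a rough sketch of how they might be read back out of a result file. It assumes the output JSON has a top-level `results` list whose entries carry `alternatives`, each with a `words` list using the camelCase fields `word`, `startTime`, `endTime` and `confidence`, with times encoded as strings such as "1.500s". Those field names are an assumption about the output layout and should be checked against an actual file; the sample values below are invented for illustration and are not taken from any broadcast.

```python
def parse_seconds(duration: str) -> float:
    """Convert a duration string such as '1.500s' into float seconds."""
    return float(duration.rstrip("s"))


def iter_words(response: dict):
    """Yield (word, start_s, end_s, confidence) for the top alternative of each result.

    Assumes the field names described in the lead-in (an assumption, not
    confirmed by the article).
    """
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for w in alternatives[0].get("words", []):
            yield (
                w["word"],
                parse_seconds(w["startTime"]),
                parse_seconds(w["endTime"]),
                w.get("confidence"),
            )


def low_confidence_words(response: dict, threshold: float = 0.5):
    """Return the words whose reported confidence falls below the threshold."""
    return [
        item for item in iter_words(response)
        if item[3] is not None and item[3] < threshold
    ]


# Invented sample data, for illustration only.
SAMPLE = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "Heureusement le prix",
                    "confidence": 0.9,
                    "words": [
                        {"word": "Heureusement", "startTime": "0s",
                         "endTime": "0.700s", "confidence": 0.9},
                        {"word": "le", "startTime": "0.700s",
                         "endTime": "0.900s", "confidence": 0.4},
                    ],
                }
            ]
        }
    ]
}

if __name__ == "__main__":
    for word, start, end, conf in iter_words(SAMPLE):
        print(f"{start:7.2f}-{end:7.2f}  {conf}  {word}")
```

Flagging low-confidence words in this way is one simple means of spotting passages where a proxy language or code switching may have degraded the transcript.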

You can see the alternative transcription process at work in the first transcribed sentence of Canal Algérie on June 19, 2022 at 6:00PM:

  • Final Transcription (confidence: 0.92451394): "Heureusement le prix fort en rétrogradant en division inférieure." (Google Translate: "Fortunately the high price by demoting to the lower division.")
  • Alternative 1 (confidence: 0.90036309): "Heureusement le prix fort en rétrogradant en division inférieur." (Google Translate: "Fortunately the high price by demoting to the lower division.") (The difference is that "inférieure" becomes "inférieur".)
  • Alternative 2 (confidence: 0.89144981): "Heureusement le prix fort en rétrogradant en division inférieures." (Google Translate: "Fortunately the high price by demoting to lower divisions.") (The difference is that "inférieure" becomes "inférieures".)
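Differences between alternatives like these are easy to miss by eye, so a word-level diff can help. The sketch below uses Python's standard `difflib` module; the function name `word_differences` is our own, and whitespace tokenization is a simplifying assumption that leaves punctuation attached to the adjacent word.

```python
import difflib


def word_differences(final: str, alternative: str):
    """Return (final_span, alternative_span) pairs wherever the transcripts differ."""
    a, b = final.split(), alternative.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    final = "Heureusement le prix fort en rétrogradant en division inférieure."
    alternatives = [
        "Heureusement le prix fort en rétrogradant en division inférieur.",
        "Heureusement le prix fort en rétrogradant en division inférieures.",
    ]
    for n, alt in enumerate(alternatives, start=1):
        print(f"Alternative {n}: {word_differences(final, alt)}")
```

For the three transcriptions above, each alternative differs from the final transcription only in its last word.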

The STT API does not currently support all of the dialects represented in this collection, so we chose the closest linguistic match in each case. For Sudan, STT does not support Sudanese Arabic, so we chose Egyptian Arabic, which is typically used as a proxy where Sudanese Arabic translation is not available. Unfortunately, STT offers no reasonable proxy for South Sudanese Arabic / Juba Arabic, so while we used STT's Egyptian Arabic model there as well, the results may be poorer, depending on the precise mixture of Sudanese and Juba Arabic spoken in each program.

A more complex challenge is the widespread use of code switching on several of the channels. In some cases, the speakers in an interview or panel each speak a different language or dialect, or fluently alternate between languages. At this time, STT does not support code switching, so we selected the primary language or dialect spoken on each channel and applied it to all of that channel's sample broadcasts. Some broadcasts alternate between segments in Arabic or French and English or other languages, or otherwise deviate from the primary language of their channel, but to replicate what a fully automated transcription workflow would look like in the absence of code switching support, we used only the primary channel language for each broadcast.

The language code selected for each channel was:

  • Canal Algérie: fr-FR
  • Jordan TV: ar-JO
  • Palestinian Satellite Channel: ar-PS
  • RTP Internacional: pt-PT
  • Sharjah TV: ar-AE
  • Southern Sudan Television: ar-EG
  • Sudan State TV: ar-EG
  • Télé Congo: fr-FR
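In an automated workflow, this mapping would typically live in a small lookup table. The sketch below is illustrative only: the names `CHANNEL_LANGUAGE_CODES`, `PROXY_CHANNELS` and `language_code_for` are hypothetical, while the channel names and codes are copied from the list above.

```python
# Channel-to-language-code mapping, copied from the list above.
CHANNEL_LANGUAGE_CODES = {
    "Canal Algérie": "fr-FR",
    "Jordan TV": "ar-JO",
    "Palestinian Satellite Channel": "ar-PS",
    "RTP Internacional": "pt-PT",
    "Sharjah TV": "ar-AE",
    "Southern Sudan Television": "ar-EG",
    "Sudan State TV": "ar-EG",
    "Télé Congo": "fr-FR",
}

# Channels transcribed with Egyptian Arabic as a proxy language,
# as described in the text.
PROXY_CHANNELS = {"Southern Sudan Television", "Sudan State TV"}


def language_code_for(channel: str) -> str:
    """Look up the STT language code for a channel, failing loudly if unknown."""
    try:
        return CHANNEL_LANGUAGE_CODES[channel]
    except KeyError:
        raise ValueError(f"No language code configured for channel: {channel!r}") from None


if __name__ == "__main__":
    for channel in sorted(CHANNEL_LANGUAGE_CODES):
        note = " (proxy language)" if channel in PROXY_CHANNELS else ""
        print(f"{channel}: {language_code_for(channel)}{note}")
```

Raising an error for unknown channels, rather than falling back to a default language, keeps a new channel from being silently transcribed with the wrong model.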

You can see the final results for each broadcast below.

Remember that these transcripts are the result of 100% automated processes. Some use proxy languages that may be too dissimilar to yield adequate results, and some channels feature extensive code switching or multiple spoken languages. These results are therefore only an initial demonstration of the potential of automated transcription to make television news from around the world more accessible to scholarly research, from keyword search to NLP analysis.