Last week we unveiled the new TV News Visual Explorer, which visualizes the Internet Archive Television News Archive's new Belarusian, Russian & Ukrainian TV News Archive. These channels all lack closed captioning, meaning they can be searched visually to identify major visual narratives, but for users who don't speak Russian or Ukrainian, the lack of a transcript makes it difficult to understand what is being said beyond imagery and facial expressions.
How might automated speech recognition (ASR) help unlock such archives, and, down the road, make them accessible to non-native speakers through the power of machine translation? Google's Speech-to-Text API today supports more than 125 languages and variants, including both Russian and Ukrainian, meaning the tools already exist to transcribe an entire broadcast archive at scale across languages.
As a first glimpse of what this rich AI-powered future might look like, we applied the GCP Speech-to-Text API to all Russian- and Ukrainian-language broadcasts in the TV Visual Explorer monitored by the Internet Archive's Television News Archive, covering the following channels from their start dates through the end of day UTC on June 6, 2022:
- 1TV: March 26, 2022 to June 6, 2022
- NTV: March 26, 2022 to June 6, 2022
- RUSSIA1: March 26, 2022 to June 6, 2022
- RUSSIA24: April 25, 2022 to June 6, 2022
- ESPRESO: April 25, 2022 to June 6, 2022
- BELARUSTV: May 16, 2022 to June 6, 2022 (Russian-language channel)
In all, 13,391 transcripts were machine-generated through the Speech-to-Text API, which represents the state of the art in multilingual machine speech recognition today.
Since the API is designed for audio files, we first had to extract the audio track from each broadcast. We used gsutil to stream each video from GCS through ffmpeg running on a GCE VM, streaming the resulting FLAC file back to GCS in a single stream-processing pipeline. Since the broadcast audio is stereo, merging the left and right channels into a combined mono channel could create artifacts due to stereo anticorrelation, so here we extract just the left channel. The final pipeline is a single one-liner:
gsutil -q cat [BUCKET]/VIDEO.mp4 \
  | ffmpeg -nostdin -threads 1 -hide_banner -loglevel panic -i - \
      -filter_complex "[0:a]channelsplit=channel_layout=stereo:channels=FL[left]" \
      -map "[left]" -f flac pipe: \
  | gsutil -q cp - [BUCKET]/VIDEO.flac
We then submit each FLAC file to the Speech-to-Text API using the following configuration. We enable automatic punctuation and request per-word timecodes and confidence scores, and we ask the API to return alternative transcriptions where they exist. For long-form broadcast audio like this, the best-suited model is "latest_long", which is designed specifically for this kind of content.
For the Ukrainian broadcasts the language code was set to "uk-UA" and for the Russian broadcasts to "ru-RU". Since BELARUSTV is a Russian-language Belarusian channel, it also uses the Russian model.
curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://speech.googleapis.com/v1/speech:longrunningrecognize --data "{
    'config': {
      'encoding': 'FLAC',
      'languageCode': '[LANGCODE]',
      'enableWordTimeOffsets': true,
      'enableWordConfidence': true,
      'enableAutomaticPunctuation': true,
      'maxAlternatives': 30,
      'model': 'latest_long'
    },
    'audio': { 'uri': '[BUCKET]/VIDEO.flac' },
    'output_config': { 'gcs_uri': '[BUCKET]/VIDEO.stt.latest_long.json' }
  }"
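Note that longrunningrecognize returns the name of a long-running operation rather than the transcript itself; since the "output_config" above writes the finished transcript directly to GCS, the operation only needs to be polled to know when it has completed. A minimal status check (with [OPERATION_NAME] standing in for the operation name returned by the call above) looks like:

curl -s -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://speech.googleapis.com/v1/operations/[OPERATION_NAME]

The response includes "done": true once the transcript has been written to GCS.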
Below you can see the first few lines of the transcript for one of the broadcasts (ESPRESO_20220606_220000, viewable in the TV Visual Explorer), showing what the various fields look like. The API breaks the transcript into blocks of sentences, each with a combined "transcript" field and its overall confidence, followed by the individual words with their respective confidence scores and start/end times.
{ "results": [ { "alternatives": [ { "confidence": 0.91855603, "transcript": "так має свою волю і може виражати і ніякі Ці автомати нічого цього не зможе з цим зробити зараз планується такий тур по Україні для підтримку ЗСУ тобто я всі забуду заробляти гроші і всі які зароблені гроші я буду тривати наших хлопцям щоб з вами були руські виникла рани напередодні сім'я горбовських прибула в евакуаційним потягом до івано-франківська І тепер обживається в одному з населених пунктів Прикарпаття а самці уже будує оптимістичні плани на майбутнє", "words": [ { "confidence": 0.92401659, "endTime": "0.100s", "startTime": "0s", "word": "так" }, { "confidence": 0.92401659, "endTime": "0.900s", "startTime": "0.100s", "word": "має" }, { "confidence": 0.92401659, "endTime": "1.200s", "startTime": "0.900s", "word": "свою" }, { "confidence": 0.92401659, "endTime": "1.500s", "startTime": "1.200s", "word": "волю" }, { "confidence": 0.92401659, "endTime": "1.800s", "startTime": "1.500s", "word": "і" }, ...
In the coming weeks we will be computing these transcripts in real time and integrating them into the TV Visual Explorer interface, but in the meantime we've released the entire archive of 13,391 transcripts as a downloadable ZIP file to jumpstart research into the spoken-word narratives of the war.
The Speech-to-Text API computes word offsets relative to the start of the video file. Thus, in the example above, the word "так" has a start time of 0s and an end time of 0.1s. To listen to a given word or sentence in the TV Visual Explorer, you'll need to convert these relative offsets into Unix timestamps (seconds since the epoch) by adding each offset to the UTC start time of the show, which is encoded in its Internet Archive identifier. In this case the show identifier is "ESPRESO_20220606_220000", which is parsed as CHANNEL_YYYYMMDD_HHMMSS in UTC time, meaning the broadcast aired at 22:00:00 (10PM) UTC on June 6, 2022, corresponding to a Unix timestamp of 1654552800 for the start of the broadcast. This is the base timestamp to which each JSON "startTime" offset is added. The Visual Explorer only supports whole seconds, so round each offset down to the nearest integer.
So, to play a clip of the broadcast starting at the word "волю", the transcript above shows that it starts at 1.2 seconds, which rounds down to 1 second. Adding that to the broadcast's base timestamp of 1654552800 yields 1654552800 + 1 = 1654552801.
This is the timestamp to append to the Visual Explorer URL for the broadcast with the "&play" parameter:
- https://api.gdeltproject.org/api/v2/tvv/tvv?id=ESPRESO_20220606_220000&play=1654552801
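Putting the pieces together, here is a small shell sketch (assuming GNU date and the example broadcast above) that derives the base timestamp from the identifier and constructs the playback URL:

ID="ESPRESO_20220606_220000"
D=$(echo "$ID" | cut -d_ -f2)    # date portion: 20220606
T=$(echo "$ID" | cut -d_ -f3)    # time portion: 220000
BASE=$(date -u -d "$D ${T:0:2}:${T:2:2}:${T:4:2}" +%s)    # 1654552800
OFFSET=1    # "волю" starts at 1.2s in the transcript, rounded down
echo "https://api.gdeltproject.org/api/v2/tvv/tvv?id=${ID}&play=$((BASE + OFFSET))"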
We are incredibly excited about the fundamentally new kinds of research these transcripts will enable. Remember that these transcripts are 100% machine-generated and will thus contain a certain level of error. In cases of unclear audio or ambiguous pronunciation, the JSON may contain multiple slightly different alternative "transcript" blocks, each with its own confidence score, which can be combined with external or contextual information downstream for disambiguation.
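One simple way to surface these competing hypotheses (again just a jq sketch against the JSON layout shown earlier) is to list every alternative in each ambiguous block alongside its confidence:

jq -r '.results[] | select((.alternatives | length) > 1) | .alternatives[] | [(.confidence // "n/a"), .transcript] | @tsv' VIDEO.stt.latest_long.json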
Since these transcripts are completely automated, with no human correction of any kind, you should use them only as a guide to the underlying content and rely on human native speakers for verification, and for reporting and commentary where absolute accuracy is required.
Download The Full Belarusian, Russian & Ukrainian TV News Archive Speech Recognition Transcript Collection. (ZIP: 738MB compressed / 8.3GB uncompressed)