Visual Explorer: Continuous Automatic Transcription Of Belarusian, Russian And Ukrainian TV News

This past June we released a compilation of automatically generated transcripts for the Visual Explorer's Belarusian, Russian and Ukrainian television news channels, produced by Google's Speech-to-Text (STT) API. Since the Internet Archive's Television News Archive began monitoring these six channels earlier this year, we have been transcribing all of their shows through the STT API in a daily batch process. As of today, we are performing continuous automatic transcription of the entire airtime of those six channels.

As soon as a broadcast completes, the Internet Archive processes it into its archive, performing a number of tasks such as deriving an MPEG4 surrogate from the source video. Our television processing pipeline currently operates on a 10 minute heartbeat. Within 10 minutes of the Archive completing processing of a broadcast from these six channels, we ingest the MPEG4 file, and within 10 minutes after that we run it through the STT API, which takes around 1-2 minutes for a typical broadcast. The STT output is then picked up in our next 10 minute heartbeat and processed into a final consumable transcript ready for display in the Visual Explorer. Thus, within roughly 30 minutes of the Archive completing processing of a video, we have it transcribed and ready for display.
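To make the timing arithmetic concrete, here is a minimal Python sketch of the three heartbeat-driven stages described above. It is an illustration only, not our production pipeline: the function names are hypothetical, and it assumes that heartbeat ticks fall on fixed 10 minute boundaries, that each stage runs on the first tick after the previous stage finishes, and that STT takes 2 minutes (the upper end of the typical range).

```python
import math

HEARTBEAT_MIN = 10  # heartbeat interval in minutes, as stated in the post


def next_tick(t, heartbeat=HEARTBEAT_MIN):
    """Return the first heartbeat tick strictly after time t (in minutes).

    Assumption: ticks are aligned to multiples of the heartbeat interval.
    """
    return (math.floor(t / heartbeat) + 1) * heartbeat


def transcript_ready_time(archive_done, stt_minutes=2.0, heartbeat=HEARTBEAT_MIN):
    """Estimate when a transcript is ready for display (hypothetical model).

    archive_done: minute at which the Archive finishes processing a broadcast.
    stt_minutes:  assumed STT runtime for a typical broadcast.
    """
    ingest = next_tick(archive_done, heartbeat)   # stage 1: ingest the MPEG4 file
    stt_start = next_tick(ingest, heartbeat)      # stage 2: submit to the STT API
    stt_done = stt_start + stt_minutes            # STT runs for 1-2 minutes
    return next_tick(stt_done, heartbeat)         # stage 3: build final transcript


if __name__ == "__main__":
    for done in (0, 5, 9.9):
        ready = transcript_ready_time(done)
        print(f"archive done at {done} min, transcript ready at {ready} min "
              f"(latency {ready - done:.1f} min)")
```

Under these assumptions the end-to-end latency falls between roughly 20 and 30 minutes, depending on where in the heartbeat cycle the Archive finishes processing, which is consistent with the 30 minute figure above.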

Work is already underway to integrate these transcripts into the Visual Explorer, where Google Chrome's built-in "Translate into English" Google Translate integration translates them into English, as you can see in the example image at the top of this post. We will be rolling this out across all broadcasts from these six channels in the Visual Explorer in the coming weeks, so stay tuned!