A Behind-The-Scenes Look At The Visual Explorer's Fully Automated Transcription & Translation Workflow

Earlier today we announced that we have transcribed and translated all of the December 2022 broadcasts from 1TV, NTV, Russia 1, Russia 24, Espreso and Belarus24 monitored by the Internet Archive's Television News Archive. What does the pipeline powering this system look like? Despite its immense capability, the underlying technical pipeline is relatively simple, assembled from a number of preexisting building blocks.

  • Ingest. First, the video must be ingested into Google Cloud Storage (GCS) for processing. The video ingest node is a single-core N1 VM that continuously streams 33 HD video channels directly into GCS using streaming writes. This means the node consumes almost no local CPU or disk IO and could easily scale to ingesting many more channels.
  • Thumbnailing. Next, the video is converted from a linear 1D format into a 2D thumbnail grid to make it visually "skimmable." This is accomplished through an ffmpeg workflow that reads the video in streaming ingest mode, assembles the frames on a local RAM disk for maximum-throughput IO and writes the final results back to a public GCS bucket.
  • Transcription. The audio channel of the video is extracted using ffmpeg into a FLAC surrogate file and processed using Google's Cloud Speech-to-Text API to generate a precise native-language transcript. The API-generated transcript is split into blocks of words of roughly 50 characters each, which are written out as an SRT file and as an internal captioning lookup that powers the Visual Explorer interface.
  • Translation. The native-language transcript from Cloud Speech-to-Text is translated with Google's Cloud Translation API through a special workflow that transparently passes timecode information through the API. The output is then reassembled into a translated SRT file and internal captioning lookup.
  • Display. Finally, the Visual Explorer in-browser interface renders the translated transcript down the right-hand side of the thumbnail gallery to make it possible to perform simultaneous visual+textual search. For example, a searcher looking for "nuclear" being mentioned only alongside imagery of nuclear weapons detonations would use this display to separate those instances from other mentions of "nuclear." When playing any video clip, the translated transcript for that 30-second clip is displayed beneath the video, and onscreen closed captioning is displayed as a native browser overlay on top of the video using TextTracks constructed on-the-fly at playback time.
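
To make the transcription step more concrete, the grouping of transcribed words into blocks of roughly 50 characters and their conversion to SRT might look something like the sketch below. This is an illustration only, not the actual pipeline code: it assumes the speech recognizer returns a start and end offset in seconds for each word, and the names Word, Cue, group_words, srt_timestamp and to_srt are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """One recognized word with its time offsets (assumed input shape)."""
    text: str
    start: float  # seconds from the start of the broadcast
    end: float


@dataclass
class Cue:
    """One caption block spanning one or more consecutive words."""
    text: str
    start: float
    end: float


def group_words(words: Iterable[Word], max_chars: int = 50) -> List[Cue]:
    """Greedily pack consecutive words into cues of at most max_chars characters.

    A single word longer than max_chars still becomes its own cue.
    """
    cues: List[Cue] = []
    current: List[Word] = []
    length = 0
    for word in words:
        # Count one separating space for every word after the first in a cue.
        added = len(word.text) + (1 if current else 0)
        if current and length + added > max_chars:
            cues.append(Cue(" ".join(w.text for w in current),
                            current[0].start, current[-1].end))
            current, length = [], 0
            added = len(word.text)
        current.append(word)
        length += added
    if current:
        cues.append(Cue(" ".join(w.text for w in current),
                        current[0].start, current[-1].end))
    return cues


def srt_timestamp(seconds: float) -> str:
    """Format a time offset as the HH:MM:SS,mmm form used by SRT."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: Iterable[Cue]) -> str:
    """Render cues as SRT: index line, time range line, text, blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}\n"
            f"{cue.text}\n"
        )
    return "\n".join(blocks)
```

Because each cue keeps the start time of its first word and the end time of its last, the same list of cues could plausibly feed both the SRT file and a lookup keyed by time offset, though how the internal captioning lookup is actually structured is not described here.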

That's it!

You can explore the final results by looking at broadcasts from 1TV, NTV, Russia 1, Russia 24, Espreso and Belarus24 from December 2022 in the Visual Explorer!

View Translated Broadcasts In Visual Explorer.