AI Challenges: Speech Recognition In Multilingual Societies: From Multiple Languages To Code Switching

Earlier today we released an early experiment applying Google's Speech-to-Text API to the 18 sample broadcasts from our new EMEA TV News Archive collaboration with the Internet Archive's Television News Archive. One of the most fascinating findings from this experiment was the degree to which television news channels in highly multilingual societies use multiple languages within their broadcasts. This marks a sharp departure from the simultaneous-translation overdubbing favored by American English-language television news broadcasters and represents a fascinating, underdeveloped area of research in neural speech recognition. Such explorations are a hallmark of GDELT's work to globalize the field of analytics and AI by broadening awareness of the rich diversity of news content around the world.

We've identified four core challenges that we'd love to see solutions for:

  • Language-Specific Shows. Some channels broadcast different shows in different languages, each targeted at a different language-speaking swath of society: a French show might be followed by an Arabic show and then an English show, all on the same channel. While we currently lack EPG program information for most of these channels, its availability would allow us to set per-show language defaults (see the first sketch after this list). However, the lack of robust automatic spoken language detection spanning the full range of languages in our archive means it would be difficult to automatically set the primary language of new shows as they debut.
  • Sequential Translation. In place of the simultaneous-translation overdubbing favored by US English-language broadcasters, some channels feature a statement in one language and then repeat the same statement in a second or third language in sequence. This is currently outside the capabilities of SOTA ASR systems, but over time could become more tractable, since the broadcast could be segmented into language-specific contiguous spans (see the second sketch after this list).
  • Multiple Speakers, Multiple Languages. Some shows feature multiple speakers who each speak a single language for the duration of the show. For example, a panel might feature a French speaker and an Arabic speaker, each speaking only that language, with the result that two languages are interspersed within the same segment. As the quality of automated speaker diarization continues to improve, one approach would be to divide a broadcast by speaker, determine each speaker's primary language, and transcribe each speaker's audio separately (see the third sketch after this list). This still requires robust all-languages spoken language detection, which is not yet widely available but should become more accessible over time.
  • Code Switching. The most difficult challenge of all is code switching, in which the same speaker alternates between two or more languages. This can take the form of alternating sentences, but more typically involves nearly seamless integration of individual words and word spans. No current production SOTA ASR system robustly supports fully automated transcription of code-switched audio. Speaker diarization is not by itself a solution, since each speaker can speak multiple languages. The same utterance can even carry different meanings in the code-switched languages, so solutions must take into account the local and global context of each utterance rather than simply performing localized language detection on each utterance span (see the final sketch after this list).
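
To make the first challenge concrete, below is a minimal sketch of EPG-driven per-show language defaults using the Google Cloud Speech-to-Text client library. The EPG_DEFAULTS table, channel name, and show titles are purely hypothetical placeholders; the alternative_language_codes parameter is a real feature of the API (exposed in its v1p1beta1 surface) that lets the recognizer fall back to the channel's other languages when the EPG default turns out to be wrong.

```python
# A minimal sketch, assuming a hypothetical EPG_DEFAULTS lookup table;
# the Google Cloud Speech-to-Text calls themselves are real.
from google.cloud import speech_v1p1beta1 as speech

# Hypothetical EPG table: (channel, show title) -> default language code.
EPG_DEFAULTS = {
    ("EXAMPLE24", "Le Journal"): "fr-FR",
    ("EXAMPLE24", "Nashrat al-Akhbar"): "ar-MA",
    ("EXAMPLE24", "Evening News"): "en-US",
}
# Every language the channel is known to broadcast in.
CHANNEL_LANGUAGES = {"fr-FR", "ar-MA", "en-US"}

def transcribe_show(gcs_uri, channel, show, fallback="fr-FR"):
    """Seed recognition with the show's EPG default language, listing
    the channel's remaining languages as alternatives."""
    primary = EPG_DEFAULTS.get((channel, show), fallback)
    config = speech.RecognitionConfig(
        language_code=primary,
        alternative_language_codes=sorted(CHANNEL_LANGUAGES - {primary}),
        enable_automatic_punctuation=True,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = speech.SpeechClient().long_running_recognize(
        config=config, audio=audio
    )
    return operation.result(timeout=3600).results
```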
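
For the second challenge, once per-result language detection is available, sequential translation reduces to collapsing consecutive same-language results into contiguous spans. A minimal sketch, assuming result objects that carry a language_code field, as the API's multi-language results do when alternative_language_codes is set:

```python
from itertools import groupby

def language_spans(results):
    """Collapse consecutive recognition results that share a detected
    language into (language_code, transcript) spans, so a statement
    delivered in French and then repeated in Arabic and English falls
    out as three separate runs."""
    spans = []
    for lang, run in groupby(results, key=lambda r: r.language_code):
        spans.append((lang, " ".join(r.alternatives[0].transcript for r in run)))
    return spans
```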
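
The divide-by-speaker approach in the third challenge is likewise easy to sketch as a pipeline, even though its hardest stage does not yet exist in robust form. In the sketch below the three stages are injected as callables, and all three names (diarize, detect_language, transcribe) are hypothetical placeholders rather than any real library's API:

```python
# A minimal sketch of the divide-by-speaker pipeline; diarize(),
# detect_language(), and transcribe() are hypothetical stand-ins for
# components of varying maturity.
from collections import defaultdict

def transcribe_by_speaker(audio, diarize, detect_language, transcribe):
    """Split a broadcast by speaker, infer each speaker's single
    language, and transcribe each speaker's segments monolingually."""
    # diarize() is assumed to yield (speaker_id, audio_segment) turns.
    by_speaker = defaultdict(list)
    for speaker_id, segment in diarize(audio):
        by_speaker[speaker_id].append(segment)

    transcripts = {}
    for speaker_id, segments in by_speaker.items():
        # Pooling all of a speaker's turns before detection is safe only
        # because each panelist speaks exactly one language per show.
        language = detect_language(segments)
        transcripts[speaker_id] = [
            transcribe(segment, language_code=language) for segment in segments
        ]
    return transcripts
```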
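
Finally, a toy illustration of why localized language detection falls short for code switching. It assumes a hypothetical span_language_scores() that returns per-language probabilities for a short utterance span; spans whose top two languages score nearly alike are exactly the utterances that can only be resolved from their local and global context, not from the span itself:

```python
def ambiguous_spans(spans, span_language_scores, margin=0.15):
    """Flag spans that a purely local detector cannot settle on its own."""
    flagged = []
    for span in spans:
        probs = sorted(span_language_scores(span).values(), reverse=True)
        # A near-tie between the top two languages means the span alone
        # does not determine the language; context has to decide.
        if len(probs) > 1 and probs[0] - probs[1] < margin:
            flagged.append(span)
    return flagged
```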