Transcribing 2.5M Hours Of TV News: A First Look At The 139-Language Linguistic Breakdown Of Two Years Of A Russian TV News Channel

As we continue to apply CLD2 language detection across the more than 2.5 million hours of machine-transcribed global television news coverage from the Internet Archive's TV News Archive that we finished transcribing earlier this month, below is a first look at the 139 languages detected by CLD2 in the two years of Russia 1 that the Archive has monitored. The Archive began monitoring Russia 1 on March 26, 2022, just over a month after the Russian invasion of Ukraine, meaning that there are now nearly two complete years of coverage that we can use to understand the linguistic landscape of Russian television news. Recall that we used Google's Chirp LSM ASR model to machine transcribe the complete archive and then applied the CLD2 language detector to the resulting transcripts.

The end result is that over the last two years, just over 97.9M words have been spoken on Russia 1, totaling 643M characters and 1.16B bytes (since Cyrillic characters occupy more than one byte each) across 14,629 distinct broadcasts. Of those words, 98.57% were in Russian, yielding the unsurprising conclusion that Russian broadcasts are fairly monolingual. Yet our CLD2 language detection flagged a total of 139 distinct languages spoken across all of that content, most of them accounting for anywhere from a single word to a few hundred or a few thousand words. Interestingly, while English accounts for just 0.16% of all spoken words, 82.23% of all broadcasts contain at least one English word, demonstrating how common English loanwords are, especially for the names of global leaders and organizations.
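To make the four columns in the listing below concrete, here is a minimal Python sketch of how per-language word, character, byte, and show counts could be tallied. It is an illustration only, not the pipeline actually used here: the `(show_id, language, text)` span format, the `tally` function, the sample spans, whitespace word splitting, counting spaces as characters, and UTF-8 encoding are all assumptions.

```python
from collections import defaultdict


def tally(spans):
    """Aggregate per-language statistics.

    spans: iterable of (show_id, language, text) tuples. This input format
    is hypothetical; a real detector returns its own structures.
    """
    stats = defaultdict(lambda: {"words": 0, "chars": 0, "bytes": 0, "shows": set()})
    for show_id, language, text in spans:
        entry = stats[language]
        entry["words"] += len(text.split())          # whitespace-delimited words (assumption)
        entry["chars"] += len(text)                  # Unicode code points, spaces included
        entry["bytes"] += len(text.encode("utf-8"))  # Cyrillic letters take 2 bytes in UTF-8
        entry["shows"].add(show_id)                  # distinct broadcasts containing the language
    return stats


# Toy input: two broadcasts, each mostly Russian with one Latin-script name.
spans = [
    ("show1", "RUSSIAN", "Президент объявил"),
    ("show1", "ENGLISH", "Macron"),
    ("show2", "RUSSIAN", "Добрый вечер"),
    ("show2", "ENGLISH", "NATO"),
]

stats = tally(spans)
total_words = sum(s["words"] for s in stats.values())
total_shows = len({show_id for show_id, _, _ in spans})

for language, s in stats.items():
    word_pct = 100 * s["words"] / total_words
    show_pct = 100 * len(s["shows"]) / total_shows
    print(f"{language}: {s['words']} ({word_pct:.2f}%) words / {s['chars']} chars / "
          f"{s['bytes']} bytes / {len(s['shows'])} ({show_pct:.2f}%) shows")
```

In this toy input the Russian byte count is nearly double its character count while the English byte and character counts are identical, and English appears in every show despite contributing only a minority of the words, which is the same pattern seen at scale in the real figures.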

What accounts for the vast long tail of languages seen below? Many of these are simply CLD2 recognition errors on loanwords, the result of one or two words of English or another language appearing in the midst of a block of Russian. A single English word representing a first name or a location can easily cause a language misidentification, given that it can be nearly impossible to robustly identify the language of a single isolated word (especially for words that appear in many languages). For example, in the transcript "Президент Macron объявил", should "Macron" be labeled as English or French? In some cases, these loanwords are legitimately the name of a company or government official from a given country, yielding a correct, if misleading, language identification of that country's dominant language. Indeed, language misidentification (errors) and isolated proper names (correct, but misleading) account for many of the long-tail identifications below. At the same time, Chirp, like all Large Speech Models (LSMs), can randomly insert spurious isolated characters or entire words from unrelated languages in the midst of a transcription, meaning that some recognized languages below are correct identifications of text that Chirp transcribed incorrectly. Finally, and most fascinatingly, some of the most unexpected languages below represent accurate language detection of accurate transcriptions of speech that one would not expect to find on the channel, from brief clips of global events to excerpts of foreign-language content.
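The "Президент Macron объявил" example can be made concrete with a short Python sketch that labels each token by Unicode script and flags tokens from another script that are surrounded by Cyrillic. This is not a description of how CLD2 works internally; the helper names (`token_script`, `label_tokens`, `isolated_foreign_tokens`) are hypothetical, and the script label is simply taken from the first word of each letter's Unicode character name.

```python
import unicodedata


def token_script(token):
    """Return a coarse script label from the token's first alphabetic character."""
    for ch in token:
        if ch.isalpha():
            # e.g. "CYRILLIC CAPITAL LETTER PE" -> "CYRILLIC"
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "NONE"  # no letters at all (digits, punctuation)


def label_tokens(text):
    """Pair every whitespace-delimited token with its script label."""
    return [(tok, token_script(tok)) for tok in text.split()]


def isolated_foreign_tokens(text, main_script="CYRILLIC"):
    """Find tokens in another script whose neighbours are both in main_script."""
    labels = label_tokens(text)
    found = []
    for i, (tok, script) in enumerate(labels):
        if script in (main_script, "NONE"):
            continue
        # Treat the start and end of the text as main-script context.
        prev_script = labels[i - 1][1] if i > 0 else main_script
        next_script = labels[i + 1][1] if i + 1 < len(labels) else main_script
        if prev_script == main_script and next_script == main_script:
            found.append(tok)
    return found


print(label_tokens("Президент Macron объявил"))
print(isolated_foreign_tokens("Президент Macron объявил"))
```

Script alone is enough to isolate "Macron" as a Latin-script island, but it says nothing about whether the word is English, French, or any other Latin-script language, which is exactly the ambiguity a statistical detector faces when given a single isolated word.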

RUSSIAN: 96545798 (98.57%) words / 633547459 (98.51%) chars / 1142999293 (98.61%) bytes / 14628 (99.99%) shows
Unknown: 906777 (0.93%) words / 6612530 (1.03%) chars / 12009182 (1.04%) bytes / 12813 (87.59%) shows
SERBIAN: 162909 (0.17%) words / 1015512 (0.16%) chars / 1781750 (0.15%) bytes / 8274 (56.56%) shows
ENGLISH: 160576 (0.16%) words / 901489 (0.14%) chars / 921255 (0.08%) bytes / 12029 (82.23%) shows
UKRAINIAN: 32507 (0.03%) words / 203274 (0.03%) chars / 357812 (0.03%) bytes / 1736 (11.87%) shows
GERMAN: 10428 (0.01%) words / 66164 (0.01%) chars / 67162 (0.01%) bytes / 697 (4.76%) shows
LATIN: 6507 (0.01%) words / 63431 (0.01%) chars / 63890 (0.01%) bytes / 3400 (23.24%) shows
BULGARIAN: 8017 (0.01%) words / 54304 (0.01%) chars / 89534 (0.01%) bytes / 1424 (9.73%) shows
BELARUSIAN: 8125 (0.01%) words / 52714 (0.01%) chars / 92487 (0.01%) bytes / 952 (6.51%) shows
ITALIAN: 7827 (0.01%) words / 44936 (0.01%) chars / 45564 (0.00%) bytes / 305 (2.08%) shows
DANISH: 5519 (0.01%) words / 38842 (0.01%) chars / 39471 (0.00%) bytes / 2562 (17.51%) shows
FRENCH: 5649 (0.01%) words / 31851 (0.00%) chars / 32624 (0.00%) bytes / 452 (3.09%) shows
SPANISH: 4225 (0.00%) words / 23271 (0.00%) chars / 23865 (0.00%) bytes / 318 (2.17%) shows
CZECH: 3615 (0.00%) words / 20505 (0.00%) chars / 20923 (0.00%) bytes / 999 (6.83%) shows
ROMANIAN: 2397 (0.00%) words / 16659 (0.00%) chars / 23431 (0.00%) bytes / 881 (6.02%) shows
MACEDONIAN: 2286 (0.00%) words / 16412 (0.00%) chars / 26943 (0.00%) bytes / 974 (6.66%) shows
BASQUE: 2264 (0.00%) words / 16158 (0.00%) chars / 17006 (0.00%) bytes / 1375 (9.40%) shows
DUTCH: 3447 (0.00%) words / 15882 (0.00%) chars / 16341 (0.00%) bytes / 2239 (15.31%) shows
NORWEGIAN_N: 2204 (0.00%) words / 13505 (0.00%) chars / 13941 (0.00%) bytes / 765 (5.23%) shows
INDONESIAN: 2385 (0.00%) words / 13190 (0.00%) chars / 13481 (0.00%) bytes / 1334 (9.12%) shows
SCOTS: 1578 (0.00%) words / 11663 (0.00%) chars / 11762 (0.00%) bytes / 815 (5.57%) shows
SCOTS_GAELIC: 2572 (0.00%) words / 11177 (0.00%) chars / 11331 (0.00%) bytes / 780 (5.33%) shows
ARABIC: 2020 (0.00%) words / 10920 (0.00%) chars / 19732 (0.00%) bytes / 30 (0.21%) shows
MAURITIAN_CREOLE: 1256 (0.00%) words / 10826 (0.00%) chars / 11438 (0.00%) bytes / 456 (3.12%) shows
FAROESE: 1550 (0.00%) words / 10659 (0.00%) chars / 10983 (0.00%) bytes / 463 (3.16%) shows
PORTUGUESE: 2088 (0.00%) words / 10425 (0.00%) chars / 10902 (0.00%) bytes / 662 (4.53%) shows
UZBEK: 1365 (0.00%) words / 9761 (0.00%) chars / 13166 (0.00%) bytes / 647 (4.42%) shows
TAJIK: 1386 (0.00%) words / 9493 (0.00%) chars / 14705 (0.00%) bytes / 465 (3.18%) shows
GALICIAN: 1795 (0.00%) words / 9080 (0.00%) chars / 9246 (0.00%) bytes / 1120 (7.66%) shows
X_PIG_LATIN: 1138 (0.00%) words / 8258 (0.00%) chars / 8343 (0.00%) bytes / 674 (4.61%) shows
LATVIAN: 1160 (0.00%) words / 8175 (0.00%) chars / 8352 (0.00%) bytes / 672 (4.59%) shows
ESTONIAN: 1165 (0.00%) words / 7720 (0.00%) chars / 8455 (0.00%) bytes / 354 (2.42%) shows
BASHKIR: 999 (0.00%) words / 7611 (0.00%) chars / 12588 (0.00%) bytes / 500 (3.42%) shows
GEORGIAN: 1022 (0.00%) words / 7214 (0.00%) chars / 18884 (0.00%) bytes / 41 (0.28%) shows
NORWEGIAN: 1056 (0.00%) words / 6823 (0.00%) chars / 7058 (0.00%) bytes / 671 (4.59%) shows
BRETON: 1192 (0.00%) words / 6727 (0.00%) chars / 6837 (0.00%) bytes / 783 (5.35%) shows
LUXEMBOURGISH: 1113 (0.00%) words / 6680 (0.00%) chars / 6856 (0.00%) bytes / 861 (5.89%) shows
VOLAPUK: 1596 (0.00%) words / 6521 (0.00%) chars / 6588 (0.00%) bytes / 487 (3.33%) shows
IRISH: 1704 (0.00%) words / 6407 (0.00%) chars / 6421 (0.00%) bytes / 1077 (7.36%) shows
KHASI: 1159 (0.00%) words / 6191 (0.00%) chars / 6217 (0.00%) bytes / 517 (3.53%) shows
MONGOLIAN: 920 (0.00%) words / 5943 (0.00%) chars / 9052 (0.00%) bytes / 414 (2.83%) shows
TATAR: 937 (0.00%) words / 5737 (0.00%) chars / 7787 (0.00%) bytes / 523 (3.58%) shows
TURKISH: 962 (0.00%) words / 5687 (0.00%) chars / 6160 (0.00%) bytes / 417 (2.85%) shows
HUNGARIAN: 746 (0.00%) words / 5475 (0.00%) chars / 5522 (0.00%) bytes / 435 (2.97%) shows
SANSKRIT: 556 (0.00%) words / 5276 (0.00%) chars / 5302 (0.00%) bytes / 321 (2.19%) shows
WELSH: 779 (0.00%) words / 5235 (0.00%) chars / 5464 (0.00%) bytes / 468 (3.20%) shows
RHAETO_ROMANCE: 1151 (0.00%) words / 5206 (0.00%) chars / 5876 (0.00%) bytes / 854 (5.84%) shows
KYRGYZ: 789 (0.00%) words / 5030 (0.00%) chars / 7789 (0.00%) bytes / 332 (2.27%) shows
SLOVAK: 744 (0.00%) words / 4783 (0.00%) chars / 5132 (0.00%) bytes / 474 (3.24%) shows
POLISH: 684 (0.00%) words / 4713 (0.00%) chars / 4963 (0.00%) bytes / 149 (1.02%) shows
ESPERANTO: 561 (0.00%) words / 4244 (0.00%) chars / 4477 (0.00%) bytes / 357 (2.44%) shows
BISLAMA: 526 (0.00%) words / 4145 (0.00%) chars / 4217 (0.00%) bytes / 369 (2.52%) shows
HAUSA: 570 (0.00%) words / 4098 (0.00%) chars / 4146 (0.00%) bytes / 318 (2.17%) shows
LITHUANIAN: 597 (0.00%) words / 4012 (0.00%) chars / 4190 (0.00%) bytes / 214 (1.46%) shows
TURKMEN: 543 (0.00%) words / 3997 (0.00%) chars / 6546 (0.00%) bytes / 264 (1.80%) shows
YORUBA: 613 (0.00%) words / 3847 (0.00%) chars / 4621 (0.00%) bytes / 461 (3.15%) shows
SESELWA: 548 (0.00%) words / 3846 (0.00%) chars / 4011 (0.00%) bytes / 298 (2.04%) shows
AKAN: 656 (0.00%) words / 3706 (0.00%) chars / 3729 (0.00%) bytes / 153 (1.05%) shows
MALAGASY: 466 (0.00%) words / 3597 (0.00%) chars / 3655 (0.00%) bytes / 297 (2.03%) shows
RUNDI: 415 (0.00%) words / 3496 (0.00%) chars / 3506 (0.00%) bytes / 250 (1.71%) shows
KAZAKH: 479 (0.00%) words / 3466 (0.00%) chars / 6090 (0.00%) bytes / 104 (0.71%) shows
MAORI: 609 (0.00%) words / 3252 (0.00%) chars / 3508 (0.00%) bytes / 278 (1.90%) shows
ALBANIAN: 348 (0.00%) words / 3236 (0.00%) chars / 3467 (0.00%) bytes / 310 (2.12%) shows
AYMARA: 358 (0.00%) words / 3071 (0.00%) chars / 3255 (0.00%) bytes / 258 (1.76%) shows
MALAY: 474 (0.00%) words / 3070 (0.00%) chars / 3161 (0.00%) bytes / 313 (2.14%) shows
FINNISH: 432 (0.00%) words / 2944 (0.00%) chars / 3025 (0.00%) bytes / 254 (1.74%) shows
ICELANDIC: 239 (0.00%) words / 2924 (0.00%) chars / 3021 (0.00%) bytes / 177 (1.21%) shows
ARMENIAN: 489 (0.00%) words / 2913 (0.00%) chars / 5172 (0.00%) bytes / 17 (0.12%) shows
ABKHAZIAN: 356 (0.00%) words / 2889 (0.00%) chars / 5039 (0.00%) bytes / 110 (0.75%) shows
HMONG: 295 (0.00%) words / 2708 (0.00%) chars / 3148 (0.00%) bytes / 237 (1.62%) shows
SWEDISH: 374 (0.00%) words / 2679 (0.00%) chars / 2711 (0.00%) bytes / 233 (1.59%) shows
SLOVENIAN: 307 (0.00%) words / 2653 (0.00%) chars / 2694 (0.00%) bytes / 231 (1.58%) shows
BOSNIAN: 417 (0.00%) words / 2488 (0.00%) chars / 2549 (0.00%) bytes / 21 (0.14%) shows
GREENLANDIC: 279 (0.00%) words / 2419 (0.00%) chars / 2498 (0.00%) bytes / 176 (1.20%) shows
SOMALI: 340 (0.00%) words / 2314 (0.00%) chars / 2408 (0.00%) bytes / 203 (1.39%) shows
AFAR: 385 (0.00%) words / 2248 (0.00%) chars / 2410 (0.00%) bytes / 231 (1.58%) shows
UIGHUR: 287 (0.00%) words / 2057 (0.00%) chars / 3303 (0.00%) bytes / 157 (1.07%) shows
SWAHILI: 314 (0.00%) words / 2022 (0.00%) chars / 2067 (0.00%) bytes / 198 (1.35%) shows
AFRIKAANS: 240 (0.00%) words / 2013 (0.00%) chars / 2032 (0.00%) bytes / 148 (1.01%) shows
Chinese: 971 (0.00%) words / 2004 (0.00%) chars / 3913 (0.00%) bytes / 18 (0.12%) shows
VIETNAMESE: 312 (0.00%) words / 1821 (0.00%) chars / 1965 (0.00%) bytes / 184 (1.26%) shows
INTERLINGUE: 319 (0.00%) words / 1774 (0.00%) chars / 1791 (0.00%) bytes / 229 (1.57%) shows
MANX: 252 (0.00%) words / 1663 (0.00%) chars / 1684 (0.00%) bytes / 148 (1.01%) shows
OROMO: 187 (0.00%) words / 1579 (0.00%) chars / 1608 (0.00%) bytes / 110 (0.75%) shows
MALTESE: 166 (0.00%) words / 1504 (0.00%) chars / 1519 (0.00%) bytes / 118 (0.81%) shows
JAVANESE: 196 (0.00%) words / 1461 (0.00%) chars / 1484 (0.00%) bytes / 125 (0.85%) shows
CATALAN: 220 (0.00%) words / 1456 (0.00%) chars / 1524 (0.00%) bytes / 122 (0.83%) shows
FRISIAN: 170 (0.00%) words / 1439 (0.00%) chars / 1487 (0.00%) bytes / 106 (0.72%) shows
CORSICAN: 216 (0.00%) words / 1438 (0.00%) chars / 1459 (0.00%) bytes / 105 (0.72%) shows
SESOTHO: 235 (0.00%) words / 1432 (0.00%) chars / 1451 (0.00%) bytes / 139 (0.95%) shows
SAMOAN: 217 (0.00%) words / 1354 (0.00%) chars / 1396 (0.00%) bytes / 152 (1.04%) shows
HINDI: 294 (0.00%) words / 1353 (0.00%) chars / 3340 (0.00%) bytes / 15 (0.10%) shows
INUPIAK: 198 (0.00%) words / 1352 (0.00%) chars / 1356 (0.00%) bytes / 127 (0.87%) shows
QUECHUA: 198 (0.00%) words / 1324 (0.00%) chars / 1352 (0.00%) bytes / 132 (0.90%) shows
HEBREW: 234 (0.00%) words / 1304 (0.00%) chars / 2282 (0.00%) bytes / 8 (0.05%) shows
TONGA: 164 (0.00%) words / 1276 (0.00%) chars / 1351 (0.00%) bytes / 120 (0.82%) shows
X_KLINGON: 184 (0.00%) words / 1234 (0.00%) chars / 1256 (0.00%) bytes / 130 (0.89%) shows
AZERBAIJANI: 185 (0.00%) words / 1233 (0.00%) chars / 1394 (0.00%) bytes / 46 (0.31%) shows
SHONA: 206 (0.00%) words / 1194 (0.00%) chars / 1237 (0.00%) bytes / 142 (0.97%) shows
INTERLINGUA: 179 (0.00%) words / 1109 (0.00%) chars / 1130 (0.00%) bytes / 80 (0.55%) shows
GREEK: 220 (0.00%) words / 1049 (0.00%) chars / 1855 (0.00%) bytes / 31 (0.21%) shows
TAGALOG: 198 (0.00%) words / 1041 (0.00%) chars / 1096 (0.00%) bytes / 160 (1.09%) shows
Japanese: 372 (0.00%) words / 1019 (0.00%) chars / 2289 (0.00%) bytes / 22 (0.15%) shows
WOLOF: 203 (0.00%) words / 1018 (0.00%) chars / 1178 (0.00%) bytes / 162 (1.11%) shows
GUARANI: 179 (0.00%) words / 1009 (0.00%) chars / 1042 (0.00%) bytes / 131 (0.90%) shows
SUNDANESE: 154 (0.00%) words / 942 (0.00%) chars / 1006 (0.00%) bytes / 90 (0.62%) shows
XHOSA: 146 (0.00%) words / 894 (0.00%) chars / 934 (0.00%) bytes / 84 (0.57%) shows
ChineseT: 176 (0.00%) words / 892 (0.00%) chars / 2196 (0.00%) bytes / 19 (0.13%) shows
CROATIAN: 138 (0.00%) words / 826 (0.00%) chars / 837 (0.00%) bytes / 94 (0.64%) shows
PERSIAN: 164 (0.00%) words / 786 (0.00%) chars / 1400 (0.00%) bytes / 9 (0.06%) shows
LINGALA: 117 (0.00%) words / 786 (0.00%) chars / 817 (0.00%) bytes / 79 (0.54%) shows
TSWANA: 165 (0.00%) words / 731 (0.00%) chars / 787 (0.00%) bytes / 122 (0.83%) shows
CEBUANO: 81 (0.00%) words / 702 (0.00%) chars / 725 (0.00%) bytes / 62 (0.42%) shows
KINYARWANDA: 143 (0.00%) words / 700 (0.00%) chars / 721 (0.00%) bytes / 96 (0.66%) shows
PEDI: 115 (0.00%) words / 691 (0.00%) chars / 697 (0.00%) bytes / 98 (0.67%) shows
FIJIAN: 66 (0.00%) words / 492 (0.00%) chars / 519 (0.00%) bytes / 40 (0.27%) shows
WARAY_PHILIPPINES: 68 (0.00%) words / 482 (0.00%) chars / 487 (0.00%) bytes / 39 (0.27%) shows
HAITIAN_CREOLE: 97 (0.00%) words / 477 (0.00%) chars / 493 (0.00%) bytes / 71 (0.49%) shows
HAWAIIAN: 78 (0.00%) words / 432 (0.00%) chars / 454 (0.00%) bytes / 53 (0.36%) shows
OCCITAN: 67 (0.00%) words / 403 (0.00%) chars / 411 (0.00%) bytes / 47 (0.32%) shows
Korean: 75 (0.00%) words / 295 (0.00%) chars / 683 (0.00%) bytes / 10 (0.07%) shows
IGBO: 68 (0.00%) words / 277 (0.00%) chars / 284 (0.00%) bytes / 58 (0.40%) shows
TSONGA: 55 (0.00%) words / 275 (0.00%) chars / 288 (0.00%) bytes / 43 (0.29%) shows
NAURU: 38 (0.00%) words / 261 (0.00%) chars / 265 (0.00%) bytes / 28 (0.19%) shows
THAI: 63 (0.00%) words / 239 (0.00%) chars / 589 (0.00%) bytes / 2 (0.01%) shows
ZULU: 39 (0.00%) words / 226 (0.00%) chars / 241 (0.00%) bytes / 24 (0.16%) shows
ZHUANG: 35 (0.00%) words / 212 (0.00%) chars / 216 (0.00%) bytes / 20 (0.14%) shows
SINHALESE: 34 (0.00%) words / 191 (0.00%) chars / 495 (0.00%) bytes / 1 (0.01%) shows
GANDA: 32 (0.00%) words / 191 (0.00%) chars / 197 (0.00%) bytes / 21 (0.14%) shows
VENDA: 40 (0.00%) words / 138 (0.00%) chars / 143 (0.00%) bytes / 34 (0.23%) shows
SISWANT: 15 (0.00%) words / 104 (0.00%) chars / 106 (0.00%) bytes / 14 (0.10%) shows
NYANJA: 13 (0.00%) words / 95 (0.00%) chars / 95 (0.00%) bytes / 8 (0.05%) shows
BENGALI: 17 (0.00%) words / 91 (0.00%) chars / 123 (0.00%) bytes / 9 (0.06%) shows
BIHARI: 17 (0.00%) words / 75 (0.00%) chars / 177 (0.00%) bytes / 3 (0.02%) shows
KHMER: 8 (0.00%) words / 39 (0.00%) chars / 101 (0.00%) bytes / 1 (0.01%) shows
GUJARATI: 7 (0.00%) words / 29 (0.00%) chars / 75 (0.00%) bytes / 1 (0.01%) shows
SINDHI: 5 (0.00%) words / 22 (0.00%) chars / 31 (0.00%) bytes / 2 (0.01%) shows
MARATHI: 2 (0.00%) words / 16 (0.00%) chars / 40 (0.00%) bytes / 1 (0.01%) shows
SANGO: 1 (0.00%) words / 11 (0.00%) chars / 11 (0.00%) bytes / 1 (0.01%) shows
TOTAL: 97941975 words / 643132321 chars / 1159093605 bytes / 14629 shows
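
For readers who want to work with the listing above programmatically, the following Python sketch parses rows in this format and recomputes percentages from the TOTAL line. The regular expressions and the `parse` and `percent` helpers are assumptions written for this listing's layout, not part of any published tooling, and only three lines are included as sample input.

```python
import re

# One language row: "NAME: N (P%) words / N (P%) chars / N (P%) bytes / N (P%) shows"
ROW = re.compile(
    r"^(?P<lang>\w+): "
    r"(?P<words>\d+) \([\d.]+%\) words / "
    r"(?P<chars>\d+) \([\d.]+%\) chars / "
    r"(?P<bytes>\d+) \([\d.]+%\) bytes / "
    r"(?P<shows>\d+) \([\d.]+%\) shows$"
)
# The final summary row carries no percentages.
TOTAL = re.compile(
    r"^TOTAL: (?P<words>\d+) words / (?P<chars>\d+) chars / "
    r"(?P<bytes>\d+) bytes / (?P<shows>\d+) shows$"
)
FIELDS = ("words", "chars", "bytes", "shows")


def parse(lines):
    """Return ({language: counts}, total_counts) from listing lines."""
    rows, total = {}, None
    for line in lines:
        line = line.strip()
        match = TOTAL.match(line)
        if match:
            total = {field: int(match[field]) for field in FIELDS}
            continue
        match = ROW.match(line)
        if match:
            rows[match["lang"]] = {field: int(match[field]) for field in FIELDS}
    return rows, total


def percent(rows, total, lang, field):
    """Share of `field` attributed to `lang`, rounded to two decimals."""
    return round(100 * rows[lang][field] / total[field], 2)


# Three lines copied from the listing above.
sample = [
    "RUSSIAN: 96545798 (98.57%) words / 633547459 (98.51%) chars / 1142999293 (98.61%) bytes / 14628 (99.99%) shows",
    "ENGLISH: 160576 (0.16%) words / 901489 (0.14%) chars / 921255 (0.08%) bytes / 12029 (82.23%) shows",
    "TOTAL: 97941975 words / 643132321 chars / 1159093605 bytes / 14629 shows",
]

rows, total = parse(sample)
print(percent(rows, total, "ENGLISH", "words"))  # 0.16
print(percent(rows, total, "ENGLISH", "shows"))  # 82.23
```

Note that the "shows" percentages are not expected to sum to 100% across languages, since a single broadcast can contain many languages and is counted once for each.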