As we continue to apply CLD2 language detection across the more than 2.5 million hours of machine-transcribed global television news coverage from the Internet Archive's TV News Archive that we finished transcribing earlier this month, below is a first look at the 139 languages CLD2 detected in the two years of Russia 1 that the Archive has monitored. The Archive began monitoring Russia 1 on March 26, 2022, just over a month after the Russian invasion of Ukraine, meaning that there are now nearly two complete years of coverage that we can use to understand the linguistic landscape of Russian television news. Recall that we used Google's Chirp LSM ASR model to machine transcribe the complete archive and then applied the CLD2 language detector to the resulting transcript archive.
The end result is that over the last two years, just over 97.9M words have been spoken on Russia 1, totaling 643M characters and 1.16B bytes (since Cyrillic is a multibyte character set) across 14,629 distinct broadcasts. Of those words, 98.57% were in Russian, yielding the unsurprising conclusion that Russian broadcasts are fairly monolingual. Yet our CLD2 language detection flagged a total of 139 distinct languages spoken amongst all of that content, most of them accounting for anywhere from a single word to a few hundred or a few thousand words. Interestingly, while English accounts for just 0.16% of all spoken words, 82.23% of all broadcasts contain at least one English word, demonstrating how common English loanwords are, especially for the names of global leaders and organizations.
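The percentages above can be reproduced directly from the raw counts in the summary list further down. The following is a minimal sketch of that arithmetic; the helper name `share` is ours, and the bytes-per-character comparison assumes the transcripts were measured as UTF-8, which the post does not state explicitly.

```python
# Counts copied from the RUSSIAN, ENGLISH and TOTAL entries of the summary list.
TOTAL_WORDS = 97_941_975
TOTAL_SHOWS = 14_629


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals (hypothetical helper)."""
    return round(100.0 * part / whole, 2)


russian_word_share = share(96_545_798, TOTAL_WORDS)   # share of all words labeled Russian
english_word_share = share(160_576, TOTAL_WORDS)      # share of all words labeled English
english_show_share = share(12_029, TOTAL_SHOWS)       # share of broadcasts with any English

# Average bytes per character: Cyrillic text needs about two bytes per letter
# in UTF-8 (an assumption about the encoding), Latin text about one.
russian_bytes_per_char = 1_142_999_293 / 633_547_459
english_bytes_per_char = 921_255 / 901_489

# The same effect on a single word: 7 characters either way, but twice the bytes.
cyrillic_word = "новости"
latin_word = "novosti"
cyrillic_sizes = (len(cyrillic_word), len(cyrillic_word.encode("utf-8")))
latin_sizes = (len(latin_word), len(latin_word.encode("utf-8")))

print(russian_word_share, english_word_share, english_show_share)
print(round(russian_bytes_per_char, 2), round(english_bytes_per_char, 2))
print(cyrillic_sizes, latin_sizes)
```

The gap between English's word share and its broadcast share is the key contrast: a language can be vanishingly rare by volume yet present in most shows.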
What accounts for the vast long tail of languages seen below? Many of these are simply CLD2 recognition errors on loanwords: the result of one or two words of English or another language appearing in the midst of a block of Russian. A single English word representing a first name or a location can easily cause a language misidentification, given that it can be nearly impossible to robustly identify the language of a single isolated word (especially for words that appear in many languages). For example, in the transcript "Президент Macron объявил", should "Macron" be labeled as English or French? In some cases, these loanwords are legitimately the name of a company or government official from a given country, yielding a correct, if misleading, identification of that country's dominant language. Indeed, language misidentification (errors) and isolated proper names (correct, but misleading) account for many of the long-tail identifications below. At the same time, Chirp, like all Large Speech Models (LSMs), can spuriously insert isolated characters or entire words from unrelated languages in the midst of a transcription, meaning that some recognized languages below are correct identifications of text that Chirp transcribed incorrectly. Finally, and most fascinatingly, some of the most unexpected languages below actually represent accurate language detection of accurate transcriptions of content that is simply unexpected on Russian television, from brief clips of global events to excerpts of foreign-language content.
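To make the loanword problem concrete, the sketch below finds Latin-script tokens that sit alone inside otherwise Cyrillic text, which are exactly the cases where a single word carries too little evidence for a reliable language label. This is an illustration only and not how CLD2 works internally; the function names `token_script` and `isolated_latin_tokens` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
import unicodedata


def token_script(token: str) -> str:
    """Label a token 'Cyrillic', 'Latin', 'Mixed' or 'Other' from its letters."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        else:
            scripts.add("Other")
    if not scripts:
        return "Other"
    if len(scripts) == 1:
        return scripts.pop()
    return "Mixed"


def isolated_latin_tokens(text: str) -> list:
    """Return Latin-script tokens with no Latin-script neighbor on either side."""
    tokens = text.split()
    labels = [token_script(t) for t in tokens]
    found = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        if label != "Latin":
            continue
        prev_label = labels[i - 1] if i > 0 else None
        next_label = labels[i + 1] if i + 1 < len(tokens) else None
        if prev_label != "Latin" and next_label != "Latin":
            found.append(token)
    return found


print(isolated_latin_tokens("Президент Macron объявил"))
```

Flagging such tokens before interpreting the per-language counts is one way to separate likely proper-name artifacts from longer foreign-language passages, which are much more likely to be genuine.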
RUSSIAN: 96545798 (98.57%) words / 633547459 (98.51%) chars / 1142999293 (98.61%) bytes / 14628 (99.99%) shows
Unknown: 906777 (0.93%) words / 6612530 (1.03%) chars / 12009182 (1.04%) bytes / 12813 (87.59%) shows
SERBIAN: 162909 (0.17%) words / 1015512 (0.16%) chars / 1781750 (0.15%) bytes / 8274 (56.56%) shows
ENGLISH: 160576 (0.16%) words / 901489 (0.14%) chars / 921255 (0.08%) bytes / 12029 (82.23%) shows
UKRAINIAN: 32507 (0.03%) words / 203274 (0.03%) chars / 357812 (0.03%) bytes / 1736 (11.87%) shows
GERMAN: 10428 (0.01%) words / 66164 (0.01%) chars / 67162 (0.01%) bytes / 697 (4.76%) shows
LATIN: 6507 (0.01%) words / 63431 (0.01%) chars / 63890 (0.01%) bytes / 3400 (23.24%) shows
BULGARIAN: 8017 (0.01%) words / 54304 (0.01%) chars / 89534 (0.01%) bytes / 1424 (9.73%) shows
BELARUSIAN: 8125 (0.01%) words / 52714 (0.01%) chars / 92487 (0.01%) bytes / 952 (6.51%) shows
ITALIAN: 7827 (0.01%) words / 44936 (0.01%) chars / 45564 (0.00%) bytes / 305 (2.08%) shows
DANISH: 5519 (0.01%) words / 38842 (0.01%) chars / 39471 (0.00%) bytes / 2562 (17.51%) shows
FRENCH: 5649 (0.01%) words / 31851 (0.00%) chars / 32624 (0.00%) bytes / 452 (3.09%) shows
SPANISH: 4225 (0.00%) words / 23271 (0.00%) chars / 23865 (0.00%) bytes / 318 (2.17%) shows
CZECH: 3615 (0.00%) words / 20505 (0.00%) chars / 20923 (0.00%) bytes / 999 (6.83%) shows
ROMANIAN: 2397 (0.00%) words / 16659 (0.00%) chars / 23431 (0.00%) bytes / 881 (6.02%) shows
MACEDONIAN: 2286 (0.00%) words / 16412 (0.00%) chars / 26943 (0.00%) bytes / 974 (6.66%) shows
BASQUE: 2264 (0.00%) words / 16158 (0.00%) chars / 17006 (0.00%) bytes / 1375 (9.40%) shows
DUTCH: 3447 (0.00%) words / 15882 (0.00%) chars / 16341 (0.00%) bytes / 2239 (15.31%) shows
NORWEGIAN_N: 2204 (0.00%) words / 13505 (0.00%) chars / 13941 (0.00%) bytes / 765 (5.23%) shows
INDONESIAN: 2385 (0.00%) words / 13190 (0.00%) chars / 13481 (0.00%) bytes / 1334 (9.12%) shows
SCOTS: 1578 (0.00%) words / 11663 (0.00%) chars / 11762 (0.00%) bytes / 815 (5.57%) shows
SCOTS_GAELIC: 2572 (0.00%) words / 11177 (0.00%) chars / 11331 (0.00%) bytes / 780 (5.33%) shows
ARABIC: 2020 (0.00%) words / 10920 (0.00%) chars / 19732 (0.00%) bytes / 30 (0.21%) shows
MAURITIAN_CREOLE: 1256 (0.00%) words / 10826 (0.00%) chars / 11438 (0.00%) bytes / 456 (3.12%) shows
FAROESE: 1550 (0.00%) words / 10659 (0.00%) chars / 10983 (0.00%) bytes / 463 (3.16%) shows
PORTUGUESE: 2088 (0.00%) words / 10425 (0.00%) chars / 10902 (0.00%) bytes / 662 (4.53%) shows
UZBEK: 1365 (0.00%) words / 9761 (0.00%) chars / 13166 (0.00%) bytes / 647 (4.42%) shows
TAJIK: 1386 (0.00%) words / 9493 (0.00%) chars / 14705 (0.00%) bytes / 465 (3.18%) shows
GALICIAN: 1795 (0.00%) words / 9080 (0.00%) chars / 9246 (0.00%) bytes / 1120 (7.66%) shows
X_PIG_LATIN: 1138 (0.00%) words / 8258 (0.00%) chars / 8343 (0.00%) bytes / 674 (4.61%) shows
LATVIAN: 1160 (0.00%) words / 8175 (0.00%) chars / 8352 (0.00%) bytes / 672 (4.59%) shows
ESTONIAN: 1165 (0.00%) words / 7720 (0.00%) chars / 8455 (0.00%) bytes / 354 (2.42%) shows
BASHKIR: 999 (0.00%) words / 7611 (0.00%) chars / 12588 (0.00%) bytes / 500 (3.42%) shows
GEORGIAN: 1022 (0.00%) words / 7214 (0.00%) chars / 18884 (0.00%) bytes / 41 (0.28%) shows
NORWEGIAN: 1056 (0.00%) words / 6823 (0.00%) chars / 7058 (0.00%) bytes / 671 (4.59%) shows
BRETON: 1192 (0.00%) words / 6727 (0.00%) chars / 6837 (0.00%) bytes / 783 (5.35%) shows
LUXEMBOURGISH: 1113 (0.00%) words / 6680 (0.00%) chars / 6856 (0.00%) bytes / 861 (5.89%) shows
VOLAPUK: 1596 (0.00%) words / 6521 (0.00%) chars / 6588 (0.00%) bytes / 487 (3.33%) shows
IRISH: 1704 (0.00%) words / 6407 (0.00%) chars / 6421 (0.00%) bytes / 1077 (7.36%) shows
KHASI: 1159 (0.00%) words / 6191 (0.00%) chars / 6217 (0.00%) bytes / 517 (3.53%) shows
MONGOLIAN: 920 (0.00%) words / 5943 (0.00%) chars / 9052 (0.00%) bytes / 414 (2.83%) shows
TATAR: 937 (0.00%) words / 5737 (0.00%) chars / 7787 (0.00%) bytes / 523 (3.58%) shows
TURKISH: 962 (0.00%) words / 5687 (0.00%) chars / 6160 (0.00%) bytes / 417 (2.85%) shows
HUNGARIAN: 746 (0.00%) words / 5475 (0.00%) chars / 5522 (0.00%) bytes / 435 (2.97%) shows
SANSKRIT: 556 (0.00%) words / 5276 (0.00%) chars / 5302 (0.00%) bytes / 321 (2.19%) shows
WELSH: 779 (0.00%) words / 5235 (0.00%) chars / 5464 (0.00%) bytes / 468 (3.20%) shows
RHAETO_ROMANCE: 1151 (0.00%) words / 5206 (0.00%) chars / 5876 (0.00%) bytes / 854 (5.84%) shows
KYRGYZ: 789 (0.00%) words / 5030 (0.00%) chars / 7789 (0.00%) bytes / 332 (2.27%) shows
SLOVAK: 744 (0.00%) words / 4783 (0.00%) chars / 5132 (0.00%) bytes / 474 (3.24%) shows
POLISH: 684 (0.00%) words / 4713 (0.00%) chars / 4963 (0.00%) bytes / 149 (1.02%) shows
ESPERANTO: 561 (0.00%) words / 4244 (0.00%) chars / 4477 (0.00%) bytes / 357 (2.44%) shows
BISLAMA: 526 (0.00%) words / 4145 (0.00%) chars / 4217 (0.00%) bytes / 369 (2.52%) shows
HAUSA: 570 (0.00%) words / 4098 (0.00%) chars / 4146 (0.00%) bytes / 318 (2.17%) shows
LITHUANIAN: 597 (0.00%) words / 4012 (0.00%) chars / 4190 (0.00%) bytes / 214 (1.46%) shows
TURKMEN: 543 (0.00%) words / 3997 (0.00%) chars / 6546 (0.00%) bytes / 264 (1.80%) shows
YORUBA: 613 (0.00%) words / 3847 (0.00%) chars / 4621 (0.00%) bytes / 461 (3.15%) shows
SESELWA: 548 (0.00%) words / 3846 (0.00%) chars / 4011 (0.00%) bytes / 298 (2.04%) shows
AKAN: 656 (0.00%) words / 3706 (0.00%) chars / 3729 (0.00%) bytes / 153 (1.05%) shows
MALAGASY: 466 (0.00%) words / 3597 (0.00%) chars / 3655 (0.00%) bytes / 297 (2.03%) shows
RUNDI: 415 (0.00%) words / 3496 (0.00%) chars / 3506 (0.00%) bytes / 250 (1.71%) shows
KAZAKH: 479 (0.00%) words / 3466 (0.00%) chars / 6090 (0.00%) bytes / 104 (0.71%) shows
MAORI: 609 (0.00%) words / 3252 (0.00%) chars / 3508 (0.00%) bytes / 278 (1.90%) shows
ALBANIAN: 348 (0.00%) words / 3236 (0.00%) chars / 3467 (0.00%) bytes / 310 (2.12%) shows
AYMARA: 358 (0.00%) words / 3071 (0.00%) chars / 3255 (0.00%) bytes / 258 (1.76%) shows
MALAY: 474 (0.00%) words / 3070 (0.00%) chars / 3161 (0.00%) bytes / 313 (2.14%) shows
FINNISH: 432 (0.00%) words / 2944 (0.00%) chars / 3025 (0.00%) bytes / 254 (1.74%) shows
ICELANDIC: 239 (0.00%) words / 2924 (0.00%) chars / 3021 (0.00%) bytes / 177 (1.21%) shows
ARMENIAN: 489 (0.00%) words / 2913 (0.00%) chars / 5172 (0.00%) bytes / 17 (0.12%) shows
ABKHAZIAN: 356 (0.00%) words / 2889 (0.00%) chars / 5039 (0.00%) bytes / 110 (0.75%) shows
HMONG: 295 (0.00%) words / 2708 (0.00%) chars / 3148 (0.00%) bytes / 237 (1.62%) shows
SWEDISH: 374 (0.00%) words / 2679 (0.00%) chars / 2711 (0.00%) bytes / 233 (1.59%) shows
SLOVENIAN: 307 (0.00%) words / 2653 (0.00%) chars / 2694 (0.00%) bytes / 231 (1.58%) shows
BOSNIAN: 417 (0.00%) words / 2488 (0.00%) chars / 2549 (0.00%) bytes / 21 (0.14%) shows
GREENLANDIC: 279 (0.00%) words / 2419 (0.00%) chars / 2498 (0.00%) bytes / 176 (1.20%) shows
SOMALI: 340 (0.00%) words / 2314 (0.00%) chars / 2408 (0.00%) bytes / 203 (1.39%) shows
AFAR: 385 (0.00%) words / 2248 (0.00%) chars / 2410 (0.00%) bytes / 231 (1.58%) shows
UIGHUR: 287 (0.00%) words / 2057 (0.00%) chars / 3303 (0.00%) bytes / 157 (1.07%) shows
SWAHILI: 314 (0.00%) words / 2022 (0.00%) chars / 2067 (0.00%) bytes / 198 (1.35%) shows
AFRIKAANS: 240 (0.00%) words / 2013 (0.00%) chars / 2032 (0.00%) bytes / 148 (1.01%) shows
Chinese: 971 (0.00%) words / 2004 (0.00%) chars / 3913 (0.00%) bytes / 18 (0.12%) shows
VIETNAMESE: 312 (0.00%) words / 1821 (0.00%) chars / 1965 (0.00%) bytes / 184 (1.26%) shows
INTERLINGUE: 319 (0.00%) words / 1774 (0.00%) chars / 1791 (0.00%) bytes / 229 (1.57%) shows
MANX: 252 (0.00%) words / 1663 (0.00%) chars / 1684 (0.00%) bytes / 148 (1.01%) shows
OROMO: 187 (0.00%) words / 1579 (0.00%) chars / 1608 (0.00%) bytes / 110 (0.75%) shows
MALTESE: 166 (0.00%) words / 1504 (0.00%) chars / 1519 (0.00%) bytes / 118 (0.81%) shows
JAVANESE: 196 (0.00%) words / 1461 (0.00%) chars / 1484 (0.00%) bytes / 125 (0.85%) shows
CATALAN: 220 (0.00%) words / 1456 (0.00%) chars / 1524 (0.00%) bytes / 122 (0.83%) shows
FRISIAN: 170 (0.00%) words / 1439 (0.00%) chars / 1487 (0.00%) bytes / 106 (0.72%) shows
CORSICAN: 216 (0.00%) words / 1438 (0.00%) chars / 1459 (0.00%) bytes / 105 (0.72%) shows
SESOTHO: 235 (0.00%) words / 1432 (0.00%) chars / 1451 (0.00%) bytes / 139 (0.95%) shows
SAMOAN: 217 (0.00%) words / 1354 (0.00%) chars / 1396 (0.00%) bytes / 152 (1.04%) shows
HINDI: 294 (0.00%) words / 1353 (0.00%) chars / 3340 (0.00%) bytes / 15 (0.10%) shows
INUPIAK: 198 (0.00%) words / 1352 (0.00%) chars / 1356 (0.00%) bytes / 127 (0.87%) shows
QUECHUA: 198 (0.00%) words / 1324 (0.00%) chars / 1352 (0.00%) bytes / 132 (0.90%) shows
HEBREW: 234 (0.00%) words / 1304 (0.00%) chars / 2282 (0.00%) bytes / 8 (0.05%) shows
TONGA: 164 (0.00%) words / 1276 (0.00%) chars / 1351 (0.00%) bytes / 120 (0.82%) shows
X_KLINGON: 184 (0.00%) words / 1234 (0.00%) chars / 1256 (0.00%) bytes / 130 (0.89%) shows
AZERBAIJANI: 185 (0.00%) words / 1233 (0.00%) chars / 1394 (0.00%) bytes / 46 (0.31%) shows
SHONA: 206 (0.00%) words / 1194 (0.00%) chars / 1237 (0.00%) bytes / 142 (0.97%) shows
INTERLINGUA: 179 (0.00%) words / 1109 (0.00%) chars / 1130 (0.00%) bytes / 80 (0.55%) shows
GREEK: 220 (0.00%) words / 1049 (0.00%) chars / 1855 (0.00%) bytes / 31 (0.21%) shows
TAGALOG: 198 (0.00%) words / 1041 (0.00%) chars / 1096 (0.00%) bytes / 160 (1.09%) shows
Japanese: 372 (0.00%) words / 1019 (0.00%) chars / 2289 (0.00%) bytes / 22 (0.15%) shows
WOLOF: 203 (0.00%) words / 1018 (0.00%) chars / 1178 (0.00%) bytes / 162 (1.11%) shows
GUARANI: 179 (0.00%) words / 1009 (0.00%) chars / 1042 (0.00%) bytes / 131 (0.90%) shows
SUNDANESE: 154 (0.00%) words / 942 (0.00%) chars / 1006 (0.00%) bytes / 90 (0.62%) shows
XHOSA: 146 (0.00%) words / 894 (0.00%) chars / 934 (0.00%) bytes / 84 (0.57%) shows
ChineseT: 176 (0.00%) words / 892 (0.00%) chars / 2196 (0.00%) bytes / 19 (0.13%) shows
CROATIAN: 138 (0.00%) words / 826 (0.00%) chars / 837 (0.00%) bytes / 94 (0.64%) shows
PERSIAN: 164 (0.00%) words / 786 (0.00%) chars / 1400 (0.00%) bytes / 9 (0.06%) shows
LINGALA: 117 (0.00%) words / 786 (0.00%) chars / 817 (0.00%) bytes / 79 (0.54%) shows
TSWANA: 165 (0.00%) words / 731 (0.00%) chars / 787 (0.00%) bytes / 122 (0.83%) shows
CEBUANO: 81 (0.00%) words / 702 (0.00%) chars / 725 (0.00%) bytes / 62 (0.42%) shows
KINYARWANDA: 143 (0.00%) words / 700 (0.00%) chars / 721 (0.00%) bytes / 96 (0.66%) shows
PEDI: 115 (0.00%) words / 691 (0.00%) chars / 697 (0.00%) bytes / 98 (0.67%) shows
FIJIAN: 66 (0.00%) words / 492 (0.00%) chars / 519 (0.00%) bytes / 40 (0.27%) shows
WARAY_PHILIPPINES: 68 (0.00%) words / 482 (0.00%) chars / 487 (0.00%) bytes / 39 (0.27%) shows
HAITIAN_CREOLE: 97 (0.00%) words / 477 (0.00%) chars / 493 (0.00%) bytes / 71 (0.49%) shows
HAWAIIAN: 78 (0.00%) words / 432 (0.00%) chars / 454 (0.00%) bytes / 53 (0.36%) shows
OCCITAN: 67 (0.00%) words / 403 (0.00%) chars / 411 (0.00%) bytes / 47 (0.32%) shows
Korean: 75 (0.00%) words / 295 (0.00%) chars / 683 (0.00%) bytes / 10 (0.07%) shows
IGBO: 68 (0.00%) words / 277 (0.00%) chars / 284 (0.00%) bytes / 58 (0.40%) shows
TSONGA: 55 (0.00%) words / 275 (0.00%) chars / 288 (0.00%) bytes / 43 (0.29%) shows
NAURU: 38 (0.00%) words / 261 (0.00%) chars / 265 (0.00%) bytes / 28 (0.19%) shows
THAI: 63 (0.00%) words / 239 (0.00%) chars / 589 (0.00%) bytes / 2 (0.01%) shows
ZULU: 39 (0.00%) words / 226 (0.00%) chars / 241 (0.00%) bytes / 24 (0.16%) shows
ZHUANG: 35 (0.00%) words / 212 (0.00%) chars / 216 (0.00%) bytes / 20 (0.14%) shows
SINHALESE: 34 (0.00%) words / 191 (0.00%) chars / 495 (0.00%) bytes / 1 (0.01%) shows
GANDA: 32 (0.00%) words / 191 (0.00%) chars / 197 (0.00%) bytes / 21 (0.14%) shows
VENDA: 40 (0.00%) words / 138 (0.00%) chars / 143 (0.00%) bytes / 34 (0.23%) shows
SISWANT: 15 (0.00%) words / 104 (0.00%) chars / 106 (0.00%) bytes / 14 (0.10%) shows
NYANJA: 13 (0.00%) words / 95 (0.00%) chars / 95 (0.00%) bytes / 8 (0.05%) shows
BENGALI: 17 (0.00%) words / 91 (0.00%) chars / 123 (0.00%) bytes / 9 (0.06%) shows
BIHARI: 17 (0.00%) words / 75 (0.00%) chars / 177 (0.00%) bytes / 3 (0.02%) shows
KHMER: 8 (0.00%) words / 39 (0.00%) chars / 101 (0.00%) bytes / 1 (0.01%) shows
GUJARATI: 7 (0.00%) words / 29 (0.00%) chars / 75 (0.00%) bytes / 1 (0.01%) shows
SINDHI: 5 (0.00%) words / 22 (0.00%) chars / 31 (0.00%) bytes / 2 (0.01%) shows
MARATHI: 2 (0.00%) words / 16 (0.00%) chars / 40 (0.00%) bytes / 1 (0.01%) shows
SANGO: 1 (0.00%) words / 11 (0.00%) chars / 11 (0.00%) bytes / 1 (0.01%) shows
TOTAL: 97941975 words / 643132321 chars / 1159093605 bytes / 14629 shows
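
For readers who want to work with these figures programmatically, the entries above follow a regular pattern that a regular expression can pick apart, whether they appear one per line or run together. The sketch below is one possible parser; the names `ENTRY_RE` and `parse_entries` are ours, and the sample text is a small excerpt of the list rather than the full data.

```python
import re

# One entry looks like: NAME: W (p%) words / C (p%) chars / B (p%) bytes / S (p%) shows
# The TOTAL line has no percentages, so this pattern deliberately skips it.
ENTRY_RE = re.compile(
    r"(?P<lang>[A-Za-z_]+):\s+"
    r"(?P<words>\d+)\s+\([\d.]+%\)\s+words\s+/\s+"
    r"(?P<chars>\d+)\s+\([\d.]+%\)\s+chars\s+/\s+"
    r"(?P<bytes>\d+)\s+\([\d.]+%\)\s+bytes\s+/\s+"
    r"(?P<shows>\d+)\s+\([\d.]+%\)\s+shows"
)


def parse_entries(text: str) -> list:
    """Extract one dict per language entry, with the raw counts as integers."""
    rows = []
    for match in ENTRY_RE.finditer(text):
        rows.append({
            "lang": match["lang"],
            "words": int(match["words"]),
            "chars": int(match["chars"]),
            "bytes": int(match["bytes"]),
            "shows": int(match["shows"]),
        })
    return rows


# Excerpt of the list above, written as running text to show that line
# breaks between entries do not matter to the parser.
sample = (
    "RUSSIAN: 96545798 (98.57%) words / 633547459 (98.51%) chars / "
    "1142999293 (98.61%) bytes / 14628 (99.99%) shows "
    "ENGLISH: 160576 (0.16%) words / 901489 (0.14%) chars / "
    "921255 (0.08%) bytes / 12029 (82.23%) shows "
    "GEORGIAN: 1022 (0.00%) words / 7214 (0.00%) chars / "
    "18884 (0.00%) bytes / 41 (0.28%) shows "
    "TOTAL: 97941975 words / 643132321 chars / 1159093605 bytes / 14629 shows"
)

rows = parse_entries(sample)
for row in rows:
    # Words per show separates languages that appear as scattered single
    # words from those that appear in longer passages on a few broadcasts.
    print(row["lang"], round(row["words"] / row["shows"], 1), "words per show")
```

A words-per-show ratio is one simple way to read the list: English averages roughly 13 words per broadcast it appears in, consistent with scattered loanwords, while Georgian averages roughly 25 words across just 41 broadcasts, which hints at longer passages.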