Transcribing 2.5M Hours Of TV News: A More Advanced Workflow For Our Russian TV News Channel Language Detection

Last week we unveiled the first results of applying at-scale language detection to our 2.5-million-hour archive of global television news transcripts, built from the Internet Archive's TV News Archive. In that analysis, we applied CLD2 to the Chirp transcripts with strong results, though with rough edges in the form of detection errors. To segment the transcripts into monolingual chunks, we use CLD2's recognition vectors, which attempt to detect the boundaries between different languages in the text and assign each block of text its most likely language. While this offers strong segmentation capabilities, CLD2's language estimate for each chunk can be less accurate than when it is simply run on a single block of text. To explore whether we could improve the results by exploiting CLD2's better performance on single monolingual chunks, we modified our language detection workflow slightly: CLD2 runs once as a segmentation pass, then is rerun on each individual text chunk. Overall, the results show little difference. The two-pass workflow performs slightly better on some edge cases, and its added cost is not prohibitive. The resulting per-language totals for the Russian channel appear below, sorted by character count.
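
To make the two-pass idea concrete, here is a minimal sketch of the control flow only. It is not our production code: `two_pass_detect`, `toy_segment` and `toy_detect_single` are hypothetical names, and the two toy functions are crude script-based stand-ins for CLD2's segmentation pass and its single-block detection, included only so the example runs on its own. In a real workflow both callables would wrap CLD2.

```python
from typing import Callable, List, Tuple

# A chunk as proposed by the segmentation pass: (chunk_text, first_pass_language).
Chunk = Tuple[str, str]


def two_pass_detect(
    text: str,
    segment: Callable[[str], List[Chunk]],
    detect_single: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Segment once, then re-detect each chunk on its own.

    Returns (chunk_text, first_pass_language, second_pass_language) per chunk.
    """
    results = []
    for chunk_text, first_lang in segment(text):
        if not chunk_text.strip():
            continue  # nothing to re-detect in an empty chunk
        second_lang = detect_single(chunk_text)
        results.append((chunk_text, first_lang, second_lang))
    return results


def is_cyrillic(ch: str) -> bool:
    # Basic Cyrillic Unicode block only; enough for this toy example.
    return "\u0400" <= ch <= "\u04FF"


def toy_segment(text: str) -> List[Chunk]:
    """Hypothetical stand-in for the segmentation pass (NOT CLD2).

    Splits on newlines and guesses each chunk's language from the script of
    its first letter, which is deliberately error-prone.
    """
    chunks = []
    for line in text.splitlines():
        first = next((c for c in line if c.isalpha()), "")
        lang = "RUSSIAN" if first and is_cyrillic(first) else "ENGLISH"
        chunks.append((line, lang))
    return chunks


def toy_detect_single(chunk: str) -> str:
    """Hypothetical stand-in for single-block detection (NOT CLD2).

    Labels the chunk by the majority script among its letters.
    """
    letters = [c for c in chunk if c.isalpha()]
    if not letters:
        return "Unknown"
    cyrillic = sum(is_cyrillic(c) for c in letters)
    return "RUSSIAN" if cyrillic * 2 >= len(letters) else "ENGLISH"


if __name__ == "__main__":
    sample = (
        "Добрый вечер\n"
        "BBC сообщает о новых переговорах\n"
        "Breaking news tonight"
    )
    for chunk, first, second in two_pass_detect(sample, toy_segment, toy_detect_single):
        flag = "" if first == second else "  <- changed by second pass"
        print(f"{first:8} {second:8} {chunk}{flag}")
```

Keeping both labels for every chunk makes it easy to count how often the second pass changes the answer, which is the comparison that matters when deciding whether the extra pass is worth its cost.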

RUSSIAN: 97171113 (98.57%) words / 637667261 (98.50%) chars / 1150440600 (98.60%) bytes / 14725 (99.99%) shows
Unknown: 924871 (0.94%) words / 6743790 (1.04%) chars / 12248997 (1.05%) bytes / 12905 (87.63%) shows
SERBIAN: 157167 (0.16%) words / 979254 (0.15%) chars / 1719460 (0.15%) bytes / 8150 (55.34%) shows
ENGLISH: 163237 (0.17%) words / 917167 (0.14%) chars / 937436 (0.08%) bytes / 12126 (82.34%) shows
UKRAINIAN: 32412 (0.03%) words / 202820 (0.03%) chars / 356980 (0.03%) bytes / 1738 (11.80%) shows
GERMAN: 10442 (0.01%) words / 66248 (0.01%) chars / 67249 (0.01%) bytes / 699 (4.75%) shows
LATIN: 6542 (0.01%) words / 63741 (0.01%) chars / 64200 (0.01%) bytes / 3410 (23.16%) shows
BULGARIAN: 8839 (0.01%) words / 58917 (0.01%) chars / 96067 (0.01%) bytes / 1530 (10.39%) shows
BELARUSIAN: 8134 (0.01%) words / 52834 (0.01%) chars / 92709 (0.01%) bytes / 959 (6.51%) shows
ITALIAN: 8084 (0.01%) words / 46521 (0.01%) chars / 47163 (0.00%) bytes / 309 (2.10%) shows
DANISH: 5398 (0.01%) words / 38360 (0.01%) chars / 38932 (0.00%) bytes / 2564 (17.41%) shows
FRENCH: 5651 (0.01%) words / 31860 (0.00%) chars / 32633 (0.00%) bytes / 453 (3.08%) shows
SPANISH: 4077 (0.00%) words / 22539 (0.00%) chars / 23091 (0.00%) bytes / 321 (2.18%) shows
CZECH: 3408 (0.00%) words / 19342 (0.00%) chars / 19658 (0.00%) bytes / 1006 (6.83%) shows
ROMANIAN: 2422 (0.00%) words / 16816 (0.00%) chars / 23683 (0.00%) bytes / 895 (6.08%) shows
MACEDONIAN: 2305 (0.00%) words / 16585 (0.00%) chars / 27233 (0.00%) bytes / 984 (6.68%) shows
BASQUE: 2268 (0.00%) words / 16182 (0.00%) chars / 17030 (0.00%) bytes / 1377 (9.35%) shows
DUTCH: 3452 (0.00%) words / 15899 (0.00%) chars / 16358 (0.00%) bytes / 2244 (15.24%) shows
INDONESIAN: 2393 (0.00%) words / 13246 (0.00%) chars / 13515 (0.00%) bytes / 1346 (9.14%) shows
NORWEGIAN_N: 2155 (0.00%) words / 13001 (0.00%) chars / 13361 (0.00%) bytes / 776 (5.27%) shows
SCOTS: 1588 (0.00%) words / 11725 (0.00%) chars / 11825 (0.00%) bytes / 819 (5.56%) shows
SCOTS_GAELIC: 2575 (0.00%) words / 11193 (0.00%) chars / 11347 (0.00%) bytes / 782 (5.31%) shows
ARABIC: 2020 (0.00%) words / 10920 (0.00%) chars / 19732 (0.00%) bytes / 30 (0.20%) shows
MAURITIAN_CREOLE: 1257 (0.00%) words / 10832 (0.00%) chars / 11444 (0.00%) bytes / 457 (3.10%) shows
FAROESE: 1571 (0.00%) words / 10778 (0.00%) chars / 11102 (0.00%) bytes / 469 (3.18%) shows
UZBEK: 1375 (0.00%) words / 9835 (0.00%) chars / 13260 (0.00%) bytes / 650 (4.41%) shows
TAJIK: 1433 (0.00%) words / 9783 (0.00%) chars / 15135 (0.00%) bytes / 482 (3.27%) shows
LATVIAN: 1194 (0.00%) words / 8400 (0.00%) chars / 8586 (0.00%) bytes / 678 (4.60%) shows
GALICIAN: 1669 (0.00%) words / 8349 (0.00%) chars / 8411 (0.00%) bytes / 1157 (7.86%) shows
X_PIG_LATIN: 1139 (0.00%) words / 8266 (0.00%) chars / 8351 (0.00%) bytes / 675 (4.58%) shows
ESTONIAN: 1165 (0.00%) words / 7720 (0.00%) chars / 8455 (0.00%) bytes / 354 (2.40%) shows
BASHKIR: 1011 (0.00%) words / 7713 (0.00%) chars / 12748 (0.00%) bytes / 504 (3.42%) shows
NORWEGIAN: 1150 (0.00%) words / 7289 (0.00%) chars / 7486 (0.00%) bytes / 746 (5.07%) shows
GEORGIAN: 1022 (0.00%) words / 7214 (0.00%) chars / 18884 (0.00%) bytes / 41 (0.28%) shows
BRETON: 1193 (0.00%) words / 6735 (0.00%) chars / 6845 (0.00%) bytes / 784 (5.32%) shows
LUXEMBOURGISH: 1118 (0.00%) words / 6710 (0.00%) chars / 6886 (0.00%) bytes / 865 (5.87%) shows
VOLAPUK: 1627 (0.00%) words / 6634 (0.00%) chars / 6701 (0.00%) bytes / 490 (3.33%) shows
PORTUGUESE: 1363 (0.00%) words / 6450 (0.00%) chars / 6662 (0.00%) bytes / 627 (4.26%) shows
IRISH: 1712 (0.00%) words / 6441 (0.00%) chars / 6455 (0.00%) bytes / 1084 (7.36%) shows
KHASI: 1161 (0.00%) words / 6199 (0.00%) chars / 6225 (0.00%) bytes / 519 (3.52%) shows
MONGOLIAN: 927 (0.00%) words / 5993 (0.00%) chars / 9127 (0.00%) bytes / 418 (2.84%) shows
TATAR: 945 (0.00%) words / 5802 (0.00%) chars / 7884 (0.00%) bytes / 526 (3.57%) shows
TURKISH: 962 (0.00%) words / 5687 (0.00%) chars / 6160 (0.00%) bytes / 417 (2.83%) shows
HUNGARIAN: 746 (0.00%) words / 5475 (0.00%) chars / 5522 (0.00%) bytes / 435 (2.95%) shows
WELSH: 787 (0.00%) words / 5293 (0.00%) chars / 5524 (0.00%) bytes / 474 (3.22%) shows
SANSKRIT: 557 (0.00%) words / 5286 (0.00%) chars / 5312 (0.00%) bytes / 322 (2.19%) shows
RHAETO_ROMANCE: 1157 (0.00%) words / 5235 (0.00%) chars / 5906 (0.00%) bytes / 860 (5.84%) shows
KYRGYZ: 805 (0.00%) words / 5150 (0.00%) chars / 8005 (0.00%) bytes / 333 (2.26%) shows
POLISH: 694 (0.00%) words / 4789 (0.00%) chars / 5039 (0.00%) bytes / 157 (1.07%) shows
SLOVAK: 715 (0.00%) words / 4710 (0.00%) chars / 5045 (0.00%) bytes / 485 (3.29%) shows
BISLAMA: 548 (0.00%) words / 4329 (0.00%) chars / 4401 (0.00%) bytes / 384 (2.61%) shows
ESPERANTO: 565 (0.00%) words / 4280 (0.00%) chars / 4515 (0.00%) bytes / 359 (2.44%) shows
HAUSA: 571 (0.00%) words / 4104 (0.00%) chars / 4152 (0.00%) bytes / 319 (2.17%) shows
BOSNIAN: 669 (0.00%) words / 4067 (0.00%) chars / 4167 (0.00%) bytes / 31 (0.21%) shows
TURKMEN: 551 (0.00%) words / 4037 (0.00%) chars / 6606 (0.00%) bytes / 267 (1.81%) shows
LITHUANIAN: 597 (0.00%) words / 4012 (0.00%) chars / 4190 (0.00%) bytes / 214 (1.45%) shows
YORUBA: 617 (0.00%) words / 3863 (0.00%) chars / 4637 (0.00%) bytes / 464 (3.15%) shows
SESELWA: 548 (0.00%) words / 3846 (0.00%) chars / 4011 (0.00%) bytes / 298 (2.02%) shows
AKAN: 656 (0.00%) words / 3706 (0.00%) chars / 3729 (0.00%) bytes / 153 (1.04%) shows
MALAGASY: 477 (0.00%) words / 3682 (0.00%) chars / 3745 (0.00%) bytes / 302 (2.05%) shows
RUNDI: 422 (0.00%) words / 3564 (0.00%) chars / 3574 (0.00%) bytes / 254 (1.72%) shows
MAORI: 651 (0.00%) words / 3531 (0.00%) chars / 3787 (0.00%) bytes / 302 (2.05%) shows
KAZAKH: 465 (0.00%) words / 3356 (0.00%) chars / 5888 (0.00%) bytes / 103 (0.70%) shows
ALBANIAN: 348 (0.00%) words / 3236 (0.00%) chars / 3467 (0.00%) bytes / 310 (2.11%) shows
AYMARA: 358 (0.00%) words / 3071 (0.00%) chars / 3255 (0.00%) bytes / 258 (1.75%) shows
MALAY: 460 (0.00%) words / 3012 (0.00%) chars / 3103 (0.00%) bytes / 311 (2.11%) shows
FINNISH: 435 (0.00%) words / 2972 (0.00%) chars / 3054 (0.00%) bytes / 256 (1.74%) shows
ABKHAZIAN: 362 (0.00%) words / 2935 (0.00%) chars / 5123 (0.00%) bytes / 111 (0.75%) shows
ICELANDIC: 239 (0.00%) words / 2924 (0.00%) chars / 3021 (0.00%) bytes / 177 (1.20%) shows
ARMENIAN: 489 (0.00%) words / 2913 (0.00%) chars / 5172 (0.00%) bytes / 17 (0.12%) shows
HMONG: 297 (0.00%) words / 2732 (0.00%) chars / 3177 (0.00%) bytes / 238 (1.62%) shows
SWEDISH: 376 (0.00%) words / 2698 (0.00%) chars / 2730 (0.00%) bytes / 235 (1.60%) shows
SLOVENIAN: 307 (0.00%) words / 2653 (0.00%) chars / 2694 (0.00%) bytes / 231 (1.57%) shows
GREENLANDIC: 279 (0.00%) words / 2419 (0.00%) chars / 2498 (0.00%) bytes / 176 (1.20%) shows
CROATIAN: 395 (0.00%) words / 2403 (0.00%) chars / 2433 (0.00%) bytes / 183 (1.24%) shows
SOMALI: 340 (0.00%) words / 2314 (0.00%) chars / 2408 (0.00%) bytes / 203 (1.38%) shows
AFAR: 387 (0.00%) words / 2265 (0.00%) chars / 2431 (0.00%) bytes / 233 (1.58%) shows
UIGHUR: 298 (0.00%) words / 2150 (0.00%) chars / 3477 (0.00%) bytes / 162 (1.10%) shows
AFRIKAANS: 243 (0.00%) words / 2045 (0.00%) chars / 2064 (0.00%) bytes / 151 (1.03%) shows
SWAHILI: 318 (0.00%) words / 2039 (0.00%) chars / 2085 (0.00%) bytes / 202 (1.37%) shows
Chinese: 971 (0.00%) words / 2004 (0.00%) chars / 3913 (0.00%) bytes / 18 (0.12%) shows
MANX: 275 (0.00%) words / 1850 (0.00%) chars / 1871 (0.00%) bytes / 168 (1.14%) shows
VIETNAMESE: 314 (0.00%) words / 1832 (0.00%) chars / 1977 (0.00%) bytes / 186 (1.26%) shows
INTERLINGUE: 328 (0.00%) words / 1814 (0.00%) chars / 1831 (0.00%) bytes / 234 (1.59%) shows
OROMO: 187 (0.00%) words / 1579 (0.00%) chars / 1608 (0.00%) bytes / 110 (0.75%) shows
MALTESE: 166 (0.00%) words / 1504 (0.00%) chars / 1519 (0.00%) bytes / 118 (0.80%) shows
CATALAN: 221 (0.00%) words / 1463 (0.00%) chars / 1531 (0.00%) bytes / 123 (0.84%) shows
JAVANESE: 196 (0.00%) words / 1461 (0.00%) chars / 1484 (0.00%) bytes / 125 (0.85%) shows
FRISIAN: 170 (0.00%) words / 1439 (0.00%) chars / 1487 (0.00%) bytes / 106 (0.72%) shows
CORSICAN: 216 (0.00%) words / 1438 (0.00%) chars / 1459 (0.00%) bytes / 105 (0.71%) shows
SESOTHO: 235 (0.00%) words / 1432 (0.00%) chars / 1451 (0.00%) bytes / 139 (0.94%) shows
SAMOAN: 217 (0.00%) words / 1354 (0.00%) chars / 1396 (0.00%) bytes / 152 (1.03%) shows
HINDI: 294 (0.00%) words / 1353 (0.00%) chars / 3340 (0.00%) bytes / 15 (0.10%) shows
INUPIAK: 198 (0.00%) words / 1352 (0.00%) chars / 1356 (0.00%) bytes / 127 (0.86%) shows
QUECHUA: 199 (0.00%) words / 1330 (0.00%) chars / 1358 (0.00%) bytes / 133 (0.90%) shows
HEBREW: 234 (0.00%) words / 1304 (0.00%) chars / 2282 (0.00%) bytes / 8 (0.05%) shows
TONGA: 164 (0.00%) words / 1276 (0.00%) chars / 1351 (0.00%) bytes / 120 (0.81%) shows
X_KLINGON: 184 (0.00%) words / 1234 (0.00%) chars / 1256 (0.00%) bytes / 130 (0.88%) shows
AZERBAIJANI: 185 (0.00%) words / 1233 (0.00%) chars / 1394 (0.00%) bytes / 46 (0.31%) shows
SHONA: 207 (0.00%) words / 1201 (0.00%) chars / 1246 (0.00%) bytes / 143 (0.97%) shows
INTERLINGUA: 180 (0.00%) words / 1119 (0.00%) chars / 1140 (0.00%) bytes / 81 (0.55%) shows
GREEK: 220 (0.00%) words / 1049 (0.00%) chars / 1855 (0.00%) bytes / 31 (0.21%) shows
TAGALOG: 199 (0.00%) words / 1046 (0.00%) chars / 1101 (0.00%) bytes / 161 (1.09%) shows
Japanese: 372 (0.00%) words / 1019 (0.00%) chars / 2289 (0.00%) bytes / 22 (0.15%) shows
WOLOF: 203 (0.00%) words / 1018 (0.00%) chars / 1178 (0.00%) bytes / 162 (1.10%) shows
GUARANI: 180 (0.00%) words / 1014 (0.00%) chars / 1047 (0.00%) bytes / 132 (0.90%) shows
SUNDANESE: 154 (0.00%) words / 942 (0.00%) chars / 1006 (0.00%) bytes / 90 (0.61%) shows
ChineseT: 176 (0.00%) words / 892 (0.00%) chars / 2196 (0.00%) bytes / 19 (0.13%) shows
XHOSA: 136 (0.00%) words / 847 (0.00%) chars / 884 (0.00%) bytes / 82 (0.56%) shows
LINGALA: 118 (0.00%) words / 794 (0.00%) chars / 825 (0.00%) bytes / 80 (0.54%) shows
PERSIAN: 164 (0.00%) words / 786 (0.00%) chars / 1400 (0.00%) bytes / 9 (0.06%) shows
TSWANA: 165 (0.00%) words / 731 (0.00%) chars / 787 (0.00%) bytes / 122 (0.83%) shows
CEBUANO: 81 (0.00%) words / 702 (0.00%) chars / 725 (0.00%) bytes / 62 (0.42%) shows
PEDI: 115 (0.00%) words / 691 (0.00%) chars / 697 (0.00%) bytes / 98 (0.67%) shows
KINYARWANDA: 136 (0.00%) words / 624 (0.00%) chars / 636 (0.00%) bytes / 93 (0.63%) shows
HAITIAN_CREOLE: 99 (0.00%) words / 497 (0.00%) chars / 517 (0.00%) bytes / 73 (0.50%) shows
FIJIAN: 66 (0.00%) words / 492 (0.00%) chars / 519 (0.00%) bytes / 40 (0.27%) shows
WARAY_PHILIPPINES: 68 (0.00%) words / 482 (0.00%) chars / 487 (0.00%) bytes / 39 (0.26%) shows
HAWAIIAN: 78 (0.00%) words / 432 (0.00%) chars / 454 (0.00%) bytes / 53 (0.36%) shows
OCCITAN: 67 (0.00%) words / 403 (0.00%) chars / 411 (0.00%) bytes / 47 (0.32%) shows
Korean: 75 (0.00%) words / 295 (0.00%) chars / 683 (0.00%) bytes / 10 (0.07%) shows
IGBO: 68 (0.00%) words / 277 (0.00%) chars / 284 (0.00%) bytes / 58 (0.39%) shows
TSONGA: 55 (0.00%) words / 275 (0.00%) chars / 288 (0.00%) bytes / 43 (0.29%) shows
NAURU: 38 (0.00%) words / 261 (0.00%) chars / 265 (0.00%) bytes / 28 (0.19%) shows
THAI: 63 (0.00%) words / 239 (0.00%) chars / 589 (0.00%) bytes / 2 (0.01%) shows
ZHUANG: 35 (0.00%) words / 212 (0.00%) chars / 216 (0.00%) bytes / 20 (0.14%) shows
GANDA: 32 (0.00%) words / 191 (0.00%) chars / 197 (0.00%) bytes / 21 (0.14%) shows
SINHALESE: 34 (0.00%) words / 191 (0.00%) chars / 495 (0.00%) bytes / 1 (0.01%) shows
ZULU: 32 (0.00%) words / 188 (0.00%) chars / 202 (0.00%) bytes / 24 (0.16%) shows
VENDA: 40 (0.00%) words / 138 (0.00%) chars / 143 (0.00%) bytes / 34 (0.23%) shows
SISWANT: 15 (0.00%) words / 104 (0.00%) chars / 106 (0.00%) bytes / 14 (0.10%) shows
NYANJA: 13 (0.00%) words / 95 (0.00%) chars / 95 (0.00%) bytes / 8 (0.05%) shows
BENGALI: 17 (0.00%) words / 91 (0.00%) chars / 123 (0.00%) bytes / 9 (0.06%) shows
BIHARI: 17 (0.00%) words / 75 (0.00%) chars / 177 (0.00%) bytes / 3 (0.02%) shows
KHMER: 8 (0.00%) words / 39 (0.00%) chars / 101 (0.00%) bytes / 1 (0.01%) shows
GUJARATI: 7 (0.00%) words / 29 (0.00%) chars / 75 (0.00%) bytes / 1 (0.01%) shows
SINDHI: 5 (0.00%) words / 22 (0.00%) chars / 31 (0.00%) bytes / 2 (0.01%) shows
MARATHI: 2 (0.00%) words / 16 (0.00%) chars / 40 (0.00%) bytes / 1 (0.01%) shows
SANGO: 1 (0.00%) words / 11 (0.00%) chars / 11 (0.00%) bytes / 1 (0.01%) shows
TOTAL: 98582971 words / 647367929 chars / 1166735398 bytes / 14726 shows
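
For readers who want to work with the listing above programmatically, the following is a minimal sketch that parses lines in this format and recomputes each language's share against the TOTAL line. The names `Row`, `parse_listing` and `share` are hypothetical helpers introduced here for illustration, and the regular expression assumes exactly the line layout shown above.

```python
import re
from typing import Dict, NamedTuple


class Row(NamedTuple):
    words: int
    chars: int
    nbytes: int
    shows: int


# Matches both language lines (with percentages) and the TOTAL line (without).
ROW_RE = re.compile(
    r"^(?P<lang>\S+): (?P<words>\d+)(?: \([\d.]+%\))? words / "
    r"(?P<chars>\d+)(?: \([\d.]+%\))? chars / "
    r"(?P<nbytes>\d+)(?: \([\d.]+%\))? bytes / "
    r"(?P<shows>\d+)(?: \([\d.]+%\))? shows$"
)


def parse_listing(text: str) -> Dict[str, Row]:
    """Parse the listing into {language: Row}; non-matching lines are skipped."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows[m["lang"]] = Row(*(int(m[k]) for k in Row._fields))
    return rows


def share(rows: Dict[str, Row], lang: str, field: str) -> float:
    """Percentage of the TOTAL row's `field` accounted for by `lang`."""
    total = getattr(rows["TOTAL"], field)
    return round(100.0 * getattr(rows[lang], field) / total, 2)


# Three lines copied verbatim from the listing above.
SAMPLE = """\
RUSSIAN: 97171113 (98.57%) words / 637667261 (98.50%) chars / 1150440600 (98.60%) bytes / 14725 (99.99%) shows
ENGLISH: 163237 (0.17%) words / 917167 (0.14%) chars / 937436 (0.08%) bytes / 12126 (82.34%) shows
TOTAL: 98582971 words / 647367929 chars / 1166735398 bytes / 14726 shows
"""

if __name__ == "__main__":
    rows = parse_listing(SAMPLE)
    for lang in ("RUSSIAN", "ENGLISH"):
        print(lang, {f: share(rows, lang, f) for f in Row._fields})
```

Note that the "shows" percentages are not expected to sum to 100%, since a single show can contain chunks in several languages and is counted once for each.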