Last week we unveiled the first results of applying language detection at scale to our 2.5-million-hour archive of global television news transcripts of the Internet Archive's TV News Archive. In that analysis, we applied CLD2 to the Chirp transcripts with strong results, but with rough edges around detection errors. To segment the transcripts into monolingual chunks, we use CLD2's recognition vectors, which attempt to detect the boundaries between different languages in the text and assign each block of text its most likely language. While this mode offers strong segmentation capabilities, its language estimate for each chunk can be less accurate than the estimate CLD2 produces when it is simply run on a single block of text. To explore whether we could improve the results by exploiting CLD2's better performance on single monolingual chunks, we modified our language detection workflow slightly to run CLD2 in a first segmentation pass and then rerun it on each individual text chunk. Overall, the results show little difference, though the two-pass approach does perform slightly better on some edge cases, and its added cost is not prohibitive.
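The two-pass workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production code: `two_pass_detect`, `toy_segment` and `toy_detect` are hypothetical names, and the two toy functions are stand-ins for the CLD2 segmentation pass and the CLD2 single-block pass respectively. The segmenter is assumed to return (byte offset, byte length, language) tuples, so chunks are sliced from the UTF-8 bytes rather than from the character string.

```python
from typing import Callable, List, Tuple

# (byte_offset, byte_length, language_name): assumed shape of one segmentation vector
Vector = Tuple[int, int, str]


def two_pass_detect(
    text: str,
    segment: Callable[[str], List[Vector]],
    detect_single: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (chunk_text, first_pass_language, second_pass_language) per chunk."""
    raw = text.encode("utf-8")  # offsets are in bytes, so slice the encoded text
    results = []
    for offset, length, first_lang in segment(text):
        chunk = raw[offset:offset + length].decode("utf-8", errors="ignore")
        if not chunk.strip():
            continue  # skip empty or whitespace-only chunks
        # Second pass: rerun detection on the chunk as a single block of text.
        results.append((chunk, first_lang, detect_single(chunk)))
    return results


def toy_segment(text: str) -> List[Vector]:
    """Stand-in segmenter: one vector per line, always labelled 'Unknown'."""
    vectors, offset = [], 0
    for line in text.split("\n"):
        length = len(line.encode("utf-8"))
        vectors.append((offset, length, "Unknown"))
        offset += length + 1  # +1 for the newline byte
    return vectors


def toy_detect(chunk: str) -> str:
    """Stand-in detector: any Cyrillic letter means RUSSIAN, otherwise ENGLISH."""
    is_cyrillic = any("\u0400" <= c <= "\u04ff" for c in chunk)
    return "RUSSIAN" if is_cyrillic else "ENGLISH"


if __name__ == "__main__":
    for chunk, first, second in two_pass_detect(
        "Привет мир\nhello world", toy_segment, toy_detect
    ):
        print(f"{first} -> {second}: {chunk}")
```

In a real pipeline the two callables would wrap the CLD2 bindings in use. Keeping both labels for every chunk makes it straightforward to count how often the second pass changes the first-pass estimate.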
RUSSIAN: 97171113 (98.57%) words / 637667261 (98.50%) chars / 1150440600 (98.60%) bytes / 14725 (99.99%) shows
Unknown: 924871 (0.94%) words / 6743790 (1.04%) chars / 12248997 (1.05%) bytes / 12905 (87.63%) shows
SERBIAN: 157167 (0.16%) words / 979254 (0.15%) chars / 1719460 (0.15%) bytes / 8150 (55.34%) shows
ENGLISH: 163237 (0.17%) words / 917167 (0.14%) chars / 937436 (0.08%) bytes / 12126 (82.34%) shows
UKRAINIAN: 32412 (0.03%) words / 202820 (0.03%) chars / 356980 (0.03%) bytes / 1738 (11.80%) shows
GERMAN: 10442 (0.01%) words / 66248 (0.01%) chars / 67249 (0.01%) bytes / 699 (4.75%) shows
LATIN: 6542 (0.01%) words / 63741 (0.01%) chars / 64200 (0.01%) bytes / 3410 (23.16%) shows
BULGARIAN: 8839 (0.01%) words / 58917 (0.01%) chars / 96067 (0.01%) bytes / 1530 (10.39%) shows
BELARUSIAN: 8134 (0.01%) words / 52834 (0.01%) chars / 92709 (0.01%) bytes / 959 (6.51%) shows
ITALIAN: 8084 (0.01%) words / 46521 (0.01%) chars / 47163 (0.00%) bytes / 309 (2.10%) shows
DANISH: 5398 (0.01%) words / 38360 (0.01%) chars / 38932 (0.00%) bytes / 2564 (17.41%) shows
FRENCH: 5651 (0.01%) words / 31860 (0.00%) chars / 32633 (0.00%) bytes / 453 (3.08%) shows
SPANISH: 4077 (0.00%) words / 22539 (0.00%) chars / 23091 (0.00%) bytes / 321 (2.18%) shows
CZECH: 3408 (0.00%) words / 19342 (0.00%) chars / 19658 (0.00%) bytes / 1006 (6.83%) shows
ROMANIAN: 2422 (0.00%) words / 16816 (0.00%) chars / 23683 (0.00%) bytes / 895 (6.08%) shows
MACEDONIAN: 2305 (0.00%) words / 16585 (0.00%) chars / 27233 (0.00%) bytes / 984 (6.68%) shows
BASQUE: 2268 (0.00%) words / 16182 (0.00%) chars / 17030 (0.00%) bytes / 1377 (9.35%) shows
DUTCH: 3452 (0.00%) words / 15899 (0.00%) chars / 16358 (0.00%) bytes / 2244 (15.24%) shows
INDONESIAN: 2393 (0.00%) words / 13246 (0.00%) chars / 13515 (0.00%) bytes / 1346 (9.14%) shows
NORWEGIAN_N: 2155 (0.00%) words / 13001 (0.00%) chars / 13361 (0.00%) bytes / 776 (5.27%) shows
SCOTS: 1588 (0.00%) words / 11725 (0.00%) chars / 11825 (0.00%) bytes / 819 (5.56%) shows
SCOTS_GAELIC: 2575 (0.00%) words / 11193 (0.00%) chars / 11347 (0.00%) bytes / 782 (5.31%) shows
ARABIC: 2020 (0.00%) words / 10920 (0.00%) chars / 19732 (0.00%) bytes / 30 (0.20%) shows
MAURITIAN_CREOLE: 1257 (0.00%) words / 10832 (0.00%) chars / 11444 (0.00%) bytes / 457 (3.10%) shows
FAROESE: 1571 (0.00%) words / 10778 (0.00%) chars / 11102 (0.00%) bytes / 469 (3.18%) shows
UZBEK: 1375 (0.00%) words / 9835 (0.00%) chars / 13260 (0.00%) bytes / 650 (4.41%) shows
TAJIK: 1433 (0.00%) words / 9783 (0.00%) chars / 15135 (0.00%) bytes / 482 (3.27%) shows
LATVIAN: 1194 (0.00%) words / 8400 (0.00%) chars / 8586 (0.00%) bytes / 678 (4.60%) shows
GALICIAN: 1669 (0.00%) words / 8349 (0.00%) chars / 8411 (0.00%) bytes / 1157 (7.86%) shows
X_PIG_LATIN: 1139 (0.00%) words / 8266 (0.00%) chars / 8351 (0.00%) bytes / 675 (4.58%) shows
ESTONIAN: 1165 (0.00%) words / 7720 (0.00%) chars / 8455 (0.00%) bytes / 354 (2.40%) shows
BASHKIR: 1011 (0.00%) words / 7713 (0.00%) chars / 12748 (0.00%) bytes / 504 (3.42%) shows
NORWEGIAN: 1150 (0.00%) words / 7289 (0.00%) chars / 7486 (0.00%) bytes / 746 (5.07%) shows
GEORGIAN: 1022 (0.00%) words / 7214 (0.00%) chars / 18884 (0.00%) bytes / 41 (0.28%) shows
BRETON: 1193 (0.00%) words / 6735 (0.00%) chars / 6845 (0.00%) bytes / 784 (5.32%) shows
LUXEMBOURGISH: 1118 (0.00%) words / 6710 (0.00%) chars / 6886 (0.00%) bytes / 865 (5.87%) shows
VOLAPUK: 1627 (0.00%) words / 6634 (0.00%) chars / 6701 (0.00%) bytes / 490 (3.33%) shows
PORTUGUESE: 1363 (0.00%) words / 6450 (0.00%) chars / 6662 (0.00%) bytes / 627 (4.26%) shows
IRISH: 1712 (0.00%) words / 6441 (0.00%) chars / 6455 (0.00%) bytes / 1084 (7.36%) shows
KHASI: 1161 (0.00%) words / 6199 (0.00%) chars / 6225 (0.00%) bytes / 519 (3.52%) shows
MONGOLIAN: 927 (0.00%) words / 5993 (0.00%) chars / 9127 (0.00%) bytes / 418 (2.84%) shows
TATAR: 945 (0.00%) words / 5802 (0.00%) chars / 7884 (0.00%) bytes / 526 (3.57%) shows
TURKISH: 962 (0.00%) words / 5687 (0.00%) chars / 6160 (0.00%) bytes / 417 (2.83%) shows
HUNGARIAN: 746 (0.00%) words / 5475 (0.00%) chars / 5522 (0.00%) bytes / 435 (2.95%) shows
WELSH: 787 (0.00%) words / 5293 (0.00%) chars / 5524 (0.00%) bytes / 474 (3.22%) shows
SANSKRIT: 557 (0.00%) words / 5286 (0.00%) chars / 5312 (0.00%) bytes / 322 (2.19%) shows
RHAETO_ROMANCE: 1157 (0.00%) words / 5235 (0.00%) chars / 5906 (0.00%) bytes / 860 (5.84%) shows
KYRGYZ: 805 (0.00%) words / 5150 (0.00%) chars / 8005 (0.00%) bytes / 333 (2.26%) shows
POLISH: 694 (0.00%) words / 4789 (0.00%) chars / 5039 (0.00%) bytes / 157 (1.07%) shows
SLOVAK: 715 (0.00%) words / 4710 (0.00%) chars / 5045 (0.00%) bytes / 485 (3.29%) shows
BISLAMA: 548 (0.00%) words / 4329 (0.00%) chars / 4401 (0.00%) bytes / 384 (2.61%) shows
ESPERANTO: 565 (0.00%) words / 4280 (0.00%) chars / 4515 (0.00%) bytes / 359 (2.44%) shows
HAUSA: 571 (0.00%) words / 4104 (0.00%) chars / 4152 (0.00%) bytes / 319 (2.17%) shows
BOSNIAN: 669 (0.00%) words / 4067 (0.00%) chars / 4167 (0.00%) bytes / 31 (0.21%) shows
TURKMEN: 551 (0.00%) words / 4037 (0.00%) chars / 6606 (0.00%) bytes / 267 (1.81%) shows
LITHUANIAN: 597 (0.00%) words / 4012 (0.00%) chars / 4190 (0.00%) bytes / 214 (1.45%) shows
YORUBA: 617 (0.00%) words / 3863 (0.00%) chars / 4637 (0.00%) bytes / 464 (3.15%) shows
SESELWA: 548 (0.00%) words / 3846 (0.00%) chars / 4011 (0.00%) bytes / 298 (2.02%) shows
AKAN: 656 (0.00%) words / 3706 (0.00%) chars / 3729 (0.00%) bytes / 153 (1.04%) shows
MALAGASY: 477 (0.00%) words / 3682 (0.00%) chars / 3745 (0.00%) bytes / 302 (2.05%) shows
RUNDI: 422 (0.00%) words / 3564 (0.00%) chars / 3574 (0.00%) bytes / 254 (1.72%) shows
MAORI: 651 (0.00%) words / 3531 (0.00%) chars / 3787 (0.00%) bytes / 302 (2.05%) shows
KAZAKH: 465 (0.00%) words / 3356 (0.00%) chars / 5888 (0.00%) bytes / 103 (0.70%) shows
ALBANIAN: 348 (0.00%) words / 3236 (0.00%) chars / 3467 (0.00%) bytes / 310 (2.11%) shows
AYMARA: 358 (0.00%) words / 3071 (0.00%) chars / 3255 (0.00%) bytes / 258 (1.75%) shows
MALAY: 460 (0.00%) words / 3012 (0.00%) chars / 3103 (0.00%) bytes / 311 (2.11%) shows
FINNISH: 435 (0.00%) words / 2972 (0.00%) chars / 3054 (0.00%) bytes / 256 (1.74%) shows
ABKHAZIAN: 362 (0.00%) words / 2935 (0.00%) chars / 5123 (0.00%) bytes / 111 (0.75%) shows
ICELANDIC: 239 (0.00%) words / 2924 (0.00%) chars / 3021 (0.00%) bytes / 177 (1.20%) shows
ARMENIAN: 489 (0.00%) words / 2913 (0.00%) chars / 5172 (0.00%) bytes / 17 (0.12%) shows
HMONG: 297 (0.00%) words / 2732 (0.00%) chars / 3177 (0.00%) bytes / 238 (1.62%) shows
SWEDISH: 376 (0.00%) words / 2698 (0.00%) chars / 2730 (0.00%) bytes / 235 (1.60%) shows
SLOVENIAN: 307 (0.00%) words / 2653 (0.00%) chars / 2694 (0.00%) bytes / 231 (1.57%) shows
GREENLANDIC: 279 (0.00%) words / 2419 (0.00%) chars / 2498 (0.00%) bytes / 176 (1.20%) shows
CROATIAN: 395 (0.00%) words / 2403 (0.00%) chars / 2433 (0.00%) bytes / 183 (1.24%) shows
SOMALI: 340 (0.00%) words / 2314 (0.00%) chars / 2408 (0.00%) bytes / 203 (1.38%) shows
AFAR: 387 (0.00%) words / 2265 (0.00%) chars / 2431 (0.00%) bytes / 233 (1.58%) shows
UIGHUR: 298 (0.00%) words / 2150 (0.00%) chars / 3477 (0.00%) bytes / 162 (1.10%) shows
AFRIKAANS: 243 (0.00%) words / 2045 (0.00%) chars / 2064 (0.00%) bytes / 151 (1.03%) shows
SWAHILI: 318 (0.00%) words / 2039 (0.00%) chars / 2085 (0.00%) bytes / 202 (1.37%) shows
Chinese: 971 (0.00%) words / 2004 (0.00%) chars / 3913 (0.00%) bytes / 18 (0.12%) shows
MANX: 275 (0.00%) words / 1850 (0.00%) chars / 1871 (0.00%) bytes / 168 (1.14%) shows
VIETNAMESE: 314 (0.00%) words / 1832 (0.00%) chars / 1977 (0.00%) bytes / 186 (1.26%) shows
INTERLINGUE: 328 (0.00%) words / 1814 (0.00%) chars / 1831 (0.00%) bytes / 234 (1.59%) shows
OROMO: 187 (0.00%) words / 1579 (0.00%) chars / 1608 (0.00%) bytes / 110 (0.75%) shows
MALTESE: 166 (0.00%) words / 1504 (0.00%) chars / 1519 (0.00%) bytes / 118 (0.80%) shows
CATALAN: 221 (0.00%) words / 1463 (0.00%) chars / 1531 (0.00%) bytes / 123 (0.84%) shows
JAVANESE: 196 (0.00%) words / 1461 (0.00%) chars / 1484 (0.00%) bytes / 125 (0.85%) shows
FRISIAN: 170 (0.00%) words / 1439 (0.00%) chars / 1487 (0.00%) bytes / 106 (0.72%) shows
CORSICAN: 216 (0.00%) words / 1438 (0.00%) chars / 1459 (0.00%) bytes / 105 (0.71%) shows
SESOTHO: 235 (0.00%) words / 1432 (0.00%) chars / 1451 (0.00%) bytes / 139 (0.94%) shows
SAMOAN: 217 (0.00%) words / 1354 (0.00%) chars / 1396 (0.00%) bytes / 152 (1.03%) shows
HINDI: 294 (0.00%) words / 1353 (0.00%) chars / 3340 (0.00%) bytes / 15 (0.10%) shows
INUPIAK: 198 (0.00%) words / 1352 (0.00%) chars / 1356 (0.00%) bytes / 127 (0.86%) shows
QUECHUA: 199 (0.00%) words / 1330 (0.00%) chars / 1358 (0.00%) bytes / 133 (0.90%) shows
HEBREW: 234 (0.00%) words / 1304 (0.00%) chars / 2282 (0.00%) bytes / 8 (0.05%) shows
TONGA: 164 (0.00%) words / 1276 (0.00%) chars / 1351 (0.00%) bytes / 120 (0.81%) shows
X_KLINGON: 184 (0.00%) words / 1234 (0.00%) chars / 1256 (0.00%) bytes / 130 (0.88%) shows
AZERBAIJANI: 185 (0.00%) words / 1233 (0.00%) chars / 1394 (0.00%) bytes / 46 (0.31%) shows
SHONA: 207 (0.00%) words / 1201 (0.00%) chars / 1246 (0.00%) bytes / 143 (0.97%) shows
INTERLINGUA: 180 (0.00%) words / 1119 (0.00%) chars / 1140 (0.00%) bytes / 81 (0.55%) shows
GREEK: 220 (0.00%) words / 1049 (0.00%) chars / 1855 (0.00%) bytes / 31 (0.21%) shows
TAGALOG: 199 (0.00%) words / 1046 (0.00%) chars / 1101 (0.00%) bytes / 161 (1.09%) shows
Japanese: 372 (0.00%) words / 1019 (0.00%) chars / 2289 (0.00%) bytes / 22 (0.15%) shows
WOLOF: 203 (0.00%) words / 1018 (0.00%) chars / 1178 (0.00%) bytes / 162 (1.10%) shows
GUARANI: 180 (0.00%) words / 1014 (0.00%) chars / 1047 (0.00%) bytes / 132 (0.90%) shows
SUNDANESE: 154 (0.00%) words / 942 (0.00%) chars / 1006 (0.00%) bytes / 90 (0.61%) shows
ChineseT: 176 (0.00%) words / 892 (0.00%) chars / 2196 (0.00%) bytes / 19 (0.13%) shows
XHOSA: 136 (0.00%) words / 847 (0.00%) chars / 884 (0.00%) bytes / 82 (0.56%) shows
LINGALA: 118 (0.00%) words / 794 (0.00%) chars / 825 (0.00%) bytes / 80 (0.54%) shows
PERSIAN: 164 (0.00%) words / 786 (0.00%) chars / 1400 (0.00%) bytes / 9 (0.06%) shows
TSWANA: 165 (0.00%) words / 731 (0.00%) chars / 787 (0.00%) bytes / 122 (0.83%) shows
CEBUANO: 81 (0.00%) words / 702 (0.00%) chars / 725 (0.00%) bytes / 62 (0.42%) shows
PEDI: 115 (0.00%) words / 691 (0.00%) chars / 697 (0.00%) bytes / 98 (0.67%) shows
KINYARWANDA: 136 (0.00%) words / 624 (0.00%) chars / 636 (0.00%) bytes / 93 (0.63%) shows
HAITIAN_CREOLE: 99 (0.00%) words / 497 (0.00%) chars / 517 (0.00%) bytes / 73 (0.50%) shows
FIJIAN: 66 (0.00%) words / 492 (0.00%) chars / 519 (0.00%) bytes / 40 (0.27%) shows
WARAY_PHILIPPINES: 68 (0.00%) words / 482 (0.00%) chars / 487 (0.00%) bytes / 39 (0.26%) shows
HAWAIIAN: 78 (0.00%) words / 432 (0.00%) chars / 454 (0.00%) bytes / 53 (0.36%) shows
OCCITAN: 67 (0.00%) words / 403 (0.00%) chars / 411 (0.00%) bytes / 47 (0.32%) shows
Korean: 75 (0.00%) words / 295 (0.00%) chars / 683 (0.00%) bytes / 10 (0.07%) shows
IGBO: 68 (0.00%) words / 277 (0.00%) chars / 284 (0.00%) bytes / 58 (0.39%) shows
TSONGA: 55 (0.00%) words / 275 (0.00%) chars / 288 (0.00%) bytes / 43 (0.29%) shows
NAURU: 38 (0.00%) words / 261 (0.00%) chars / 265 (0.00%) bytes / 28 (0.19%) shows
THAI: 63 (0.00%) words / 239 (0.00%) chars / 589 (0.00%) bytes / 2 (0.01%) shows
ZHUANG: 35 (0.00%) words / 212 (0.00%) chars / 216 (0.00%) bytes / 20 (0.14%) shows
GANDA: 32 (0.00%) words / 191 (0.00%) chars / 197 (0.00%) bytes / 21 (0.14%) shows
SINHALESE: 34 (0.00%) words / 191 (0.00%) chars / 495 (0.00%) bytes / 1 (0.01%) shows
ZULU: 32 (0.00%) words / 188 (0.00%) chars / 202 (0.00%) bytes / 24 (0.16%) shows
VENDA: 40 (0.00%) words / 138 (0.00%) chars / 143 (0.00%) bytes / 34 (0.23%) shows
SISWANT: 15 (0.00%) words / 104 (0.00%) chars / 106 (0.00%) bytes / 14 (0.10%) shows
NYANJA: 13 (0.00%) words / 95 (0.00%) chars / 95 (0.00%) bytes / 8 (0.05%) shows
BENGALI: 17 (0.00%) words / 91 (0.00%) chars / 123 (0.00%) bytes / 9 (0.06%) shows
BIHARI: 17 (0.00%) words / 75 (0.00%) chars / 177 (0.00%) bytes / 3 (0.02%) shows
KHMER: 8 (0.00%) words / 39 (0.00%) chars / 101 (0.00%) bytes / 1 (0.01%) shows
GUJARATI: 7 (0.00%) words / 29 (0.00%) chars / 75 (0.00%) bytes / 1 (0.01%) shows
SINDHI: 5 (0.00%) words / 22 (0.00%) chars / 31 (0.00%) bytes / 2 (0.01%) shows
MARATHI: 2 (0.00%) words / 16 (0.00%) chars / 40 (0.00%) bytes / 1 (0.01%) shows
SANGO: 1 (0.00%) words / 11 (0.00%) chars / 11 (0.00%) bytes / 1 (0.01%) shows
TOTAL: 98582971 words / 647367929 chars / 1166735398 bytes / 14726 shows
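For readers who want to build a breakdown in the same shape as the one above, the following is one possible way to tally per-language words, characters, bytes and shows from labelled chunks. It is a sketch under assumptions rather than the code behind these numbers: `tally` and `format_row` are hypothetical names, words are assumed to be whitespace-delimited tokens (which will not match however scripts without spaces were actually counted), and bytes are assumed to be UTF-8 bytes.

```python
from collections import defaultdict

FIELDS = ("words", "chars", "bytes", "shows")


def tally(shows):
    """shows: iterable of per-show lists of (language, chunk_text) pairs."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    n_shows = 0
    for chunks in shows:
        n_shows += 1
        for lang, text in chunks:
            row = totals[lang]
            row["words"] += len(text.split())          # assumption: whitespace tokens
            row["chars"] += len(text)                  # Unicode code points
            row["bytes"] += len(text.encode("utf-8"))  # assumption: UTF-8 bytes
        # A show counts once per language, however many chunks it contributes.
        for lang in {lang for lang, _ in chunks}:
            totals[lang]["shows"] += 1
    return dict(totals), n_shows


def format_row(lang, row, totals, n_shows):
    """Render one language as 'LANG: N (P%) words / ... / N (P%) shows'."""
    grand = {f: sum(r[f] for r in totals.values()) for f in FIELDS[:3]}
    grand["shows"] = n_shows  # show percentages are relative to all shows
    parts = [
        f"{row[f]} ({100.0 * row[f] / max(grand[f], 1):.2f}%) {f}" for f in FIELDS
    ]
    return f"{lang}: " + " / ".join(parts)


if __name__ == "__main__":
    demo = [
        [("RUSSIAN", "Привет мир"), ("ENGLISH", "hello")],
        [("RUSSIAN", "да")],
    ]
    totals, n_shows = tally(demo)
    for lang in sorted(totals, key=lambda k: -totals[k]["chars"]):
        print(format_row(lang, totals[lang], totals, n_shows))
```

Because a single show can contain chunks in several languages, the show percentages are computed against the total number of shows and do not sum to 100%, which is why both Russian and Unknown can appear in the large majority of shows at once. The gap between the character and byte columns reflects multi-byte UTF-8 encoding of non-Latin scripts such as Cyrillic.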