We are immensely excited to announce the unveiling of the first major piece of Translingual 2.0 and the new GDELT Language Services (GLS) infrastructure: a massive new language detection infrastructure that can identify more than 440 language-script combinations from across the world, ranging from international lingua francas to nearly extinct local dialects.
Since GDELT Translingual's inception in 2014, we have relied upon Google's open CLD2 language detection library for all of our language identification tasks. CLD2 supports 83 core languages, while we run it in extended mode, which expands it to 161 languages (175 language-script combinations) with an additional 65 script-to-language mappings, for a total of 240 language-script combinations.
While this served us well early on, as GDELT has continued to reach ever more locally across the globe, we've been increasingly limited by the selection of languages supported by CLD2, its reduced accuracy in extended mode, the inability to readily extend it to new languages and the limitations of its minified quadgram+octagram architecture that was designed to optimize for minimal memory consumption over language coverage. While CLD2 is a superb tool for the overwhelming majority of traditional language detection applications and highly accurate on the majority of languages that most users will ever encounter, GDELT's unique global lens and increasing push towards local news in long-tail languages and our need for larger language models to increase differentiation precision means we needed a fundamentally new approach and architecture to language detection. Most importantly, as we increasingly work with local language communities and NGOs to acquire or build machine translation resources for these long-tail languages, we need language detection services that can be updated in lock-step. After all, to deploy a translation model, we first need a language detection model that can robustly identify content in that language so we can forward it for translation.
Today we are incredibly excited to unveil the first results of our massive new initiative around global languages: our first inhouse language detection infrastructure that supports more than 440 language-script combinations and counting. The initial set of languages that have passed our final evaluation tests over the past few months can be seen below. We will be deploying this infrastructure internally in the coming weeks to evaluate it at production scale over time to gather final accuracy and differentiation performance statistics before completing the transition of our language processing workflows over to it in the transition to GDELT 3.0. During this process some of the languages below may change as we evaluate production performance over time, so consider this a working list of languages that have passed our initial tests.
Enormous effort has gone into compiling, researching and purifying training datasets for each of these languages, endless experimentation evaluating the right mix of sample data yielding the greatest differentiation from similar languages and truest representation of each language and research, experimentation and fine-tuning of a vast landscape of different modeling approaches to develop our final architecture that achieves the best accuracy and highest inference performance. The final inference process was designed to be extremely efficient with runtime linear to the article length, runs on ordinary CPUs and can be implemented in a highly optimized pipeline in any programming language and trades memory use for computational complexity, yielding a highly efficient runtime.
We are tremendously excited about this first step towards Translingual 2.0 and will be sharing a steady stream of updates in the weeks to come!
- Abkhazian (ab)
- Achinese (ace)
- Adangme (ada)
- Adilabad Gondi (Gunjala Gondi script) (wsg–gunjala_gondi)
- Adyghe (ady)
- Afrikaans (af)
- Ahom (Ahom script) (aho–ahom)
- Akan (ak)
- Albanian (sq)
- Albanian (Elbasan script) (sq–elbasan)
- Albanian (Vithkuqi script) (sq–vithkuqi)
- Alur (alz)
- Amharic (am)
- Ancient Egyptian (Egyptian Hieroglyphs script) (egy–egyptian_hieroglyphs)
- Ancient Greek (Cypriot script) (grc–cypriot)
- Arabic (ar)
- Aragonese (an)
- Armenian (hy)
- Armenian (Armenian script) (hy–armenian)
- Arpitan (frp)
- Assamese (as)
- Asturian (ast)
- Atayal (tay)
- Atikamekw (atj)
- Avaric (av)
- Avestan (Avestan script) (ae–avestan)
- Ayacucho Quechua (quy)
- Aymara (ay)
- Azerbaijani (Cyrillic script) (az–cyrillic)
- Azerbaijani (Latin script) (az–latin)
- Balinese (ban)
- Balinese (Balinese script) (ban–balinese)
- Bambara (bm)
- Bamun (Bamum script) (bax–bamum)
- Baoulé (bci)
- Bashkir (ba)
- Basque (eu)
- Bassa (Bassa Vah script) (bsq–bassa_vah)
- Batak Karo (btx)
- Batak Toba (bbc)
- Batak Toba (Batak script) (bbc–batak)
- Bavarian (bar)
- Belarusian (be)
- Belize Kriol English (bzj)
- Bemba (bem)
- Bengali (bn)
- Bishnupriya (bpy)
- Bislama (bi)
- Bosnian (bs)
- Breton (br)
- Buginese (bug)
- Buhid (Buhid script) (bku–buhid)
- Bulgarian (bg)
- Burmese (my)
- Carian (Carian script) (xcr–carian)
- Catalan (ca)
- Cebuano (ceb)
- Central Bikol (bcl)
- Central Kurdish (ckb)
- Chakma (Chakma script) (ccp–chakma)
- Chamorro (ch)
- Chechen (ce)
- Cherokee (chr)
- Cherokee (Cherokee script) (chr–cherokee)
- Cheyenne (chy)
- Chinese Simplified (zh)
- Chinese Traditional (zh-TW)
- Chol (ctu)
- Chorasmian (Chorasmian script) (xco–chorasmian)
- Church Slavic (cu)
- Chuvash (cv)
- Chuwabu (chw)
- Coptic (Coptic script) (cop–coptic)
- Cornish (kw)
- Corsican (co)
- Cree (cr)
- Crimean Tatar (crh)
- Croatian (hr)
- Cusco Quechua (quz)
- Czech (cs)
- Dagbani (dag)
- Danish (da)
- Dehu (dhv)
- Dhivehi (dv)
- Dhivehi (Dives Akuru script) (dv–dives_akuru)
- Dhivehi (Thaana script) (dv–thaana)
- Dimli (diq)
- Dinka (din)
- Dogri (Dogra script) (doi–dogra)
- Dogri (Takri script) (doi–takri)
- Dotyali (dty)
- Dutch (nl)
- Dzongkha (dz)
- Eastern Mari (mhr)
- Efik (efi)
- Egyptian Arabic (arz)
- Emilian-Romagnol (eml)
- English (en)
- Erzya (myv)
- Esperanto (eo)
- Estonian (et)
- Ewe (ee)
- Extremaduran (ext)
- Faroese (fo)
- Fiji Hindi (hif)
- Fijian (fj)
- Finnish (fi)
- French (fr)
- Friulian (fur)
- Fulah (ff)
- Fulah (Adlam script) (ff–adlam)
- Ga (gaa)
- Gagauz (gag)
- Galician (gl)
- Ganda (lg)
- Georgian (ka)
- German (de)
- Gilaki (glk)
- Gilbertese (gil)
- Gitonga (toh)
- Goan Konkani (gom)
- Gondi (Masaram Gondi script) (gon–masaram_gondi)
- Gorontalo (gor)
- Gothic (got)
- Gothic (Gothic script) (got–gothic)
- Greek (el)
- Guarani (gn)
- Guianese Creole French (gcr)
- Gujarati (gu)
- Gujarati (Gujarati script) (gu–gujarati)
- Gun (guw)
- Haitian (ht)
- Hakka Chinese (hak)
- Hausa (ha)
- Hawaiian (haw)
- Hebrew (he)
- Hieroglyphic Luwian (Anatolian Hieroglyphs script) (hlu–anatolian_hieroglyphs)
- Hiligaynon (hil)
- Hindi (hi)
- Hindi (Mahajani script) (hi–mahajani)
- Hiri Motu (ho)
- Hmong (hmn)
- Hmong (Pahawh Hmong script) (hmn–pahawh_hmong)
- Ho (Warang Citi script) (hoc–warang_citi)
- Hungarian (hu)
- Hungarian (Old Hungarian script) (hu–old_hungarian)
- Icelandic (is)
- Ido (io)
- Igbo (ig)
- Iloko (ilo)
- Imbabura Highland Quichua (qvi)
- Inari Sami (smn)
- Indonesian (id)
- Ingush (inh)
- Interlingue (ie)
- Inuktitut (iu)
- Inupiaq (ik)
- Irish (ga)
- Isoko (iso)
- Isthmus Zapotec (zai)
- Italian (it)
- Jamaican Creole English (jam)
- Japanese (ja)
- Javanese (jv)
- Javanese (Javanese script) (jv–javanese)
- Kabardian (kbd)
- Kabiyè (kbp)
- Kabyle (kab)
- Kalaallisut (kl)
- Kalmyk (xal)
- Kamba (kam)
- Kannada (kn)
- Kaonde (kqn)
- Kara-Kalpak (kaa)
- Karachay-Balkar (krc)
- Kashmiri (ks)
- Kashubian (csb)
- Kazakh (kk)
- Kekchí (kek)
- Khmer (km)
- Khmer (Khmer script) (km–khmer)
- Kikuyu (ki)
- Kinyarwanda (rw)
- Kirghiz (ky)
- Kitan (Khitan Small Script script) (zkt–khitan_small_script)
- Komi (kv)
- Komi (Old Permic script) (kv–old_permic)
- Komi-Permyak (koi)
- Kongo (kg)
- Korean (ko)
- Kosraean (kos)
- Kotava (avk)
- Krio (kri)
- Kuanyama (kj)
- Kurdish (ku)
- Kurdish (Yezidi script) (ku–yezidi)
- Kwangali (kwn)
- Ladino (lad)
- Lak (lbe)
- Lao (lo)
- Latgalian (ltg)
- Latin (la)
- Latvian (lv)
- Lepcha (Lepcha script) (lep–lepcha)
- Lezghian (lez)
- Lezghian (Caucasian Albanian script) (lez–caucasian_albanian)
- Ligurian (lij)
- Limbu (Limbu script) (lif–limbu)
- Lingala (ln)
- Lisu (Lisu script) (lis–lisu)
- Lithuanian (lt)
- Livvi (olo)
- Lojban (jbo)
- Lower Sorbian (dsb)
- Lozi (loz)
- Luba-Katanga (lu)
- Luba-Lulua (lua)
- Lunda (lun)
- Luo (luo)
- Luvale (lue)
- Luxembourgish (lb)
- Lycian (Lycian script) (xlc–lycian)
- Lydian (Lydian script) (xld–lydian)
- Lü (New Tai Lue script) (khb–new_tai_lue)
- Macedo-Romanian (rup)
- Macedonian (mk)
- Madurese (mad)
- Maithili (mai)
- Maithili (Tirhuta script) (mai–tirhuta)
- Makasar (Makasar script) (mak–makasar)
- Makhuwa (vmw)
- Malagasy (mg)
- Malay (ms)
- Malayalam (ml)
- Malayalam (Malayalam script) (ml–malayalam)
- Maltese (mt)
- Mandaic (Mandaic script) (mid–mandaic)
- Manipuri (mni)
- Manipuri (Meetei Mayek script) (mni–meetei_mayek)
- Manx (gv)
- Maori (mi)
- Marathi (mr)
- Marathi (Modi script) (mr–modi)
- Marshallese (mh)
- Mazanderani (mzn)
- Medefaidrin (Medefaidrin script) (dmf–medefaidrin)
- Mende (Mende Kikakui script) (men–mende_kikakui)
- Meroitic (Meroitic Cursive script) (xmr–meroitic_cursive)
- Meroitic (Meroitic Hieroglyphs script) (xmr–meroitic_hieroglyphs)
- Metlatónoc Mixtec (mxv)
- Min Dong Chinese (cdo)
- Minangkabau (min)
- Mingrelian (xmf)
- Moksha (mdf)
- Mon (mnw)
- Mongolian (mn)
- Mongolian (Mongolian script) (mn–mongolian)
- Moroccan Arabic (ary)
- Mossi (mos)
- Mru (Mro script) (mro–mro)
- N'Ko (nqo)
- Nauru (na)
- Navajo (nv)
- Ndau (ndc)
- Ndonga (ng)
- Neapolitan (nap)
- Nepali (ne)
- Newari (new)
- Newari (Newa script) (new–newa)
- Nias (nia)
- Northern Frisian (frr)
- Northern Sami (se)
- Norwegian (no)
- Norwegian Nynorsk (nn)
- Nyaneka (nyk)
- Nyanja (ny)
- Nyungwe (nyu)
- Nzima (nzi)
- Occitan (oc)
- Official Aramaic (arc)
- Official Aramaic (Imperial Aramaic script) (arc–imperial_aramaic)
- Official Aramaic (Palmyrene script) (arc–palmyrene)
- Old English (ang)
- Old Persian (Old Persian script) (peo–old_persian)
- Old Uighur (Old Turkic script) (oui–old_turkic)
- Old Uighur (Old Uyghur script) (oui–old_uyghur)
- Oriya (or)
- Oromo (om)
- Osage (Osage script) (osa–osage)
- Ossetian (os)
- Pahlavi (Inscriptional Pahlavi script) (pal–inscriptional_pahlavi)
- Pahlavi (Psalter Pahlavi script) (pal–psalter_pahlavi)
- Pali (pi)
- Pampanga (pam)
- Pangasinan (pag)
- Panjabi (pa)
- Papantla Totonac (top)
- Papiamento (pap)
- Paraguayan Guaraní (gug)
- Parthian (Inscriptional Parthian script) (xpr–inscriptional_parthian)
- Pedi (nso)
- Persian (fa)
- Phoenician (Phoenician script) (phn–phoenician)
- Piemontese (pms)
- Pijin (pis)
- Pitcairn-Norfolk (pih)
- Pohnpeian (pon)
- Polish (pl)
- Portuguese (pt)
- Primitive Irish (Ogham script) (pgl–ogham)
- Pushto (ps)
- Quechua (qu)
- Quechua (que)
- Rejang (Rejang script) (rej–rejang)
- Rohingya (Hanifi Rohingya script) (rhg–hanifi_rohingya)
- Romanian (ro)
- Romansh (rm)
- Ronga (rng)
- Rundi (run)
- Russia Buriat (bxr)
- Russian (ru)
- Rusyn (rue)
- S'gaw Karen (ksw)
- Sakizaya (szy)
- Samoan (sm)
- Samogitian (sgs)
- San Salvador Kongo (kwy)
- Sango (sg)
- Sanskrit (sa)
- Sanskrit (Bhaiksuki script) (sa–bhaiksuki)
- Sanskrit (Grantha script) (sa–grantha)
- Sanskrit (Sharada script) (sa–sharada)
- Sanskrit (Siddham script) (sa–siddham)
- Santali (sat)
- Santali (Ol Chiki script) (sat–ol_chiki)
- Saraiki (skr)
- Saraiki (Multani script) (skr–multani)
- Sardinian (sc)
- Saterfriesisch (stq)
- Saurashtra (Saurashtra script) (saz–saurashtra)
- Scots (sco)
- Scottish Gaelic (gd)
- Sena (seh)
- Serbian (Cyrillic script) (sr–cyrillic)
- Serbian (Latin script) (sr–latin)
- Seselwa Creole French (crs)
- Shan (shn)
- Shona (sn)
- Sichuan Yi (Yi script) (ii–yi)
- Sicilian (scn)
- Sindhi (sd)
- Sindhi (Khudawadi script) (sd–khudawadi)
- Sinhala (si)
- Slovak (sk)
- Slovenian (sl)
- Sogdian (Old Sogdian script) (sog–old_sogdian)
- Sogdian (Sogdian script) (sog–sogdian)
- Somali (so)
- Somali (Osmanya script) (so–osmanya)
- Sora (Sora Sompeng script) (srb–sora_sompeng)
- South Azerbaijani (azb)
- Southern Altai (alt)
- Southern Kisi (kss)
- Southern Sotho (st)
- Spanish (es)
- Sranan Tongo (srn)
- Sundanese (su)
- Sundanese (Sundanese script) (su–sundanese)
- Swahili (sw)
- Swati (ss)
- Swedish (sv)
- Sylheti (Syloti Nagri script) (syl–syloti_nagri)
- Tachelhit (shi)
- Tagalog (tl)
- Tagalog (Tagalog script) (tl–tagalog)
- Tagbanwa (Tagbanwa script) (tbw–tagbanwa)
- Tahitian (ty)
- Tai Dam (Tai Viet script) (blt–tai_viet)
- Tai Nüa (Tai Le script) (tdd–tai_le)
- Tajik (tg)
- Tamil (ta)
- Tangut (Tangut script) (txg–tangut)
- Tarantino (roa-tara)
- Taroko (trv)
- Tase Naga (Tangsa script) (nst–tangsa)
- Tatar (tt)
- Telugu (te)
- Tetela (tll)
- Tetun Dili (tdt)
- Thai (th)
- Tibetan (bo)
- Tigrinya (ti)
- Tiv (tiv)
- Tojolabal (toj)
- Tok Pisin (tpi)
- Tonga (Nyasa) (tog)
- Tonga (Tonga Islands) (to)
- Tonga (Zambia & Zimbabwe) (toi)
- Tosk Albanian (als)
- Toto (Toto script) (txo–toto)
- Tsonga (ts)
- Tswa (tsc)
- Tswana (tn)
- Tulu (tcy)
- Tumbuka (tum)
- Turkish (tr)
- Turkmen (tk)
- Tuvinian (tyv)
- Twi (tw)
- Tzotzil (tzo)
- Udmurt (udm)
- Ugaritic (Ugaritic script) (uga–ugaritic)
- Uighur (ug)
- Uighur (Cyrillic script) (ug–cyrillic)
- Ukrainian (uk)
- Umbundu (umb)
- Upper Sorbian (hsb)
- Urdu (ur)
- Uzbek (Cyrillic script) (uz–cyrillic)
- Uzbek (Latin script) (uz–latin)
- Vai (Vai script) (vai–vai)
- Venda (ve)
- Veps (vep)
- Vietnamese (vi)
- Vlaams (vls)
- Volapük (vo)
- Võro (vro)
- Wallisian (wls)
- Walloon (wa)
- Wancho Naga (Wancho script) (nnp–wancho)
- Waray (war)
- Welsh (cy)
- Western Frisian (fy)
- Western Mari (mrj)
- Western Panjabi (pnb)
- Wolof (wo)
- Xhosa (xh)
- Yakut (sah)
- Yapese (yap)
- Yiddish (yi)
- Yoruba (yo)
- Yucateco (yua)
- Zhuang (za)
- Zulu (zu)