Shifting From Language Names To Language Codes

Kalev Leetaru

4 years ago

With the launch of Translingual 2.0 and GDELT's new in-house 440-language+script language detector, GDELT 3.0 will see a major expansion of GDELT's monitoring of news coverage in long-tail languages. Historically, GDELT has relied on CLD2 for its language detection needs and has used CLD2's spelled-out language name in the language field of many of its datasets, adopting its idiosyncrasies like "ChineseT" and the titlecased "Japanese" versus uppercased "THAI."

With GDELT's forthcoming shift to its own language detection infrastructure, and the attendant large increase in the number of languages it supports, GDELT's datasets will transition in the coming months to using language codes rather than language names in all fields that specify an article's language. A combination of ISO 639-1 and ISO 639-2 codes will be used, with 639-1 taking precedence. A 639-2 code will be used where the given language has no 639-1 code, or where the detector was able to identify the 639-2 language family but was unable to narrow its detection to a specific 639-1 code.
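The precedence rule described above can be illustrated with a minimal Python sketch. The function name `select_language_code` and its two parameters are hypothetical and are not part of any GDELT tooling; the sketch simply assumes the detector reports an optional 639-1 code and an optional 639-2 code for each article.

```python
from typing import Optional


def select_language_code(iso639_1: Optional[str], iso639_2: Optional[str]) -> str:
    """Pick the code to store for an article (hypothetical helper).

    Assumption: the caller passes whichever codes the detector produced,
    using None or "" when a code is unavailable.
    """
    # ISO 639-1 takes precedence whenever a two-letter code exists.
    if iso639_1:
        return iso639_1.strip().lower()
    # Otherwise fall back to the three-letter ISO 639-2 code, which covers
    # languages without a 639-1 code as well as collective (family) codes.
    if iso639_2:
        return iso639_2.strip().lower()
    raise ValueError("no ISO 639-1 or ISO 639-2 code available")


# Norwegian Bokmål has both codes, so the 639-1 code is chosen.
print(select_language_code("nb", "nob"))   # nb
# Cebuano has no 639-1 code, so the 639-2 code is used.
print(select_language_code(None, "ceb"))   # ceb
# "ber" is the 639-2 collective code for Berber languages.
print(select_language_code(None, "ber"))   # ber
```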

The standardization from language names to language codes ensures specificity (distinguishing a language family from a specific language or dialect, such as "Norwegian" versus "nn" and "nb") and sidesteps differences in naming conventions ("Burmese" vs "Myanmar" or "Odia" vs "Oriya").

Moving forward, applications processing the "language" field of any GDELT dataset should be designed to handle any of three forms: a CLD2 spelled-out name, a two-character ISO 639-1 code, or a three-character ISO 639-2 code. Processing logic should also be flexible enough to accommodate changes in the contents of these fields as the ISO standards evolve.
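As a rough illustration of the handling described above, the following Python sketch classifies a "language" field value into one of the three forms. The function name `classify_language_field` and the returned labels are hypothetical. The length-based heuristic is also an assumption: it treats any two- or three-letter alphabetic value as a code, so a spelled-out name that short would be misclassified and would need an explicit exception list.

```python
import re
from typing import Tuple

# Two or three ASCII letters: treated as an ISO 639-1 or 639-2 code (assumption).
_CODE_RE = re.compile(r"^[A-Za-z]{2,3}$")


def classify_language_field(value: str) -> Tuple[str, str]:
    """Return (kind, normalized_value) for a language field (hypothetical helper).

    kind is one of "iso639-1", "iso639-2", "name", or "unknown".
    """
    text = value.strip()
    if not text:
        return ("unknown", "")
    if _CODE_RE.match(text):
        kind = "iso639-1" if len(text) == 2 else "iso639-2"
        return (kind, text.lower())
    # Anything else is treated as a CLD2 spelled-out name. Case-folding
    # removes the titlecase/uppercase inconsistency ("Japanese" vs "THAI").
    return ("name", text.casefold())


for raw in ["THAI", "Japanese", "ChineseT", "nb", "ceb"]:
    print(raw, classify_language_field(raw))
```

Because the function does not validate codes against a fixed list, values introduced by future revisions of the ISO standards pass through without code changes; callers that need validation can check the normalized value against their own lookup table.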