The Perils Of LLMs For Translation Tasks On Lower-Resource Languages: Estonian Noun Declension

One of the great promises of large language models (LLMs) is their ability to revolutionize translation and linguistic tasks. A common challenge of at-scale multilingual analysis is the need to perform high-recall, low-precision prefiltering before passing content to more expensive analytic methods. For example, an LLM-based workflow interested only in electric vehicles would not want to run an LLM over the full contents of millions of articles per day, in every language and on every topic imaginable, given that only a small percentage of those articles will be relevant: the monetary and compute cost would be too high. Instead, prefiltering is used to identify any articles mentioning related topics, and only those articles are passed on to the LLM. While a percentage of them will represent casual mentions or be otherwise unrelated, this dramatically reduces the volume of content that must be examined.
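To make the shape of such a pipeline concrete, here is a minimal sketch of a keyword prefilter in front of an LLM stage. The keyword list, sample articles, and function names are hypothetical illustrations, not GDELT's actual pipeline:

```python
# Minimal sketch of a high-recall keyword prefilter in front of an LLM stage.
# Keywords, articles, and names here are illustrative only.
EV_KEYWORDS = ("electric vehicle", "ev charging", "battery electric")

articles = [
    {"url": "https://example.com/1", "text": "New electric vehicle sales rose sharply this quarter."},
    {"url": "https://example.com/2", "text": "Local football results from the weekend."},
]

def prefilter(articles):
    """Yield only articles mentioning at least one target keyword (high recall, low precision)."""
    for article in articles:
        text = article["text"].lower()
        if any(kw in text for kw in EV_KEYWORDS):
            yield article

for article in prefilter(articles):
    # Only this small surviving fraction is passed to the expensive LLM stage.
    print("Send to LLM:", article["url"])
```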

A modern workflow will typically make use of embeddings for topical searches, but current multilingual embedding models still have fairly poor coverage outside of English and a handful of other languages. While the models may be advertised as supporting hundreds of languages, their actual ability to encode topics into similar semantic spaces across languages tends to degrade sharply beyond the most heavily represented languages.
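One way to spot-check this kind of degradation is to compare cross-lingual similarity scores directly. The sketch below assumes the sentence-transformers library and its paraphrase-multilingual-MiniLM-L12-v2 model; the sentence pairs are illustrative:

```python
# Sketch of spot-checking cross-lingual alignment of a multilingual
# embedding model. Model choice and sentence pairs are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

pairs = [
    ("Electric vehicle sales are rising.",                # English
     "Elektriautode müük kasvab."),                       # Estonian
    ("Electric vehicle sales are rising.",
     "Las ventas de vehículos eléctricos aumentan."),     # Spanish
]

for en, other in pairs:
    emb = model.encode([en, other])
    score = util.cos_sim(emb[0], emb[1]).item()
    # If both languages were encoded into the same semantic space, these
    # scores should be comparably high; lower-resource languages often
    # score noticeably worse.
    print(f"{other!r}: cosine similarity = {score:.3f}")
```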

More to the point, however, often the interest is in a specific entity, such as the name of a company, location, disease, mineral or other specific name. Embedding search is less suited to such queries, since embeddings are designed by nature to abstract away from specific names and often fail to separate related companies well enough to distinguish them. This is where keyword search is used. Machine translation can be used to translate a given name into its most common form in each language, an approach we've demonstrated for use cases including disease and economic issues, but for morphologically rich languages this captures just one of many possible representations.
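To see the limitation concretely, consider standard word-boundary keyword matching against Estonian text using only the machine-translated base form of a name. The snippets below are illustrative; the declined forms are examined in detail next:

```python
import re

# The machine-translated keyword captures only one surface form of the name.
keyword = re.compile(r"\bNew York\b")

# Illustrative Estonian snippets using declined forms of "New York".
snippets = [
    "Ta elab New Yorgis.",      # "She lives in New York." (inessive)
    "Lend New Yorki hilines.",  # "The flight to New York was delayed." (illative)
]

for snippet in snippets:
    # Both declined forms slip past the exact keyword: prints False twice.
    print(bool(keyword.search(snippet)), snippet)
```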

Could LLMs offer a solution?

Let's start with Estonian, which uses noun declension. Despite grammatical rules governing declension, for "lower-frequency items or novel nouns, we find variable declension even among adult speakers." This represents a classic conundrum in linguistics: the difference between the authoritative grammatical "rules" of a language and how it is actually used in practice.

For example, in theory, the name "New York" could be represented in Estonian as "New York", "New Yorki", "New Yorgi", "New Yorgisse", "New Yorgis", "New Yorgist", "New Yorgile", "New Yorgil", "New Yorgilt", "New Yorgiks", "New Yorgini", "New Yorgina", "New Yorgita", and "New Yorgiga". The Estonian House in New York City is "New Yorgi Eesti Maja", while this Foreign Ministry page uses "New Yorgini" and this news article uses four different forms in a single article: "New Yorgini", "New Yorgi", "New Yorgis" and "New Yorki". Capturing all of these various forms of a name across all of the world's languages requires intricately detailed knowledge of each language.
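Because these forms share a common stem, the full set can in principle be captured with a single stem-based pattern. Here is a sketch; the suffix list is transcribed from the forms above and is illustrative rather than exhaustive:

```python
import re

# One stem-based pattern covering the declined forms of "New York" listed
# above. Real-world coverage requires exactly this kind of per-name,
# per-language morphological detail.
NEW_YORK = re.compile(
    r"\bNew Yor(?:k|ki|gi(?:sse|s|st|le|l|lt|ks|ni|na|ta|ga)?)\b"
)

text = "New Yorgini, New Yorgi, New Yorgis ja New Yorki"
print(NEW_YORK.findall(text))
# -> ['New Yorgini', 'New Yorgi', 'New Yorgis', 'New Yorki']
```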

What if we just asked an LLM to give us the list of terms? Let's try ChatGPT (GPT-3.5):

Provide a list of all of the Estonian words for "New York". Just provide a list of words separated by commas.
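For anyone who wants to repeat the experiment programmatically rather than through the ChatGPT interface, a sketch along these lines could be used. This assumes the openai Python client and an OPENAI_API_KEY environment variable, with "gpt-3.5-turbo" standing in for the ChatGPT model above; the original runs were done in ChatGPT itself:

```python
# Sketch of repeating the prompt five times via the API (assumptions noted
# above; not how the original ChatGPT runs were performed).
from openai import OpenAI

client = OpenAI()
PROMPT = ('Provide a list of all of the Estonian words for "New York". '
          "Just provide a list of words separated by commas.")

for run in range(1, 6):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Comparing runs shows how unstable the returned list of forms is.
    print(f"Run {run}:", response.choices[0].message.content)
```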

Run five different times, we get:

What if we explicitly ask the LLM about declension:

Provide a list of all of the Estonian words for "New York" given noun declension. Just provide a list of words separated by commas.

This yields:

This offers a key reminder that for lower-resource languages like Estonian, which enjoys much greater representation than most such languages due to being a European Union language, LLMs may struggle with translation and linguistic tasks such as keyword expansion.