One of the great promises of large language models (LLMs) is their ability to revolutionize translation and linguistic tasks. A common challenge of at-scale multilingual analysis is the need to perform high-recall low-precision prefiltering before passing content to more expensive analytic methods. For example, an LLM-based workflow interested only in electric vehicles would not want to run an LLM over the full contents of millions of articles per day in every language and on every topic imaginable given that only a small percentage of those articles will be relevant: the monetary and compute cost would be too high. Instead, prefiltering is used to identify any articles mentioning related topics and only articles mentioning those topics are passed on to the LLM. While a percentage of those will represent casual mentions or be otherwise unrelated, this dramatically reduces the volume of content that must be examined.
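The prefiltering step described above can be sketched as a cheap keyword test placed in front of the expensive model. This is only a minimal illustration: the names `prefilter`, `run_pipeline` and `expensive_model` are hypothetical, and the case-insensitive substring match stands in for whatever high-recall filter a real workflow would use.

```python
from typing import Callable, Iterable, Iterator, List


def prefilter(articles: Iterable[str], keywords: Iterable[str]) -> Iterator[str]:
    """Yield only the articles that mention at least one keyword.

    Assumption: a case-insensitive substring test is an acceptable
    high-recall, low-precision filter for this illustration.
    """
    folded_keywords = [k.casefold() for k in keywords]
    for text in articles:
        folded_text = text.casefold()
        if any(k in folded_text for k in folded_keywords):
            yield text


def run_pipeline(
    articles: Iterable[str],
    keywords: Iterable[str],
    expensive_model: Callable[[str], object],
) -> List[object]:
    """Run the (hypothetical) expensive model only on prefiltered articles."""
    return [expensive_model(a) for a in prefilter(articles, keywords)]
```

Because the filter favors recall over precision, some casual mentions will still reach the model, but the model never sees the bulk of unrelated articles.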
A modern workflow will typically make use of embeddings for topical searches, but current multilingual embedding models still have fairly poor coverage outside of English and a handful of other languages. While the models may be advertised as supporting hundreds of languages, their actual ability to encode topics in similar semantic spaces across languages tends to degrade sharply outside of the most heavily represented languages.
More to the point, however, the interest is often in a specific entity, such as the name of a company, location, disease, mineral or other specific name. Embeddings are poorly suited to such searches, since they are designed by nature to abstract away from specific names and often separate related companies too poorly to distinguish them. This is where keyword search is used. Machine translation can be used to translate a given name into its most common form in each language, which we've demonstrated for use cases including disease and economic issues, but for morphologically rich languages this captures just one of many possible representations.
Could LLMs offer a solution?
Let's start with Estonian, which uses noun declension. Despite grammatical rules governing declension, for "lower-frequency items or novel nouns, we find variable declension even among adult speakers." This represents a classic conundrum in linguistics: the difference between the authoritative grammatical "rules" of a language and how it is actually used in practice.
For example, in theory, the name "New York" could be represented in Estonian as "New York", "New Yorki", "New Yorgi", "New Yorgisse", "New Yorgis", "New Yorgist", "New Yorgile", "New Yorgil", "New Yorgilt", "New Yorgiks", "New Yorgini", "New Yorgina", "New Yorgita", and "New Yorgiga". The Estonian House in New York City is "New Yorgi Eesti Maja", while this Foreign Ministry page uses "New Yorgini" and this news article uses four different forms in a single article: "New Yorgini", "New Yorgi", "New Yorgis" and "New Yorki". Capturing all of these various forms of a name across all of the world's languages requires intricately detailed knowledge of each language.
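To make the keyword-search problem concrete, here is a minimal sketch that compiles the forms listed above into a single pattern and compares it with a search for the bare name. The form list is copied from the text and is not an authoritative inventory of Estonian declension; `build_matcher` and `find_forms` are hypothetical helper names.

```python
import re

# Forms taken from the paragraph above; not a complete or authoritative list.
NEW_YORK_FORMS = [
    "New York", "New Yorki", "New Yorgi", "New Yorgisse", "New Yorgis",
    "New Yorgist", "New Yorgile", "New Yorgil", "New Yorgilt", "New Yorgiks",
    "New Yorgini", "New Yorgina", "New Yorgita", "New Yorgiga",
]


def build_matcher(forms):
    """Compile one regex that matches any listed form as a whole token."""
    # Longest forms first, so the alternation tries the most specific form
    # before its shorter prefixes.
    ordered = sorted(set(forms), key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in ordered)
    # The lookarounds stop a form from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


MATCHER = build_matcher(NEW_YORK_FORMS)


def find_forms(text):
    """Return every listed form found in the text, in order of appearance."""
    return [m.group(0) for m in MATCHER.finditer(text)]


# A whole-word search for the bare name misses the declined form in the
# Estonian House's name, while the expanded pattern finds it.
naive_hit = re.search(r"\bNew York\b", "New Yorgi Eesti Maja")
expanded_hits = find_forms("New Yorgi Eesti Maja")
```

The sketch only works because the list of forms is already known; producing that list for every name in every language is exactly the step that requires detailed linguistic knowledge.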
What if we just asked an LLM to give us the list of terms? Let's try ChatGPT (GPT-3.5):
Provide a list of all of the Estonian words for "New York". Just provide a list of words separated by commas.
Running this prompt five separate times, we get:
- In Estonian, "New York" is typically referred to as "New York." The name is not typically translated into Estonian, and it is commonly used as is. So, there are no alternative Estonian words for "New York."
- In Estonian, "New York" is typically referred to as "New York" without significant variation. However, the pronunciation may differ slightly due to the Estonian language's phonetic rules. Here are the Estonian words for "New York": New York. Please note that place names like "New York" are often not translated into other languages but are used as is, with some minor phonetic adjustments for pronunciation in the target language.
- In Estonian, "New York" is typically referred to as "New York." There is no specific translation or alternate names for "New York" in Estonian; it is commonly used as is.
- In Estonian, "New York" is typically referred to as "New York." Unlike some languages that may have alternative names or translations for place names, "New York" remains the same in Estonian. Therefore, there are no alternative Estonian words for "New York."
- In Estonian, "New York" is typically referred to as "New York." The name is not translated into Estonian, so there are no alternative words for it in the Estonian language.
What if we explicitly ask the LLM about declension?
Provide a list of all of the Estonian words for "New York" given noun declension. Just provide a list of words separated by commas.
This yields:
- In Estonian, the name "New York" does not undergo noun declension because it is treated as a proper noun, and proper nouns do not change their form based on cases in Estonian grammar. Therefore, "New York" remains "New York" in all cases and does not have different forms.
- In Estonian, the name "New York" does not undergo noun declension, as it is typically used in its original form without changes. Therefore, there are no specific noun declensions for "New York" in Estonian. It remains "New York" regardless of its grammatical context.
This offers a key reminder that even for a lower-resource language like Estonian, which has far greater representation than most such languages by virtue of being a European Union language, LLMs may struggle with translation and linguistic tasks such as keyword expansion.