Generative AI Experiments: Why LLM-Based Geocoders Struggle

Over the past two days we've explored how advanced LLMs like GPT-4 struggle significantly with both candidate extraction and toponymic disambiguation. The former refers to examining a block of text and identifying potential geographic mentions, while the latter refers to determining which of the (potentially myriad) places on earth sharing that name a given mention refers to. For example, take the sentence "He went to school in Paris, just south of Chicago." Candidate extraction would identify "Paris" and "Chicago" as likely location mentions based on context, while toponymic disambiguation would determine that "Paris" in this case refers to Paris, Illinois, not the vastly more famous Paris, France, given the qualifying context "just south of Chicago".
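To make the two stages concrete, here is a minimal runnable sketch in Python. The tiny hard-coded gazetteer stands in for a real one such as GeoNames, and all names, fields, and coordinates are illustrative assumptions, not anyone's actual pipeline:

```python
import re

# Toy gazetteer standing in for a real one such as GeoNames; entries,
# fields, and coordinates are purely illustrative.
TOY_GAZETTEER = {
    "Paris": [
        {"admin1": "Ile-de-France", "country": "FR", "lat": 48.86, "lon": 2.35},
        {"admin1": "Illinois", "country": "US", "lat": 39.61, "lon": -87.70},
    ],
    "Chicago": [
        {"admin1": "Illinois", "country": "US", "lat": 41.88, "lon": -87.63},
    ],
}

def extract_candidates(text):
    """Stage 1: candidate extraction -- flag capitalized tokens found in the gazetteer."""
    return [t for t in re.findall(r"[A-Z][a-z]+", text) if t in TOY_GAZETTEER]

def disambiguate(mention, co_mentions):
    """Stage 2: toponymic disambiguation -- prefer the sense that shares a
    region with the other places mentioned nearby."""
    senses = TOY_GAZETTEER[mention]
    nearby = {s["admin1"] for m in co_mentions for s in TOY_GAZETTEER[m]}
    for sense in senses:
        if sense["admin1"] in nearby:
            return sense
    return senses[0]  # fall back to the first (most prominent) sense

text = "He went to school in Paris, just south of Chicago."
mentions = extract_candidates(text)
for m in mentions:
    print(m, "->", disambiguate(m, [o for o in mentions if o != m]))
# "Paris" resolves to the Illinois sense, because "Chicago" shares its region.
```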

In theory, LLMs should offer performance on both tasks that vastly exceeds that of contemporary classical geocoding systems, due to the sheer magnitude of their training data. The reality is far less impressive.

LLMs should excel at disambiguation, since they've likely seen nearly every possible phrase containing the word "Paris" and observed the contexts in which it was mentioned alongside French locations versus Illinois locations. In fact, in our own geographic modeling, we have frequently built disambiguation "mini models" that perform this kind of correlative contextual disambiguation for internal use cases. Unfortunately, disambiguation frequently requires true spatial reasoning and involves combinations of locations not commonly found in training data. Thus, even the most simplistic formulations, trivially understood by humans, escape even the most advanced LLMs like GPT-4.
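We don't detail the internals of those mini models here; purely as a hedged illustration of what correlative contextual disambiguation looks like, the sketch below scores each sense of an ambiguous name by how many of its historically co-occurring context words appear near the mention. The sense keys and vocabularies are invented for the example:

```python
# Illustrative only -- not the actual internal mini models. Each sense of an
# ambiguous toponym carries a bag of words learned from past co-occurrences;
# the sense whose vocabulary best overlaps the mention's context wins.
SENSE_CONTEXTS = {
    ("Paris", "FR"):    {"france", "seine", "eiffel", "louvre"},
    ("Paris", "US-IL"): {"illinois", "chicago", "edgar", "county"},
}

def disambiguate(name, context_words):
    senses = {s: v for s, v in SENSE_CONTEXTS.items() if s[0] == name}
    return max(senses, key=lambda s: len(senses[s] & context_words))

context = {"he", "went", "school", "just", "south", "chicago"}
print(disambiguate("Paris", context))  # ('Paris', 'US-IL')
```

The failure mode described above is visible even in this toy: a phrase like "the town 40km east of the lake" contains no correlated vocabulary at all, and only genuine spatial reasoning over coordinates can resolve it.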

Candidate extraction is both simplified by the fixed universe of geographic names in major gazetteers and complicated by the infinite number of contexts in which those names can appear or be used in a non-geographic sense. LLMs should therefore be ideally suited for extractive tasks, since they've likely seen nearly every possible way a given name can appear in context, all the ways that same context has been used for other geographic names and for non-geographic names, and all of the various locative additions like administrative divisions and hyperlocalized forms.
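For contrast, here is a sketch of the classical baseline that the fixed universe of names makes possible: because gazetteer names form a closed set, extraction can be exhaustive dictionary matching, which guarantees recall over known names while context problems surface as precision errors. The tiny name list and the capitalization heuristic are illustrative assumptions:

```python
# Sketch of classical gazetteer matching: exhaustive lookup over a closed
# name set. The name list and capitalization heuristic are illustrative.
GAZETTEER_NAMES = {"paris", "chicago", "mobile", "moscow"}

def extract_candidates(text):
    hits = []
    for i, token in enumerate(text.split()):
        bare = token.strip(".,;:!?")
        # Require capitalization to skip non-geographic uses like "a mobile
        # phone" -- exactly the context problem the paragraph describes.
        if bare.lower() in GAZETTEER_NAMES and bare[:1].isupper():
            hits.append((bare, i))
    return hits

print(extract_candidates("He bought a mobile phone in Mobile, just before Paris."))
# [('Mobile', 6), ('Paris', 9)]
```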

In practice, advanced LLMs tend to exhibit extreme Western bias in their ability to identify location mentions, correctly extracting popular geographic names heavily represented in their training data while failing miserably at identifying more localized names less prevalent on the Western web. They also exhibit strong instability, changing the list of names they extract with each run and even excluding major capital cities like Moscow at random. Worse, they hallucinate, confidently asserting that cities were mentioned which are not even alluded to in the text.
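Observations like these come from repeated-run testing. A minimal sketch of such a harness follows, where `llm_extract_locations` is a hypothetical wrapper around whatever model API is under test and the gold set is hand-labeled:

```python
# Hypothetical stability harness: call the model several times on the same
# text and compare each run's extractions to a hand-labeled gold set.
GOLD = {"Moscow", "Kyiv", "Minsk"}  # illustrative ground truth

def evaluate_stability(text, llm_extract_locations, runs=10):
    all_runs = [set(llm_extract_locations(text)) for _ in range(runs)]
    stable = set.intersection(*all_runs)  # names found on every single run
    union = set.union(*all_runs)          # names found on at least one run
    return {
        "missed_somewhere": GOLD - stable,    # e.g. Moscow dropped at random
        "hallucinated": union - GOLD,         # asserted but never mentioned
        "run_to_run_jitter": union - stable,  # appear on some runs, not others
    }
```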

Extraction is actually the more problematic of the two failures: the inability of GPT-4 to robustly compile a list of potential location mentions means we can't simply use it to comprehensively extract location mentions from a text and then hand them to a traditional disambiguating geocoder to translate into codified records. After all, if the LLM can miss more than half the location mentions in a text (including capital cities), hallucinate the presence of locations not actually mentioned, and conflate cities, counties, and even unrelated locations, with different results each time it is run, then it doesn't much matter how powerful the downstream disambiguation engine is: it is limited to whatever location candidates the LLM provides.
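The resulting ceiling is simple arithmetic: end-to-end recall can never exceed the extraction stage's recall, no matter how good the disambiguator. The figures below are hypothetical, chosen only to mirror the "more than half" failure rate described above:

```python
# Pipeline recall bound: the disambiguator can only resolve what the LLM
# hands it. All numbers here are hypothetical.
llm_extraction_recall = 0.45   # LLM finds 45% of the true mentions
disambiguator_accuracy = 0.98  # near-perfect classical disambiguation

end_to_end_recall = llm_extraction_recall * disambiguator_accuracy
print(f"{end_to_end_recall:.0%} of mentions correctly geocoded")  # 44%
```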

To be clear, LLM-based geocoders can work and have been deployed by many organizations focused mainly on common locations in largely Western news coverage that are highly represented in the training data of the LLMs. There are also countless examples of them working well in one-off demonstrations, as there are for all things LLM. Yet there is an immense difference between these contrived and trivial use cases and the kind of robust, production-grade global geocoding needed to understand world events at their earliest local glimmers. In our own at-scale global experimentation to date with both publicly accessible models and more advanced unpublished models, we have yet to find a model, prompt, or workflow that yields production-ready results that meet our needs. Moreover, for the kinds of simplistic applications where LLM geocoders yield tolerable results, classical fulltext geocoders offer the same results at a fraction of the cost, with orders-of-magnitude gains in speed, reasoning performance, robustness, and completeness.

Rather than advanced LLMs, the area where we have long seen the most promising results, long predating public awareness of LLMs, is the use of small, compact, and exceptionally efficient language models trained explicitly on the geographic domain. We have long leveraged several such "mini models" to greatly enhance geographic extraction in a number of internal use cases and have found them vastly superior even to highly tuned general models – a trend that has not gone unnoticed by the broader community across many domains.
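We don't name these mini models or their architecture here; purely as a sketch of the compact-domain-model pattern, one might fine-tune a small multilingual encoder for location tagging with Hugging Face transformers. The base model, label set, and training data below are placeholders, not our actual setup:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder choices: any compact multilingual encoder would do here.
BASE = "distilbert-base-multilingual-cased"
LABELS = ["O", "B-LOC", "I-LOC"]  # BIO tags marking location spans

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(
    BASE, num_labels=len(LABELS)
)
# Fine-tune on globally balanced, domain-specific location annotations
# (not shown), then serve at a tiny fraction of an LLM's per-document cost.
```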