Improving Multilingual Entity Identification And Disambiguation

As we look for ways to help the NLP community move towards truly multilingual approaches, one area of particular interest for many news-related applications is more robust multilingual entity identification and disambiguation, especially in unusual contexts.

  • Better recognition of lesser-known entities. Most of today's production and research-grade systems perform quite well at extracting well-known entities, even in uncertain and ambiguous contexts. Yet they consistently fail on lesser-known names that don't appear in sources like Wikipedia. Common names are correctly identified regardless of context, while less frequent names are either not recognized at all or are recognized only in precise, grammatically unambiguous contexts. This skews extracted entities towards the best-known names, which makes emergent stories and connections more difficult to recognize.
  • Better recognition of names from around the world. Even some of the most widely used commercial systems still exhibit a strong bias against names that are less common in the US and Europe, especially transliterated names that incorporate common English words. Unfortunately, despite immense growth in international training datasets, the transition from statistical to neural name recognition is actually degrading accuracy. The same issues hold for both personal and organization names. For example, startups with names in English or common European languages tend to be recognized at a far higher rate than those with names in languages like Chinese or Arabic, severely skewing analyses of topics like global entrepreneurship.
  • Disambiguation in unusual contexts. A related challenge is that extremely common names like Barack Obama will always be resolved to the former US president, regardless of context, even when they refer to an ordinary citizen who simply shares his name. This becomes even more complicated for extremely well-known public figures whose names are shared by large numbers of the public. While all modern production systems still attempt to disambiguate such names, in practice this disambiguation step is typically a no-op: the models are so skewed towards seeing the name refer exclusively to its best-known holder that they simply cannot resolve it to any alternative.
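
To make the "no-op" problem in the last point concrete, here is a minimal toy sketch in Python. It is not how any particular production system works. It assumes a hypothetical linker that scores each candidate as a log popularity prior plus a bounded context-overlap term, and the candidates, priors, and descriptor words are all invented for illustration. Under those assumptions, even a context that perfectly matches the lesser-known candidate cannot overcome a heavily skewed prior unless the context term is weighted far more strongly.

```python
import math

# Hypothetical candidates for the mention "Barack Obama":
# (label, popularity prior, descriptor words). All values are invented.
CANDIDATES = [
    ("former US president", 0.999,
     {"president", "white", "house", "senate", "election"}),
    ("private citizen", 0.001,
     {"plumber", "local", "resident", "council", "neighborhood"}),
]


def context_overlap(context_words, descriptor_words):
    """Jaccard overlap between context and descriptor words, in [0, 1]."""
    context = set(context_words)
    union = context | descriptor_words
    return len(context & descriptor_words) / len(union) if union else 0.0


def resolve(context_words, context_weight=1.0):
    """Return the label with the highest log-prior + weighted context score."""
    def score(candidate):
        _, prior, descriptors = candidate
        overlap = context_overlap(context_words, descriptors)
        return math.log(prior) + context_weight * overlap

    return max(CANDIDATES, key=score)[0]


# A context that matches the private citizen's descriptors exactly.
local_story = ["local", "plumber", "resident", "council", "neighborhood"]

# With the default weight, the skewed prior decides the outcome.
print(resolve(local_story))

# Only a much larger context weight lets the context win.
print(resolve(local_story, context_weight=10.0))
```

In this toy setup the context term is capped at 1.0 while the gap between the two log priors is roughly 6.9, so with the default weight the context can never change the answer. That is one simple way the disambiguation step can run and still have no effect.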

We'd love to see greater research in this space using the new 150-language Web News NGrams 3.0 dataset!