Using Wikipedia To Normalize The Classic Global Entity Graph (GEG) G1 Baseline

Last week we unveiled the new Wikipedia-normalized enrichment of the Classic Global Entity Graph (GEG) G1 Baseline, applying it retroactively to the full 2019 historical backfile and to the live stream. This new enrichment takes each extracted entity and checks it against the English Wikipedia page title and redirect lists, recording any matches. The resulting system can recognize myriad name variants for major public figures, such as the many names used for Donald Trump, and can even map names and references to complex topics and evolving events. How does this process work?

In addition to the article titles that Wikipedia users are familiar with, there is a second, invisible dataset that users don't directly see but that ensures they are guided to the right entry when searching. Take the official page of the current president of the United States, titled simply "Donald Trump." Visitors to Wikipedia might instead type "President Trump" or "President Donald Trump" or any of myriad alternative and informal names. To maximize its usability by the public, Wikipedia must ensure that searches for all of these name variants direct the user to the master "Donald Trump" page.

Wikipedia does this through the use of redirects, which map words and phrases to specific Wikipedia pages. Each phrase must redirect to exactly one Wikipedia entry (though many different phrases can point to the same entry), and some phrases redirect to disambiguation pages. Thus, a search for the term "Cambridge" on Wikipedia will always return the city in England rather than the one in Massachusetts, because the site anticipates that the majority of unqualified searches for the name are intended for the British city.
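As a rough illustration of the structure described above, the title and redirect lists can be thought of as a set of page titles plus a lookup table from phrases to target titles. The sketch below is a toy model only: the names PAGE_TITLES, REDIRECTS and resolve are hypothetical, and the handful of entries are hand-picked samples rather than real Wikipedia data.

```python
# Toy model of Wikipedia's titles and redirects (hypothetical sample entries).
PAGE_TITLES = {"Donald Trump", "Cambridge"}

# Each phrase maps to exactly one target page; several phrases may share a target.
REDIRECTS = {
    "President Trump": "Donald Trump",
    "President Donald Trump": "Donald Trump",
}


def resolve(phrase):
    """Return the page title a phrase leads to, or None if it is unknown."""
    if phrase in PAGE_TITLES:
        return phrase  # the phrase is itself a page title
    return REDIRECTS.get(phrase)  # otherwise follow the redirect, if any


print(resolve("President Trump"))
print(resolve("Cambridge"))
print(resolve("Some Unlisted Phrase"))
```

Because the table is keyed by phrase, a given phrase can never point to two pages at once, while any number of phrases can share the same target.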

The GEG searches for an exact match from each extracted entity to a Wikipedia page or redirect. Matches are currently required to be exact – honorific titles and contextualizing words are not removed at this time. In addition, no contextual disambiguation is performed – extracted entities are simply checked for an exact match as-is.

To correctly handle acronyms, which often repurpose common words that mean one thing in lowercase but refer to a specific organization or concept when capitalized, the GEG uses a special ruleset for candidate capitalization. Each extracted entity is first checked for an as-is match and, if none is found, checked again as-is with any possessives like "'s" removed. Proper names and acronyms (both all-caps and mixed-caps variants like "hoMe") are typically matched at this stage. If there is still no match, the entity is converted to titlecase using Unicode rules and a final match is attempted.
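The matching order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GEG's actual code: TITLE_INDEX, strip_possessive and match_entity are hypothetical names, the index holds a few hand-picked sample entries, Python's built-in str.title() stands in for the Unicode titlecasing step, and applying titlecase to the possessive-stripped form is an assumption, since the text does not say which form is titlecased.

```python
import re

# Hypothetical stand-in for the combined title and redirect lists:
# each key is a title or redirect phrase, each value is the target page title.
TITLE_INDEX = {
    "Donald Trump": "Donald Trump",
    "President Trump": "Donald Trump",
    "NATO": "NATO",
    "World Health Organization": "World Health Organization",
}

# A trailing possessive "'s", written with a straight or a curly apostrophe.
_POSSESSIVE = re.compile(r"['\u2019]s$")


def strip_possessive(entity):
    """Remove a trailing possessive such as "'s" from the entity."""
    return _POSSESSIVE.sub("", entity)


def match_entity(entity, index=TITLE_INDEX):
    """Return the matched page title, or None if no step finds a match."""
    # Step 1: as-is match, preserving the original capitalization.
    if entity in index:
        return index[entity]
    # Step 2: as-is match with any possessive removed.
    stripped = strip_possessive(entity)
    if stripped in index:
        return index[stripped]
    # Step 3: titlecase fallback (assumption: applied to the stripped form).
    return index.get(stripped.title())


for name in ["NATO", "President Trump's", "world health organization", "nato"]:
    print(name, "->", match_entity(name))
```

With this sample index, "President Trump's" resolves at the possessive step and "world health organization" resolves only at the titlecase step. The lowercase "nato" finds no match because its titlecased form "Nato" is not in the index, which shows why the as-is checks run before any case conversion.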

While simplistic, this workflow yields a powerful new contextualizing signal, normalizing name variants and mapping names to the concepts and events they relate to. We're tremendously excited to see how you leverage it!