Wikipedia Normalization For the Classic Global Entity Graph (GEG) G1 Baseline Dataset

One of the powerful features of the neural Global Entity Graph (GEG) is its ability to normalize a wide range of name variants into their root entities. For example, the names "U.S. Federal Reserve," "Federal Reserve," "Federal Reserve Board," "New York Fed," "Atlanta Fed," "St. Louis Fed" and even just "The Fed" and "Fed" all resolve to the same root entity, allowing analyses to abstract away from specific references to the actual entities being discussed.

In contrast, the classic GEG G1 baseline dataset has until now returned only the precise referent found in the text, making it hard to determine that a given set of names all refer to the same root entity.

Eagle-eyed viewers will have noticed that as of earlier this evening, the G1 baseline has added three new fields. The first is an article-level field called simply "lang" that returns the language of the article, using the human-readable label returned by CLD2. At this time, all articles will have a value of "ENGLISH" in this field, but adding this field brings the G1 baseline in line with the corresponding neural dataset and makes it easier for us to expand beyond English in the future.

The most important changes, however, are the two new entity-level fields.

The first is "nameNorm" and provides a grammatically normalized version of the entity name. This typically involves lowercasing the name and lemmatizing each component word of the name, including removing possessives. This normalized version still retains the actual name as it appeared in the article and thus does not link "The Fed" to the "Federal Reserve", but makes it easier to perform certain types of entity analyses.
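To make the normalization concrete, here is a minimal Python sketch of the steps described above (lowercasing, removing possessives, lemmatizing each word). It is not GDELT's actual implementation: the function name `name_norm` and the tiny `TOY_LEMMAS` table are hypothetical stand-ins, and a real pipeline would use a full lemmatizer.

```python
import re

# Hypothetical toy lemma table, for illustration only. A real pipeline
# would use a full lemmatizer rather than a hand-written dictionary.
TOY_LEMMAS = {"ruling": "rule", "judges": "judge"}

# Matches a trailing possessive marker: 's or a bare apostrophe
# (straight or curly) at the end of a token.
_POSSESSIVE = re.compile(r"(?:'s|’s|'|’)$")


def name_norm(name: str) -> str:
    """Lowercase a name, strip possessives, and lemmatize each word."""
    normalized = []
    for token in name.lower().split():
        token = _POSSESSIVE.sub("", token)
        normalized.append(TOY_LEMMAS.get(token, token))
    return " ".join(normalized)


print(name_norm("Amber Guyger"))
print(name_norm("The Fed"))
```

Note that "The Fed" simply becomes "the fed": the normalized form still reflects the name as written and is not resolved to "Federal Reserve".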

The second new entity-level field is called "wikipediaEntry" and involves looking up the entity on the English edition of Wikipedia to see whether it has a corresponding Wikipedia entry or redirect. Each week GDELT downloads the latest dump of the English Wikipedia and constructs a lookup that encompasses all page titles and all intra-English redirects. Any entity that has a match has the corresponding page title returned in this field. It is important to note that in this initial version of Wikipedia linking, no additional contextual disambiguation is performed and thus there will be a certain degree of error in this field. In addition, Wikipedia pages that feature clarifying parentheses are excluded at this time. Entities without a corresponding Wikipedia page will not have this field.
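The lookup described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not GDELT's code: the function names `build_lookup` and `link_entity` are hypothetical, matching is assumed to be case-insensitive, page titles are assumed to take precedence over redirects, and redirects that point at a parenthesized title are assumed to be skipped along with the titles themselves. The inputs stand in for the page titles and redirects extracted from a Wikipedia dump.

```python
from typing import Dict, Iterable, Optional


def build_lookup(titles: Iterable[str], redirects: Dict[str, str]) -> Dict[str, str]:
    """Map lowercased page titles and redirect sources to a page title."""
    lookup: Dict[str, str] = {}
    for title in titles:
        if "(" in title:
            continue  # skip pages with clarifying parentheses
        lookup[title.lower()] = title
    for source, target in redirects.items():
        if "(" in source or "(" in target:
            continue  # assumption: also skip redirects touching such pages
        # setdefault keeps an existing page-title match over a redirect
        lookup.setdefault(source.lower(), target)
    return lookup


def link_entity(lookup: Dict[str, str], name: str) -> Optional[str]:
    """Return the matching page title, or None if there is no match.

    No contextual disambiguation is attempted, mirroring the text above.
    """
    return lookup.get(name.lower())


lookup = build_lookup(
    titles=["Murder of Botham Jean", "Police brutality", "Mercury (planet)"],
    redirects={
        "Amber Guyger": "Murder of Botham Jean",
        "Excessive force": "Police brutality",
    },
)
print(link_entity(lookup, "excessive force"))
print(link_entity(lookup, "Mercury (planet)"))
```

An entity for which `link_entity` returns `None` would simply be emitted without a "wikipediaEntry" field.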

The end result is that for an article like this AP wire story "Dallas dismissed from lawsuit over police shooting", a subset of entities contain this new "wikipediaEntry" field:

{ "wikipediaEntry": "Murder of Botham Jean",  "name": "Amber Guyger", "type": "PROPER", "nameNorm": "amber guyger", "numMentions": 3, "avgSalience": 0.52}
{ "wikipediaEntry": "Murder of Botham Jean",  "name": "Botham Jean", "type": "PROPER", "nameNorm": "botham jean", "numMentions": 2, "avgSalience": 0.43}
{ "wikipediaEntry": "United States federal judge",  "name": "federal judge", "type": "COMMON", "nameNorm": "federal judge", "numMentions": 1, "avgSalience": 0.43}
{ "wikipediaEntry": "Lawsuit",  "name": "civil lawsuit", "type": "COMMON", "nameNorm": "civil lawsuit", "numMentions": 1, "avgSalience": 0.36}
{ "wikipediaEntry": "Rule",  "name": "ruling", "type": "COMMON", "nameNorm": "rule", "numMentions": 1, "avgSalience": 0.32}
{ "wikipediaEntry": "Police brutality",  "name": "excessive force", "type": "COMMON", "nameNorm": "excessive force", "numMentions": 1, "avgSalience": 0.26}
{ "wikipediaEntry": "Murder",  "name": "murder", "type": "COMMON", "nameNorm": "murder", "numMentions": 1, "avgSalience": 0.18}
{ "wikipediaEntry": "Saint Lucia",  "name": "St. Lucia", "type": "PROPER", "nameNorm": "st. lucia", "numMentions": 1, "avgSalience": 0.09}
{ "wikipediaEntry": "Caribbean",  "name": "Caribbean island", "type": "PROPER", "nameNorm": "caribbean island", "numMentions": 1, "avgSalience": 0.09}
{ "wikipediaEntry": "Ice cream",  "name": "ice cream", "type": "COMMON", "nameNorm": "ice cream", "numMentions": 1, "avgSalience": 0.02}

Entries in the list above like "Ice cream" and "Murder" are exact matches, while "Caribbean" and "Saint Lucia" are name variants.

However, look more closely at the entities above and several stand out. A mention in the article of "excessive force" is remapped courtesy of Wikipedia to "Police brutality", while mentions of "Amber Guyger" and "Botham Jean" are both remapped to "Murder of Botham Jean". These translations are due to the incorporation of Wikipedia redirects. If you go to the English edition of Wikipedia and type in "excessive force", you will be redirected to the page on "Police brutality", while similarly, searches for "Amber Guyger" and "Botham Jean" will cause you to be redirected to the page for "Murder of Botham Jean".
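One way to take advantage of these remappings is to group entity names by their "wikipediaEntry" value, so that names sharing a page collapse together. The sketch below parses three of the sample records shown earlier; the function name `group_by_wikipedia_entry` is hypothetical, and feeding it one JSON object per string is a simplification, since how entity records are nested inside the actual G1 files is not shown here.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List

# Three entity records copied from the sample output above.
SAMPLE_RECORDS = [
    '{ "wikipediaEntry": "Murder of Botham Jean", "name": "Amber Guyger", "type": "PROPER", "nameNorm": "amber guyger", "numMentions": 3, "avgSalience": 0.52}',
    '{ "wikipediaEntry": "Murder of Botham Jean", "name": "Botham Jean", "type": "PROPER", "nameNorm": "botham jean", "numMentions": 2, "avgSalience": 0.43}',
    '{ "wikipediaEntry": "Police brutality", "name": "excessive force", "type": "COMMON", "nameNorm": "excessive force", "numMentions": 1, "avgSalience": 0.26}',
]


def group_by_wikipedia_entry(records: Iterable[str]) -> Dict[str, List[str]]:
    """Group entity names by the Wikipedia page they were linked to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        entity = json.loads(record)
        entry = entity.get("wikipediaEntry")
        if entry is None:
            continue  # entities without a Wikipedia match lack this field
        groups[entry].append(entity["name"])
    return dict(groups)


for page, names in group_by_wikipedia_entry(SAMPLE_RECORDS).items():
    print(page, "<-", names)
```

Because no contextual disambiguation is performed, groups built this way inherit whatever linking errors are present in the field.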

This contextualization, refreshed weekly from the latest English Wikipedia dump, is a capability we are particularly interested in exploring for its ability to explicitly codify the key events and relationships drawing entities together.

We're very interested in your feedback on this new capability!