Paris, Georgia and Trump Fixes

We recently made three bug fixes addressing two geographic issues and one person name extraction issue. Mapping the textual geography of the world's news media is an enormously difficult task, and contextual disambiguation plays a critical role in generating robust results. In some cases the geocoding infrastructure draws on external domain knowledge beyond that contained in a given document to make sense of ambiguous locative references. For example, Well Known Places (WKPs) are locations referenced so frequently without qualification that most readers will automatically infer the correct location in the absence of clarifying information. In most articles, a reference to "Parisian authorities yesterday…" would lead an ordinary reader, absent other contextual cues, to presume the article was referring to Paris, France. In other cases, more complex algorithms are required to disambiguate whether "Washington lawmakers" refers to Washington state or Washington, DC, or whether "Georgian citizens" refers to the US state or the country in Europe.
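To make the idea concrete, here is a minimal sketch of how a Well Known Place default might behave; the table contents, function names, and override logic are illustrative assumptions, not GDELT's actual implementation:

```python
# Minimal sketch of Well Known Place (WKP) defaulting. The table and
# logic are illustrative assumptions, not GDELT's actual implementation.

# Default resolution for names so familiar that readers infer the
# location without further qualification.
WELL_KNOWN_PLACES = {
    "paris":      ("Paris", "France", 48.8566, 2.3522),
    "washington": ("Washington", "District of Columbia, United States", 38.9072, -77.0369),
}

def resolve_place(name, context_candidates=None):
    """Resolve an ambiguous place name.

    context_candidates: locations already disambiguated elsewhere in the
    document (e.g. "Paris, Texas" spelled out in full). When present,
    they override the Well Known Place default.
    """
    if context_candidates:
        # Prefer an interpretation supported by the document's own context.
        for candidate in context_candidates:
            if candidate[0].lower() == name.lower():
                return candidate
    # Otherwise fall back to what an ordinary reader would assume.
    return WELL_KNOWN_PLACES.get(name.lower())

# "Parisian authorities yesterday..." with no other cues -> Paris, France.
print(resolve_place("Paris"))
# The same mention in an article that elsewhere says "Paris, Texas".
print(resolve_place("Paris", [("Paris", "Texas, United States", 33.6609, -95.5555)]))
```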

These special cases are handled by dedicated collections of algorithms and engines within the geocoder core, each with its own external domain knowledge stores used for disambiguation, contextual synthesis and spatial reasoning.

In the case of Paris, France, several users noticed that the latitude/longitude coordinates reported for the city were in fact slightly southeast of the city's actual location. This turned out to be caused by one of the domain geographic datasets containing a second entry for another city named "Paris" in France that slipped past our algorithmic screening when the dataset was loaded. In a second case, the US state of Georgia occasionally reported zeros for its latitude/longitude coordinates; this affected only mentions of the state itself, not mentions of specific locations within it.
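Both failure modes lend themselves to mechanical screening. The following is a rough sketch of the kind of sanity checks that could catch them at dataset load time; the record layout and check logic are assumptions for illustration, not GDELT's actual screening code:

```python
# Illustrative sanity checks for a gazetteer load; the record format
# (name, country, admin1, lat, lon) is an assumption, not GDELT's schema.

def screen_gazetteer(records):
    """Yield warnings for records resembling the two bugs described above."""
    seen = {}
    for name, country, admin1, lat, lon in records:
        # A (0, 0) coordinate is almost always a missing value, not a
        # real location (it falls in the Gulf of Guinea).
        if lat == 0.0 and lon == 0.0:
            yield f"zero coordinates: {name}, {admin1}, {country}"
        # Two same-named entries in one country can shadow each other,
        # as the second French "Paris" shadowed the capital's coordinates.
        key = (name.lower(), country.lower())
        if key in seen and (lat, lon) != seen[key]:
            yield f"duplicate name with differing coordinates: {name}, {country}"
        seen.setdefault(key, (lat, lon))

warnings = screen_gazetteer([
    ("Paris", "France", "Ile-de-France", 48.8566, 2.3522),
    ("Paris", "France", "Ile-de-France", 48.80, 2.50),   # hypothetical duplicate
    ("Georgia", "United States", "Georgia", 0.0, 0.0),   # missing coordinates
])
for w in warnings:
    print(w)
```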

Similarly, given the enormous variety of the world's person names and the need for GDELT to achieve reasonable accuracy recognizing names from every culture and location on planet Earth, it can occasionally be confused by a particular name, despite being designed to err on the side of inclusion in edge cases. In the case of Donald Trump, the last name "Trump" is sufficiently rare that the linguistic models used by the name identification engine, coupled with the grammatical and semantic contexts in which Trump's name often appears, frequently confused the person name extraction engine, so his name did not always appear in the Persons field. This is one of the reasons the AllNames field exists: it captures all capitalized phrases in each document, and Donald Trump always appears in the AllNames field even when he is missing from the Persons field.
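For anyone querying the GKG directly, the fallback is easy to apply. The sketch below assumes the GKG 2.1 convention of semicolon-delimited blocks, with each AllNames entry carrying a trailing comma-delimited character offset; verify the field layout against the GKG codebook before relying on it:

```python
# Sketch of falling back from the Persons field to AllNames. Both fields
# are assumed semicolon-delimited; AllNames entries are assumed to carry
# a trailing ",offset" giving the character position of the mention.

def parse_all_names(all_names_field):
    """Return the set of names from an AllNames field value."""
    names = set()
    for block in all_names_field.split(";"):
        if block:
            name, _, _offset = block.rpartition(",")
            names.add(name)
    return names

def find_person(persons_field, all_names_field, target):
    """Look in Persons first, then fall back to AllNames."""
    # Compare case-insensitively as a defensive choice.
    persons = {p.strip().lower() for p in persons_field.split(";") if p}
    if target.lower() in persons:
        return "Persons"
    if target in parse_all_names(all_names_field):
        return "AllNames"
    return None

# Hypothetical record where the extractor missed the name in Persons.
print(find_person("barack obama;hillary clinton",
                  "Donald Trump,1123;White House,1410", "Donald Trump"))
```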

To fix these issues, the geographic domain stores were updated for Paris and Georgia, and additional checks were made of the remaining entries. In the case of Donald Trump, the underlying language models are augmented with a Well Known Person store, and this catalog has been vastly expanded by drawing from a wide array of other catalogs of known persons from each country of the world. All of these names, including Donald Trump, will now be recognized even when the primary linguistic or contextual models do not compute a sufficiently high match probability.
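Conceptually, such a store acts as a whitelist consulted after the statistical models score a candidate. The sketch below illustrates that shape; the threshold, scores, and store contents are assumptions for illustration, not GDELT's actual models:

```python
# Illustrative Well Known Person override: a curated store that accepts
# a candidate name even when the statistical model's score falls short.
# Threshold, scores, and store contents are hypothetical.

WELL_KNOWN_PERSONS = {"donald trump", "angela merkel", "xi jinping"}
MATCH_THRESHOLD = 0.75  # hypothetical acceptance cutoff

def accept_person(candidate, model_score):
    """Accept a candidate if the model is confident OR the name is in
    the curated Well Known Person store."""
    if model_score >= MATCH_THRESHOLD:
        return True
    return candidate.lower() in WELL_KNOWN_PERSONS

# Ambiguous contexts may leave the model unsure about "Trump"...
print(accept_person("Donald Trump", 0.41))  # True via the WKP store
print(accept_person("John Smith", 0.41))    # False: low score, not in store
```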

These issues have been fixed and correct results will appear from here forward. We have gone back and updated all of the V1 and V2 Event and GKG tables in BigQuery for the Paris and Georgia issues (the missing Trump name matches can already be found through the AllNames field, and attempting to retroactively correct the Persons field would create additional complications), so those using BigQuery for historical analyses will see correct results for geographic analyses. Given the sheer size of the CSV file archive, we do not yet have an ETA on when we will be able to correct the Paris and Georgia entries in the historical CSV backfile, but in the meantime one can simply write a short Perl, Python, or similar script to correct the entries, or use BigQuery.
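For those who prefer to patch their local copies now, a script along the following lines would work. Note that the column indices, full-name strings, and the Georgia centroid below are placeholders to verify against the relevant codebook rather than known-correct values:

```python
#!/usr/bin/env python3
# Sketch of a correction pass over GDELT's tab-delimited CSV backfiles.
# The column indices and full-name strings are PLACEHOLDERS: verify them
# against the codebook for the format you hold (Events vs. GKG, V1 vs.
# V2) before running on real data. Event files carry three geo blocks
# (Actor1Geo, Actor2Geo, ActionGeo); repeat the fix for each block.
import sys

FULLNAME_COL = 36  # placeholder: index of the geo full-name column
LAT_COL = 39       # placeholder: index of the latitude column
LON_COL = 40       # placeholder: index of the longitude column

FIXES = {
    # exact full-name string as it appears in your files -> corrected lat/lon
    "Paris, France": ("48.8566", "2.3522"),
    # approximate state centroid; verify against an authoritative source
    "Georgia, United States": ("32.75042", "-83.50018"),
}

def fix_file(path_in, path_out):
    # Plain tab splitting: GDELT CSVs are tab-delimited and unquoted.
    with open(path_in, encoding="utf-8", errors="replace") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > LON_COL and cols[FULLNAME_COL] in FIXES:
                cols[LAT_COL], cols[LON_COL] = FIXES[cols[FULLNAME_COL]]
            fout.write("\t".join(cols) + "\n")

if __name__ == "__main__":
    fix_file(sys.argv[1], sys.argv[2])
```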