Custom Entity Extraction Over The News Using Web NGrams 3.0

With the new Web NGrams 3.0 dataset it is now possible to run your own custom entity extraction algorithms over global news media to semantically understand and search the news!

Performing entity extraction using Web NGrams 3.0 is a fairly simple task:

  1. For each record, concatenate the pre, ngram and post fields, inserting a space between the three fields for languages with a "type" value of 1 (space-segmented languages) and no space for type 2 languages (scriptio continua).
  2. Run your entity extraction algorithm over the resulting snippet of text. Remember that since GDELT covers 152 languages, you will need to filter the NGrams dataset to only those languages your toolkit supports.
  3. Discard any entities that appear only at the start or end of the snippet, since those may have been truncated. For example, a snippet might begin "House today the president announced…", from which your entity extraction algorithm might extract "House" as an entity, when the actual entity in the sentence was "White House" and was cut off by the snippet boundary. Skipping entities at the beginning and end of the snippet avoids such truncated entities. However, if an entity appears at the beginning of the snippet and also later in the snippet, it should be retained. For example, in the snippet "Biden announced today. Biden also said that…," the entity "Biden" appears at the start of the snippet, so would ordinarily be skipped, but since it also appears later in the snippet, it is kept. Note that skipping the entities at the start and end of the snippet does not lead to loss of any entities, since those entities will appear in their full form in other snippets.
  4. Compile the final set of entities by URL, potentially incorporating their "pos" decile information and number of appearances as a form of salience ranking. Because snippets are computed using a rolling window over the text, a one-word entity will appear in more snippets than a two-word entity, so you might want to normalize for the length of the entity. This is a tradeoff, however, since normalizing can inadvertently skew your results towards longer entities.
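
Steps 1 and 3 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the demo script's code: it assumes each record has already been parsed into a dictionary with "pre", "ngram", "post" and "type" keys, the function names build_snippet and interior_entities are hypothetical, and entities are matched by simple case-insensitive substring search rather than by the character offsets a real toolkit would provide.

```python
def build_snippet(record):
    """Step 1: join pre, ngram and post into one snippet.

    Assumes `record` is a dict with "pre", "ngram", "post" and "type" keys.
    Type 1 languages are joined with spaces, type 2 with no separator.
    """
    separator = " " if int(record["type"]) == 1 else ""
    parts = [record["pre"], record["ngram"], record["post"]]
    return separator.join(part for part in parts if part)


def interior_entities(snippet, entities):
    """Step 3: keep entities with at least one occurrence that touches
    neither the start nor the end of the snippet.

    Matching is plain substring search, so "House" would also match inside
    "Household"; a real pipeline should use the extractor's own offsets.
    """
    text = snippet.lower()
    kept = []
    for entity in entities:
        needle = entity.lower()
        if not needle:
            continue
        start = text.find(needle)
        while start != -1:
            end = start + len(needle)
            if start > 0 and end < len(text):
                kept.append(entity)
                break
            start = text.find(needle, start + 1)
    return kept


record = {"pre": "House today the", "ngram": "president",
          "post": "announced", "type": 1}
snippet = build_snippet(record)
print(snippet)
print(interior_entities(snippet, ["House", "president"]))
```

Here "House" is dropped because its only occurrence is at the very start of the snippet, while an entity such as "Biden" in "Biden announced today. Biden also said that" would be kept because of its second, interior occurrence.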

To showcase this workflow in action, we've created a simple Perl script that does all of this. Download "demo-entityextract.pl", make it executable, and install the Perl modules and supporting tools it depends on:

chmod 755 ./demo-entityextract.pl
apt-get -y install pigz
apt-get -y install curl
apt-get -y install libjson-xs-perl
apt-get -y install liblingua-en-tagger-perl

Now run the script each minute:

./demo-entityextract.pl

It will automatically download the latest Web NGrams 3.0 and GDELT Article List files, compile their entities and write the results to the "./RESULTS/" subdirectory. Remember that not all minutes have data, so you will typically see clusters of output files every 15 minutes, with gaps in between.

You can set this up to run in realtime each minute by following the cron instructions in our tutorial on keyword searching from earlier this week.

This particular demo uses the Lingua::EN::Tagger Perl module, which uses a Hidden Markov Model POS tagger. You could reimplement this workflow in any language, for example using spaCy to additionally label entities by type, add support for additional languages and take advantage of neural models.

The core workflow involves using Lingua::EN::Tagger to compile all of the noun phrases from the given snippet, discarding those that appear only at the start or end of the snippet and counting how many snippets each appears in. Rather than using the raw snippet count as the score, each appearance is weighted by its inverse decile position, meaning that two mentions in the lead paragraph of an article are ranked more highly than two mentions at the bottom of the article, in keeping with journalism's inverted pyramid structure.
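
One possible way to implement this kind of position-weighted scoring is sketched below in Python. The exact formula used by the demo script is not given here, so this is an assumption for illustration only and will not reproduce the scores in the sample output: it assumes "pos" is a decile value from 0 (top of the article) to 90 (bottom) and weights each snippet appearance by (100 - pos) / 100. The function name score_entities and the optional length normalization (multiplying by the entity's word count) are likewise hypothetical.

```python
from collections import defaultdict


def score_entities(mentions, normalize_length=False):
    """Aggregate per-snippet mentions into ranked entity lists per URL.

    `mentions` is an iterable of (url, entity, pos) tuples, one per snippet
    in which the entity survived the start/end filter. `pos` is assumed to
    be a decile from 0 (top of article) to 90 (bottom of article).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for url, entity, pos in mentions:
        # Inverse decile weight: earlier mentions count for more.
        weight = (100 - pos) / 100.0
        if normalize_length:
            # Crude compensation for the rolling window: longer entities
            # fit in fewer snippets, so scale by word count. This can skew
            # results towards longer entities.
            weight *= len(entity.split())
        scores[url][entity.lower()] += weight
    return {
        url: sorted(entities.items(), key=lambda item: item[1], reverse=True)
        for url, entities in scores.items()
    }


mentions = [
    ("https://example.com/a", "Ukraine", 0),
    ("https://example.com/a", "Ukraine", 0),
    ("https://example.com/a", "Kyiv", 90),
    ("https://example.com/a", "Kyiv", 90),
]
for url, ranked in score_entities(mentions).items():
    print(url, ranked)
```

In this toy input both entities appear in two snippets, but "ukraine" ranks well above "kyiv" because its mentions fall in the first decile of the article rather than the last.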

As with our keyword searching example, the resulting entity lists are then merged with the GDELT Article List. You can see an example of the output of the demo script below. Note that this example also showcases one key nuance of using the NGrams dataset: news outlets today rewrite their coverage constantly, in some cases performing wholesale replacement of the underlying text. Thus, some of the entities you see below no longer appear in the revised version of this AP story on the Bozeman Daily Chronicle's website today, as the article was subsequently rewritten to deemphasize certain angles of the original report.

{
"date": "2022-01-14T12:01:00.000Z",
"url": "https://www.bozemandailychronicle.com/ap_news/international/ukraines-government-websites-targeted-in-a-hacking-attack/article_b4359288-47d0-57f0-b57e-3543b5b5c5b5.html",
"domain": "bozemandailychronicle.com",
"outletName": "Bozeman Daily Chronicle",
"outletLogo": "https://www.bozemandailychronicle.com/content/tncms/site/icon.ico",
"outletTwitter": "@bozchron",
"title": "Ukraine's government websites targeted in a hacking attack",
"image": "https://bloximages.chicago2.vip.townnews.com/bozemandailychronicle.com/content/tncms/assets/v3/editorial/6/f7/6f7ac277-47dc-58c6-8afb-a6b7b8e86f72/61e15f301d31b.image.jpg?crop=1763%2C926%2C0%2C124&resize=1200%2C630&order=crop%2Cresize",
"desc": "KYIV, Ukraine (AP) -- A number of government websites in Ukraine were temporarily down on Friday after a huge hacking attack, Ukrainian officials said.",
"lang": "en",
"author": "YURAS KARMANAU Associated Press",
"entities": [{
"entity": "ukraine",
"score": 73.7
}, {
"entity": "websites",
"score": 58.9
}, {
"entity": "moscow",
"score": 45
}, {
"entity": "attacks",
"score": 35
}, {
"entity": "data",
"score": 28.6
}, {
"entity": "friday",
"score": 27.3
}, {
"entity": "personal data",
"score": 26.4
}, {
"entity": "russia",
"score": 26
}, {
"entity": "country",
"score": 24.1
}, {
"entity": "ukrainians",
"score": 22.4
}, {
"entity": "russian",
"score": 22.1
}, {
"entity": "service",
"score": 20.3
}, {
"entity": "state",
"score": 19.5
}, {
"entity": "west",
"score": 19.5
}, {
"entity": "message",
"score": 19.5
}, {
"entity": "talks",
"score": 19.4
}, {
"entity": "week",
"score": 18.8
}, {
"entity": "security",
"score": 18.7
}, {
"entity": "cyberattacks",
"score": 18.3
}, {
"entity": "nato",
"score": 18.2
}, {
"entity": "part",
"score": 18.2
}, {
"entity": "fedorov",
"score": 18
}, {
"entity": "progress",
"score": 17.3
}, {
"entity": "attack",
"score": 16.4
}, {
"entity": "cyber",
"score": 14.8
}, {
"entity": "u.s.",
"score": 13.4
}, {
"entity": "officials",
"score": 13
}, {
"entity": "tensions",
"score": 13
}, {
"entity": "hacking attack",
"score": 13
}, {
"entity": "foreign",
"score": 12.4
}, {
"entity": "meeting",
"score": 12.3
}, {
"entity": "ministry",
"score": 12.3
}, {
"entity": "washington",
"score": 12.2
}, {
"entity": "spokesman",
"score": 12.2
}, {
"entity": "oleg",
"score": 12.1
}, {
"entity": "nikolenko",
"score": 12
}, {
"entity": "government",
"score": 12
}, {
"entity": "ukrainian officials",
"score": 12
}, {
"entity": "heightened tensions",
"score": 12
}, {
"entity": "huge hacking attack",
"score": 12
}, {
"entity": "significant progress",
"score": 11.9
}, {
"entity": "conclusions",
"score": 11.7
}, {
"entity": "press",
"score": 11.7
}, {
"entity": "associated",
"score": 11.7
}, {
"entity": "investigation",
"score": 11.7
}, {
"entity": "assaults",
"score": 11.7
}, {
"entity": "involvement",
"score": 11.7
}, {
"entity": "cabinet",
"score": 11.6
}, {
"entity": "ministries",
"score": 11.4
}, {
"entity": "website",
"score": 11.2
}, {
"entity": "treasury",
"score": 11.2
}, {
"entity": "europe",
"score": 11.1
}, {
"entity": "oleg nikolenko",
"score": 11.1
}, {
"entity": "number",
"score": 11
}, {
"entity": "government websites",
"score": 11
}, {
"entity": "talks between moscow",
"score": 11
}, {
"entity": "associated press",
"score": 10.8
}, {
"entity": "long record",
"score": 10.8
}, {
"entity": "record",
"score": 10.8
}, {
"entity": "cyber assaults",
"score": 10.8
}, {
"entity": "domain",
"score": 10.8
}, {
"entity": "seven ministries",
"score": 10.6
}, {
"entity": "services",
"score": 10.4
}, {
"entity": "result",
"score": 10.4
}, {
"entity": "passports",
"score": 10.4
}, {
"entity": "services website",
"score": 10.4
}, {
"entity": "emergency",
"score": 10.4
}, {
"entity": "vaccination",
"score": 10.4
}, {
"entity": "national",
"score": 10.4
}, {
"entity": "certificates",
"score": 10.4
}, {
"entity": "eu",
"score": 10.3
}, {
"entity": "spokesman oleg nikolenko",
"score": 10.2
}, {
"entity": "public domain",
"score": 10.1
}, {
"entity": "a number",
"score": 10
}, {
"entity": "a",
"score": 10
}, {
"entity": "heightened tensions with russia",
"score": 10
}, {
"entity": "russian cyber assaults",
"score": 9.9
}, {
"entity": "significant progress this week",
"score": 9.9
}, {
"entity": "transformation",
"score": 9.8
}, {
"entity": "emergency service",
"score": 9.6
}, {
"entity": "unavailable friday",
"score": 9.6
}, {
"entity": "state services website",
"score": 9.6
}, {
"entity": "vaccination certificates",
"score": 9.6
}, {
"entity": "electronic passports",
"score": 9.6
}, {
"entity": "borrell",
"score": 9.6
}, {
"entity": "registries",
"score": 9.5
}, {
"entity": "ministry spokesman oleg nikolenko",
"score": 9.3
}, {
"entity": "websites of the country",
"score": 9.3
}, {
"entity": "communication",
"score": 9.1
}, {
"entity": "protection",
"score": 9.1
}, {
"entity": "minister",
"score": 9.1
}, {
"entity": "operability",
"score": 9.1
}, {
"entity": "mykhailo",
"score": 9.1
}, {
"entity": "digital transformation",
"score": 9.1
}, {
"entity": "information",
"score": 9.1
}, {
"entity": "mykhailo fedorov",
"score": 9.1
}, {
"entity": "resources",
"score": 9.1
}, {
"entity": "ap",
"score": 9
}, {
"entity": "countries",
"score": 8.8
}, {
"entity": "national emergency service",
"score": 8.8
}, {
"entity": "state service",
"score": 8.4
}, {
"entity": "information protection",
"score": 8.4
}, {
"entity": "foreign ministry spokesman oleg nikolenko",
"score": 8.4
}, {
"entity": "conclusions as the investigation",
"score": 8.1
}, {
"entity": "involvement in cyberattacks against ukraine",
"score": 8.1
}, {
"entity": "invasion",
"score": 8
}, {
"entity": "administrators",
"score": 7.8
}, {
"entity": "buildup",
"score": 7.8
}, {
"entity": "damage",
"score": 7.8
}, {
"entity": "order",
"score": 7.8
}, {
"entity": "estimates",
"score": 7.8
}, {
"entity": "large part",
"score": 7.8
}, {
"entity": "troops",
"score": 7.8
}, {
"entity": "minister for digital transformation",
"score": 7.7
}, {
"entity": "operability of the websites",
"score": 7.7
}, {
"entity": "fears",
"score": 7.7
}, {
"entity": "cooperation",
"score": 7.6
}, {
"entity": "unavailable friday as a result",
"score": 7.2
}, {
"entity": "stoked fears",
"score": 7.2
}, {
"entity": "estimates russia",
"score": 7.2
}, {
"entity": "affected websites",
"score": 7.2
}, {
"entity": "attacked websites",
"score": 7.2
}, {
"entity": "state service of communication",
"score": 7
}, {
"entity": "forces",
"score": 7
}, {
"entity": "union",
"score": 7
}, {
"entity": "month",
"score": 7
}, {
"entity": "plans",
"score": 6.8
}, {
"entity": "u.s. estimates russia",
"score": 6.6
}, {
"entity": "100,000 troops near ukraine",
"score": 6.6
}, {
"entity": "administrators in order",
"score": 6.6
}, {
"entity": "precluding",
"score": 6.5
}, {
"entity": "right",
"score": 6.5
}, {
"entity": "guarantees",
"score": 6.5
}, {
"entity": "precluding nato",
"score": 6.5
}, {
"entity": "draft",
"score": 6.5
}, {
"entity": "last month",
"score": 6.5
}, {
"entity": "kremlin",
"score": 6.5
}, {
"entity": "expansion",
"score": 6.5
}, {
"entity": "membership",
"score": 6.5
}, {
"entity": "demand",
"score": 6.5
}, {
"entity": "deploy",
"score": 6.5
}, {
"entity": "documents",
"score": 6.5
}, {
"entity": "alliance",
"score": 6.2
}, {
"entity": "roll",
"score": 6
}, {
"entity": "kyiv",
"score": 6
}, {
"entity": "security guarantees",
"score": 6
}, {
"entity": "west precluding nato",
"score": 6
}, {
"entity": "security documents",
"score": 6
}, {
"entity": "stoked fears of an invasion",
"score": 6
}, {
"entity": "soviet countries",
"score": 5.8
}, {
"entity": "paper",
"score": 5.8
}, {
"entity": "pledges",
"score": 5.6
}, {
"entity": "deployments",
"score": 5.5
}, {
"entity": "draft security documents",
"score": 5.5
}, {
"entity": "demanded security guarantees",
"score": 5.5
}, {
"entity": "brest",
"score": 5.5
}, {
"entity": "former soviet countries",
"score": 5.4
}, {
"entity": "central",
"score": 5.3
}, {
"entity": "member",
"score": 5.2
}, {
"entity": "allies",
"score": 5.2
}, {
"entity": "representatives",
"score": 5.2
}, {
"entity": "such pledges",
"score": 5.2
}, {
"entity": "eastern europe",
"score": 5.2
}, {
"entity": "eastern",
"score": 5.2
}, {
"entity": "organization",
"score": 5.2
}, {
"entity": "military deployments",
"score": 5.1
}, {
"entity": "other former soviet countries",
"score": 5
}, {
"entity": "nato representatives",
"score": 4.8
}, {
"entity": "high-stakes talks",
"score": 4.8
}, {
"entity": "online",
"score": 4.8
}, {
"entity": "online paper",
"score": 4.8
}, {
"entity": "cooperation in europe",
"score": 4.6
}, {
"entity": "meeting of russia",
"score": 4.4
}, {
"entity": "immediate progress",
"score": 4.2
}, {
"entity": "military deployments in central",
"score": 4.1
}, {
"entity": "port",
"score": 3.9
}, {
"entity": "assistance",
"score": 3.9
}, {
"entity": "city",
"score": 3.9
}, {
"entity": "capacity",
"score": 3.9
}, {
"entity": "bloc",
"score": 3.9
}, {
"entity": "ministers",
"score": 3.9
}, {
"entity": "josep",
"score": 3.9
}, {
"entity": "diplomat",
"score": 3.9
}, {
"entity": "european union",
"score": 3.9
}, {
"entity": "top diplomat",
"score": 3.6
}, {
"entity": "josep borrell",
"score": 3.6
}, {
"entity": "port city",
"score": 3.6
}, {
"entity": "foreign ministers",
"score": 3.6
}, {
"entity": "technical assistance",
"score": 3.6
}, {
"entity": "eu foreign ministers",
"score": 3.3
}, {
"entity": "french port city",
"score": 3.3
}, {
"entity": "teams",
"score": 3.1
}, {
"entity": "pesco",
"score": 3
}, {
"entity": "response teams",
"score": 2.9
}, {
"entity": "response",
"score": 2.9
}, {
"entity": "friday that the 27-nation bloc",
"score": 2.7
}, {
"entity": "rapid response teams",
"score": 2.7
}, {
"entity": "framework",
"score": 2.6
}, {
"entity": "benefit",
"score": 2.6
}, {
"entity": "anti-cyberattacks",
"score": 2.6
}, {
"entity": "cyber rapid response teams",
"score": 2.5
}, {
"entity": "structured cooperation",
"score": 2.4
}, {
"entity": "member countries",
"score": 2.4
}, {
"entity": "ukraine benefit",
"score": 2.4
}, {
"entity": "anti-cyberattacks resources",
"score": 2.4
}, {
"entity": "county",
"score": 2.2
}, {
"entity": "member of the union",
"score": 2.2
}, {
"entity": "permanent structured cooperation",
"score": 2.2
}, {
"entity": "eu permanent structured cooperation",
"score": 2
}, {
"entity": "russian cyber assaults against ukraine",
"score": 1.8
}, {
"entity": "report",
"score": 1.4
}, {
"entity": "gallatin",
"score": 1.3
}, {
"entity": "support",
"score": 1.3
}, {
"entity": "france",
"score": 1.3
}, {
"entity": "dasha",
"score": 1.3
}, {
"entity": "litvinova",
"score": 1.3
}, {
"entity": "proof",
"score": 1.3
}, {
"entity": "anybody",
"score": 1.3
}, {
"entity": "point",
"score": 1.3
}, {
"entity": "journalism",
"score": 1.3
}, {
"entity": "gallatin county",
"score": 1.2
}, {
"entity": "local journalism",
"score": 1.2
}, {
"entity": "dasha litvinova",
"score": 1.2
}, {
"entity": "gaschka",
"score": 1.2
}, {
"entity": "catherine",
"score": 1.2
}, {
"entity": "support local journalism",
"score": 1.1
}, {
"entity": "catherine gaschka",
"score": 1.1
}, {
"entity": "point at anybody",
"score": 1.1
}, {
"entity": "government websites in ukraine",
"score": 1
}, {
"entity": "a number of government websites",
"score": 1
}, {
"entity": "catherine gaschka in brest",
"score": 1
}, {
"entity": "dasha litvinova in moscow",
"score": 1
}, {
"entity": "involvement in cyberattacks",
"score": 0.9
}, {
"entity": "meeting of eu foreign ministers",
"score": 0.9
}, {
"entity": "organization for security",
"score": 0.8
}, {
"entity": "cyberattacks against ukraine",
"score": 0.8
}, {
"entity": "100,000 troops",
"score": 0.6
}, {
"entity": "french port city of brest",
"score": 0.6
}, {
"entity": "u.s",
"score": 0.6
}, {
"entity": "27-nation bloc",
"score": 0.6
}, {
"entity": "ukraine benefit from anti-cyberattacks resources",
"score": 0.4
}, {
"entity": "high-stakes talks this week",
"score": 0.4
}, {
"entity": "week between moscow",
"score": 0.4
}, {
"entity": "meeting at the organization",
"score": 0.4
}, {
"entity": "meeting of eu",
"score": 0.3
}]
}