The GDELT Project

Neural Entity Extraction, Disambiguation, Sentiment, Salience, Knowledge Graph Linking And Contextualization Via Web NGrams 3.0

Earlier this week we showed how to use the new Web NGrams 3.0 dataset to perform custom entity extraction over the news by applying a part-of-speech tagger to the contextual snippets. For more advanced applications, what if we want to apply a neural entity extractor that doesn't just extract entities, but also disambiguates them, ties them into a knowledge graph, computes their sentiment and salience, and contextualizes each reference, either locally through surrounding terms or globally through article categorization? We can do all of that and more using the same workflow as yesterday's document-level neural categorization example!

First, let's look at what happens if we just use the workflow from our custom entity extraction example, which analyzed the ngrams at a snippet level. Let's say we wanted to understand global news sentiment towards Tesla on a given day. We would first scan the Web NGrams 3.0 dataset for all records where "ngram" is "Tesla" and then run each snippet through Google's Natural Language API. For example, take this ngrams record from "Babylon Again Delivers Significant Growth to Reach Over 440,000 Managed Lives Globally, After Signing New US Value-Based Care Agreements":

{
  "date": "2022-01-14T12:01:00.000Z",
  "ngram": "Tesla,",
  "lang": "ENGLISH",
  "pos": 50,
  "pre": "disruptive digital innovators such as Amazon, Netflix,",
  "post": "or Airbnb, who also experienced similar growth",
  "url": "https://www.wallstreet-online.de/nachricht/14907797-babylon-again-delivers-significant-growth-to-reach-over-440-000-managed-lives-globally-after-signing-new-us-value-based-care-agreements"
}

Concatenating the pre+ngram+post fields together, we get the following snippet:

disruptive digital innovators such as Amazon, Netflix, Tesla, or Airbnb, who also experienced similar growth
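This scan-and-reconstruct step can be sketched in a few lines of Python. The function below is a minimal illustration, not an official GDELT tool: it assumes the newline-delimited JSON layout and field names shown in the record above, and strips trailing punctuation from the "ngram" field (which the dataset preserves, e.g. "Tesla,") before matching:

```python
import json

def tesla_snippets(lines, target="Tesla"):
    """Yield (url, snippet) pairs for Web NGrams 3.0 records matching the target.

    `lines` is an iterable of newline-delimited JSON records. The snippet is
    rebuilt by concatenating the pre, ngram, and post fields with spaces.
    """
    for line in lines:
        rec = json.loads(line)
        # The ngram field keeps adjacent punctuation, so strip it before comparing.
        if rec.get("ngram", "").strip(".,;:!?") == target:
            snippet = " ".join([rec["pre"], rec["ngram"], rec["post"]])
            yield rec["url"], snippet
```

Feeding the record above through this function reproduces the snippet shown.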

Running this through the NLP API we get:

{
  "entities": [
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 19,
            "content": "innovators"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "innovators",
      "salience": 0.56324583,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 55,
            "content": "Tesla"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/0dr90d",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Tesla,_Inc."
      },
      "name": "Tesla",
      "salience": 0.10439751,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "ORGANIZATION"
    },
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 38,
            "content": "Amazon"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Amazon",
      "salience": 0.10316856,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "ORGANIZATION"
    },
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 46,
            "content": "Netflix"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/017rf_",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Netflix"
      },
      "name": "Netflix",
      "salience": 0.10316856,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "ORGANIZATION"
    },
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 102,
            "content": "growth"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "growth",
      "salience": 0.09353214,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "sentiment": {
            "magnitude": 0.1,
            "score": 0.1
          },
          "text": {
            "beginOffset": 65,
            "content": "Airbnb"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/010qmszp",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Airbnb"
      },
      "name": "Airbnb",
      "salience": 0.032487407,
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      },
      "type": "OTHER"
    }
  ],
  "language": "en"
}

Note how it picks up Tesla, connecting it to both its Wikipedia article and its MID code. Based on this snippet alone, Tesla has relatively low salience and emotional magnitude, but the reference is mildly positive.
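Pulling a given entity's scores out of this response is straightforward. The helper below is a small illustrative sketch (the function name is ours, not part of any API) that walks the "entities" array shown above:

```python
def entity_scores(response, name):
    """Return (salience, sentiment score, Wikipedia URL) for the named entity
    in an analyze-entity-sentiment response, or None if the entity is absent.
    """
    for ent in response.get("entities", []):
        if ent["name"] == name:
            return (ent["salience"],
                    ent["sentiment"]["score"],
                    # metadata is empty for entities the API cannot disambiguate
                    ent.get("metadata", {}).get("wikipedia_url"))
    return None
```

Applied to the response above, it returns Tesla's salience of 0.10439751, its sentiment score of 0.1, and its Wikipedia URL.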

At the same time, since the API is examining only this snippet of text in isolation, the salience and sentiment scores do not reflect the reality of the article as a whole, as evidenced by the inflated salience scores (the generic term "innovators" dominates the short snippet at 0.56). What we need is to let the Natural Language API analyze the document as a whole, so it can see the full range of contexts in which each entity appears, just as we did in yesterday's neural categorization example.

In fact, we can simply take the Perl script from yesterday's neural categorization example and make a single change to do all of this!

Just change the command below:

gcloud ml language classify-text --content-file=./tmp.txt > out.json

to this:

gcloud ml language analyze-entity-sentiment --content-file=./tmp.txt > out.json

That's it! You'll now get the results above, but computed at an article level.

This time, the API computes that the most salient entities are "revenue," "company," "revenue expectations," "Babylon," "JP Morgan Conference" and "Ali Parsa." In contrast, "Amazon," "Netflix," "Tesla," and "Airbnb" all share the same salience score, an order of magnitude lower. This matches almost exactly what we get from applying the API to the original fulltext: while the raw scores differ, the relative ordering and magnitude differences are nearly identical, as is the ordering of the sentiment scores.

Disambiguation also performs accurately, correctly identifying that "Amazon" refers in this context to the company rather than the river, that "Tesla" refers to the carmaker rather than the inventor, and so on. You could also invoke the API for both entity sentiment and categorization on each article, then use the category tags as an additional relevance filter, matching only Tesla mentions in articles about the electric vehicle industry, for example.
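That category-based filter can be sketched as a small predicate combining the two API responses. This is an illustrative assumption of how you might wire it up: the MID is the one the API returned for Tesla above, while the category prefix is only an example of a plausible classify-text label, not a confirmed match rule:

```python
def tesla_in_ev_article(entities_resp, classify_resp,
                        mid="/m/0dr90d",
                        category_prefix="/Autos & Vehicles"):
    """Accept a Tesla match only when the article's category tags confirm
    an automotive context. Both arguments are parsed gcloud JSON responses:
    one from analyze-entity-sentiment, one from classify-text.
    """
    # Match on the MID rather than the surface string, so the disambiguated
    # entity (Tesla, Inc.) is required, not just the word "Tesla".
    has_tesla = any(
        ent.get("metadata", {}).get("mid") == mid
        for ent in entities_resp.get("entities", []))
    in_category = any(
        cat["name"].startswith(category_prefix)
        for cat in classify_resp.get("categories", []))
    return has_tesla and in_category
```

Articles mentioning Tesla outside an automotive context (say, a healthcare earnings report like the example above) would then be filtered out.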