Experiments In Entity Extraction Using LLMs: Hallucination & How A Single Apostrophe Can Change The Results

Last month we explored LLM-based entity extraction, with mixed results. In theory, LLMs should offer vastly superior entity identification, segmentation and disambiguation/normalization due to their vast linguistic knowledge. In practice, however, their instability, inconsistency and inability to normalize make them difficult to utilize in real-world applications. Let's explore LLM entity extraction in more detail through the eyes of PaLM 2's Bison model.

The results are highly unexpected: despite their vastly larger linguistic archives, LLM entity extractors are actually massively more brittle and unpredictable than traditional neural and statistical extractors. The appearance of a single apostrophe can completely change the list of extracted entities, including entities unrelated to the word it is added to, and the order of names in a sentence can determine whether a given name is correctly or incorrectly disambiguated. Correcting a single typo in a text can yield very different extracted entities for a range of names in the text unrelated to the corrected word. Hallucinations and incorrect disambiguations vary based on whether another name appears before or after the given name. As with the classical NMT efforts that preceded them, but unlike classical neural and statistical entity extractors, even the smallest of changes can yield dramatically different results from an LLM entity extractor. The results here suggest significant caution is warranted when exploring LLMs for extraction tasks.

We can invoke the PaLM API via the following (replace "[YOURPROJECTID]" with your GCP project ID). Here we pipe the results through jq to extract just the response field and to prettify it for display. Given that the knowledge cutoff of many LLMs is 2021, let's provide a sample text mentioning the president-elect of Paraguay. Real-world entity extraction systems gain most of their power from their ability to disambiguate and normalize names, understanding the difference between a mention of "Paris" referencing "Paris, France", "Paris, Illinois", "Paris Hilton", the "Hilton in Paris" or any of myriad other possibilities. To replicate this, we've asked the LLM to provide the Wikipedia entry for each name as an equivalent form of normalization.

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
  "instances": [
    { "prompt": "Extract a list of the names in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article and outputting the results in JSON in the format \\"{name, wikipedia url}\\" :  On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.  "}
  ],
  "parameters": {
    "temperature": 0.99,
    "maxOutputTokens": 700,
    "topK": 40,
    "topP": 0.95
  }
}'  | jq -r '.predictions[0].content'
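The hand-escaped request body above can also be assembled programmatically, which avoids having to escape the embedded quotes and apostrophes by hand. The following is only a minimal sketch: `build_request_body` is a hypothetical helper name, the field names are taken from the curl call above, and authentication and the HTTP request itself are left to curl (for example by piping the output to `curl ... -d @-`).

```python
import json

def build_request_body(prompt, max_output_tokens=700, temperature=None,
                       top_k=None, top_p=None):
    """Return the JSON request body as a string.

    Only parameters that are explicitly set are included, so leaving
    temperature, top_k and top_p as None falls back to the API defaults.
    """
    parameters = {"maxOutputTokens": max_output_tokens}
    if temperature is not None:
        parameters["temperature"] = temperature
    if top_k is not None:
        parameters["topK"] = top_k
    if top_p is not None:
        parameters["topP"] = top_p
    body = {"instances": [{"prompt": prompt}], "parameters": parameters}
    # json.dumps handles quote escaping and never emits a trailing comma.
    return json.dumps(body, ensure_ascii=False)

prompt = (
    "Extract a list of the names in the following text, disambiguating and "
    "normalizing each name to the title of its corresponding Wikipedia article "
    'and outputting the results in JSON in the format "{name, wikipedia url}" :  '
    "On 30 April, Santiago Peña, from the ruling Colorado Party, "
    "won the presidential elections.  "
)
print(build_request_body(prompt, temperature=0.99, top_k=40, top_p=0.95))
```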

Let's run it 10 times in a row and see how the results turn out. As you can see from the results below, the output is extremely unstable due to our high temperature setting (which maximizes randomness / "creativity"). Some outputs are wrapped in triple-ticks, some in triple-ticks plus "json", others in braces and others in brackets. Six of the ten outputs are wrapped in prefix/suffix content that makes them invalid JSON and thus unparseable by downstream processors. This means that any LLM-based workflows that desire machine-readable output will need to utilize flexible prefiltering to remove these. One of the outputs violates the requested two-field output.
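As a rough illustration of the kind of prefiltering this implies, the sketch below strips an optional triple-tick wrapper before parsing and then coerces the differing shapes into one record format. The helper names `parse_llm_json` and `normalize_records` are hypothetical, and the code only handles the variations described here; other wrappers would need their own handling.

```python
import json
import re

TICKS = "`" * 3  # the triple-tick fence marker
# An optional fence with an optional language tag (e.g. "json") around the payload.
FENCE_RE = re.compile(
    r"^\s*" + TICKS + r"[A-Za-z]*\s*\n(.*?)\n\s*" + TICKS + r"\s*$", re.DOTALL
)

def parse_llm_json(raw):
    """Strip a surrounding triple-tick fence, if any, then parse the rest as JSON."""
    match = FENCE_RE.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)

def normalize_records(parsed):
    """Coerce the observed output shapes into a list of
    {"name", "wikipedia_url"} dicts."""
    if isinstance(parsed, dict):
        # The {"Name": "URL"} mapping variant that drops the two-field format.
        return [{"name": n, "wikipedia_url": u} for n, u in parsed.items()]
    records = []
    for item in parsed:
        # The key is sometimes "wikipedia url" and sometimes "wikipedia_url".
        url = item.get("wikipedia_url", item.get("wikipedia url"))
        records.append({"name": item["name"], "wikipedia_url": url})
    return records

raw = TICKS + 'json\n[{"name": "Colorado Party", "wikipedia url": '
raw += '"https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"}]\n' + TICKS
print(normalize_records(parse_llm_json(raw)))
```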

All ten outputs correctly extract the "Colorado Party" and link to its correct Wikipedia page. In contrast, while all ten outputs correctly extract Santiago Peña's name, they hallucinate various Wikipedia URLs including "Santiago_Peña_(politician)", "Santiago_P%C3%A9%C3%B1a_(politician)", "Santiago_P%C3%A9na_(politician)" (accent on the e rather than the n), "Santiago_Peña_(Paraguayan_politician)" and "Santiago_P%C3%A1%C3%B1a_(politician)" (changing the e to an a). Just three of the ten outputs (30%) yield the correct Wikipedia URL of "Santiago_Peña". This is problematic: when asked to normalize mentions of the Paraguayan president that it sees across news articles, the LLM will output a variety of different normalized forms for the same person.

```
[
  {"name": "Santiago Peña", "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña_(politician)"},
  {"name": "Colorado Party", "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"}
]
```

```json
[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a_(politician)"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]
```

[
  {"name": "Santiago Peña", "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a_(politician)"},
  {"name": "Colorado Party", "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"}
]

```json
[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña_(politician)"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]
```

```json
[
  { "name": "Santiago Peña", "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9na_(politician)" },
  { "name": "Colorado Party", "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)" }
]
```

{
  "Santiago Peña": "https://en.wikipedia.org/wiki/Santiago_Peña",
  "Colorado Party": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
}

```
[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]
```

[
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]

[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña_(Paraguayan_politician)"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]

```json
[
  {"name": "Santiago Peña", "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A1%C3%B1a_(politician)"},
  {"name": "Colorado Party", "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"}
]
```
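To make the spread of normalized forms concrete, the URLs returned for Santiago Peña in the ten runs above can be percent-decoded and tallied. This is only an illustrative sketch: the list is transcribed by hand from the outputs above and `title_of` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import unquote

PREFIX = "https://en.wikipedia.org/wiki/"
CORRECT = "Santiago_Peña"

# URLs returned for "Santiago Peña" in the ten runs above, in order.
urls = [
    PREFIX + "Santiago_Peña_(politician)",
    PREFIX + "Santiago_P%C3%A9%C3%B1a_(politician)",
    PREFIX + "Santiago_P%C3%A9%C3%B1a_(politician)",
    PREFIX + "Santiago_Peña_(politician)",
    PREFIX + "Santiago_P%C3%A9na_(politician)",
    PREFIX + "Santiago_Peña",
    PREFIX + "Santiago_Peña",
    PREFIX + "Santiago_Peña",
    PREFIX + "Santiago_Peña_(Paraguayan_politician)",
    PREFIX + "Santiago_P%C3%A1%C3%B1a_(politician)",
]

def title_of(url):
    """Drop the site prefix and percent-decode the remaining article title."""
    if url.startswith(PREFIX):
        url = url[len(PREFIX):]
    return unquote(url)

tally = Counter(title_of(u) for u in urls)
for title, count in tally.most_common():
    print(count, title)
print("distinct forms:", len(tally), "| correct:", tally[CORRECT], "of", len(urls))
```

Even after decoding, the ten runs produce six distinct titles, so decoding alone does not collapse them into a single normalized form.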

Of course, we used an extremely high temperature setting in those examples, which vastly increases the model's creativity. What if we instead leave all parameters at their defaults, other than increasing the output token maximum? The current default value of temperature is 0, meaning these requests should yield exactly the same results each time. Note that a temperature of at least 0.2 is typically recommended for true generative applications such as summarization and ideation, with 0.0 recommended for factual tasks.

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
  "instances": [
    { "prompt": "Extract a list of the names in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article and outputting the results in JSON in the format \\"{name, wikipedia url}\\" :  On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.  "}
  ],
  "parameters": {
    "maxOutputTokens": 700
  }
}'  | jq -r '.predictions[0].content'

This time the results are, as expected, identical for each run, and correct.

[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]

[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]

[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]

[
  {
    "name": "Santiago Peña",
    "wikipedia url": "https://en.wikipedia.org/wiki/Santiago_Peña"
  },
  {
    "name": "Colorado Party",
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"
  }
]
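Rather than comparing repeated runs by eye, they can be parsed and compared in a canonical form, so that differences in whitespace or key order do not count as differences. A minimal sketch, assuming every run returned valid JSON as in the outputs above; `canonical` and `all_runs_identical` are hypothetical helper names.

```python
import json

def canonical(raw):
    """Parse a JSON string and re-serialize it with sorted keys and no padding."""
    return json.dumps(json.loads(raw), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)

def all_runs_identical(raw_outputs):
    """True if every run parses to the same JSON value."""
    return len({canonical(raw) for raw in raw_outputs}) == 1

run_a = '[{"name": "Colorado Party", "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)"}]'
run_b = '''[
  {
    "wikipedia url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "name": "Colorado Party"
  }
]'''
print(all_runs_identical([run_a, run_b]))
```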

What about a more complex example with more entities mentioned?

time curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
  "instances": [
    { "prompt": "Extract a list of the people, organizations, locations and other entities in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article. Classify the Type of each as one of the following: \\"Organization,Person,Location,Event,Other\\". Output the results in JSON in the format \\"{name, wikipedia url, type}\\" :  Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius and NATOs continuing work to strengthen Eastern Flank defense, counter Russian disinformation, and diplomatically isolate Russia at the last Washington meeting. They also discussed ways to deepen U.S.-Latvia coordination in the Indo-Pacific and counter PRC economic coercion. "}
  ],
  "parameters": {
    "temperature": 0.0,
    "maxOutputTokens": 700,
    "topK": 1,
    "topP": 0.0
  }
}'  | jq -r '.predictions[0].content'

This yields the following, which seems reasonable:

[
  {
    "name": "Deputy Secretary of State Wendy Sherman",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman",
    "type": "Person"
  },
  {
    "name": "Latvian Foreign Minister Edgars Rinkevics",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Edgars_Rinkevicis",
    "type": "Person"
  },
  {
    "name": "Washington, D.C.",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C.",
    "type": "Location"
  },
  {
    "name": "NATO Summit",
    "wikipedia_url": "https://en.wikipedia.org/wiki/NATO_Summit",
    "type": "Event"
  },
  {
    "name": "Vilnius",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Vilnius",
    "type": "Location"
  },
  {
    "name": "NATO",
    "wikipedia_url": "https://en.wikipedia.org/wiki/NATO",
    "type": "Organization"
  },
  {
    "name": "Russia",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Russia",
    "type": "Country"
  },
  {
    "name": "U.S.",
    "wikipedia_url": "https://en.wikipedia.org/wiki/United_States",
    "type": "Country"
  },
  {
    "name": "Latvia",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Latvia",
    "type": "Country"
  },
  {
    "name": "Indo-Pacific",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Indo-Pacific",
    "type": "Region"
  },
  {
    "name": "PRC",
    "wikipedia_url": "https://en.wikipedia.org/wiki/People%27s_Republic_of_China",
    "type": "Country"
  }
]

The API response also includes information about the total input and output tokens and billable characters. In this case we can see that the API returned 594 output tokens, which is well below our output cap of 700 tokens:

"metadata": {
    "tokenMetadata": {
      "inputTokenCount": {
        "totalTokens": 174,
        "totalBillableCharacters": 739
      },
      "outputTokenCount": {
        "totalTokens": 594,
        "totalBillableCharacters": 1305
      }
    }
}

What if we increase the max token output to 1024 by changing "maxOutputTokens"? The results remain the same, as expected. However, notice how the source text includes the word "NATOs": a typographical error in the original text, which lacks the possessive apostrophe. What if we make only a single change to the text and add in that apostrophe? That is the only change we make to the input, the addition of a single solitary apostrophe:

time curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
  "instances": [
    { "prompt": "Extract a list of the people, organizations, locations and other entities in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article. Classify the Type of each as one of the following: \\"Organization,Person,Location,Event,Other\\". Output the results in JSON in the format \\"{name, wikipedia url, type}\\" :  Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius and NATO\'s continuing work to strengthen Eastern Flank defense, counter Russian disinformation, and diplomatically isolate Russia at the last Washington meeting. They also discussed ways to deepen U.S.-Latvia coordination in the Indo-Pacific and counter PRC economic coercion. "}
  ],
  "parameters": {
    "temperature": 0.0,
    "maxOutputTokens": 1024,
    "topK": 1,
    "topP": 0.0
  }
}'  | jq -r '.predictions[0].content'

The new results can be seen below. Latvia and China disappear from the results. Two new legitimate Wikipedia entities are output: "Russian_disinformation" and "U.S.-Latvia_relations". However, hallucinated Wikipedia entries "Eastern_Flank_of_NATO" and "Chinese_economic_coercion" are also output. Remember that the only change between these two texts was the addition of a single apostrophe to correct a typographical error in the original text.

The output still fails to add "Washington" or "DC" to its entity list. While China's "economic coercion" appears, China itself does not appear as a standalone item and, similarly, Latvia does not appear on its own either, even though both appeared in our earlier version. Thus, while adding these more complex phrases, it drops more important anchoring entities. A downstream search engine relying on this list of extracted entities would therefore not associate the article with Latvia or China.

The ability of a single apostrophe to so dramatically change the output is highly concerning and offers a reminder of just how brittle LLM output can be, even from a SOTA model like PaLM 2.

[
  {
    "name": "Deputy Secretary of State Wendy Sherman",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman",
    "type": "Person"
  },
  {
    "name": "Latvian Foreign Minister Edgars Rinkevics",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Edgars_Rinkevicis",
    "type": "Person"
  },
  {
    "name": "Washington, D.C.",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C.",
    "type": "Location"
  },
  {
    "name": "NATO Summit",
    "wikipedia_url": "https://en.wikipedia.org/wiki/NATO_Summit",
    "type": "Event"
  },
  {
    "name": "Vilnius",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Vilnius",
    "type": "Location"
  },
  {
    "name": "NATO",
    "wikipedia_url": "https://en.wikipedia.org/wiki/NATO",
    "type": "Organization"
  },
  {
    "name": "Eastern Flank",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Eastern_Flank_of_NATO",
    "type": "Other"
  },
  {
    "name": "Russian disinformation",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Russian_disinformation",
    "type": "Other"
  },
  {
    "name": "Russia",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Russia",
    "type": "Country"
  },
  {
    "name": "U.S.-Latvia coordination",
    "wikipedia_url": "https://en.wikipedia.org/wiki/U.S.-Latvia_relations",
    "type": "Other"
  },
  {
    "name": "Indo-Pacific",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Indo-Pacific",
    "type": "Other"
  },
  {
    "name": "PRC economic coercion",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Chinese_economic_coercion",
    "type": "Other"
  }
]
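The difference between the two runs can be computed mechanically rather than by eye. The sketch below diffs the Wikipedia titles from the output before the apostrophe was added against the titles from the output above; the two lists are transcribed by hand from those outputs and `diff_titles` is a hypothetical helper.

```python
def diff_titles(before, after):
    """Return (dropped, added) titles, each in its original order."""
    before_set, after_set = set(before), set(after)
    dropped = [t for t in before if t not in after_set]
    added = [t for t in after if t not in before_set]
    return dropped, added

# Titles from the run without the apostrophe ("NATOs").
before = [
    "Wendy_Sherman", "Edgars_Rinkevicis", "Washington,_D.C.", "NATO_Summit",
    "Vilnius", "NATO", "Russia", "United_States", "Latvia", "Indo-Pacific",
    "People%27s_Republic_of_China",
]
# Titles from the run with the apostrophe ("NATO's").
after = [
    "Wendy_Sherman", "Edgars_Rinkevicis", "Washington,_D.C.", "NATO_Summit",
    "Vilnius", "NATO", "Eastern_Flank_of_NATO", "Russian_disinformation",
    "Russia", "U.S.-Latvia_relations", "Indo-Pacific",
    "Chinese_economic_coercion",
]

dropped, added = diff_titles(before, after)
print("dropped:", dropped)
print("added:", added)
```

On these two lists the diff reports three dropped titles (United_States, Latvia and People%27s_Republic_of_China) and four added ones.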

Let's compare with the GCP NLP API:

gcloud ml language analyze-entities --content="Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius and NATO's continuing work to strengthen Eastern Flank defense, counter Russian disinformation, and diplomatically isolate Russia at the last Washington meeting. They also discussed ways to deepen U.S.-Latvia coordination in the Indo-Pacific and counter PRC economic coercion."

This output is vastly richer, though part of this richness comes from the way it decomposes the text into its various permutations, linking "Wendy Sherman", "Deputy Secretary of State", "Sherman" and "Deputy Secretary" all to the proper name entity of Wendy Sherman. It also makes some mistakes, such as counting "Washington" as a reference to Washington state, rather than Washington, DC. One important attribute of the output is that it lists every mention of an entity, rather than just a single mention. For example, "Washington, DC" is referenced twice more in the text, with the subsequent mentions in the form of "DC" and "Washington". The API incorrectly codes the "Washington" mention as the state, but correctly codes "DC" as a second mention of "Washington, DC". Our PaLM output above did not feature subsequent mentions.

{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 74,
            "content": "Edgars Rinkevics"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 57,
            "content": "Foreign Minister"
          },
          "type": "COMMON"
        },
        {
          "text": {
            "beginOffset": 163,
            "content": "Rinkevics"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/0h_d362",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Edgars_Rinkēvičs"
      },
      "name": "Edgars Rinkevics",
      "salience": 0.31754252,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 26,
            "content": "Wendy Sherman"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 0,
            "content": "Deputy Secretary of State"
          },
          "type": "COMMON"
        },
        {
          "text": {
            "beginOffset": 134,
            "content": "Sherman"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 117,
            "content": "Deputy Secretary"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {
        "mid": "/m/0gkhcs",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman"
      },
      "name": "Wendy Sherman",
      "salience": 0.30463105,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 49,
            "content": "Latvian"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Latvian",
      "salience": 0.07819497,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 100,
            "content": "Washington, D.C."
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 176,
            "content": "DC"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/0rh6k",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C."
      },
      "name": "Washington, D.C.",
      "salience": 0.034307517,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 146,
            "content": "Foreign Minister"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "Foreign Minister",
      "salience": 0.018897872,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 247,
            "content": "work"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "work",
      "salience": 0.01681741,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 280,
            "content": "defense"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "defense",
      "salience": 0.015437325,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 305,
            "content": "disinformation"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "disinformation",
      "salience": 0.015437325,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 407,
            "content": "ways"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "ways",
      "salience": 0.011590077,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 434,
            "content": "coordination"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "coordination",
      "salience": 0.011590077,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 229,
            "content": "NATO"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/059dn",
        "wikipedia_url": "https://en.wikipedia.org/wiki/NATO"
      },
      "name": "NATO",
      "salience": 0.011338662,
      "type": "ORGANIZATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 202,
            "content": "NATO Summit"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/02r330x",
        "wikipedia_url": "https://en.wikipedia.org/wiki/NATO_summit"
      },
      "name": "NATO Summit",
      "salience": 0.00996378,
      "type": "EVENT"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 217,
            "content": "Vilnius"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/07_kq",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Vilnius"
      },
      "name": "Vilnius",
      "salience": 0.00996378,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 266,
            "content": "Eastern Flank"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Eastern Flank",
      "salience": 0.009143886,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 297,
            "content": "Russian"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 348,
            "content": "Russia"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/06bnz",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Russia"
      },
      "name": "Russian",
      "salience": 0.009143886,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 492,
            "content": "coercion"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "coercion",
      "salience": 0.008120808,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 427,
            "content": "Latvia"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/04g5k",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Latvia"
      },
      "name": "Latvia",
      "salience": 0.0068603926,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 422,
            "content": "U.S."
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/09c7w0",
        "wikipedia_url": "https://en.wikipedia.org/wiki/United_States"
      },
      "name": "U.S.",
      "salience": 0.0068603926,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 378,
            "content": "meeting"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "meeting",
      "salience": 0.0051173554,
      "type": "EVENT"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 454,
            "content": "Indo-Pacific"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/01wvkz",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Indo-Pacific"
      },
      "name": "Indo-Pacific",
      "salience": 0.004803912,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 479,
            "content": "PRC"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "PRC",
      "salience": 0.004803912,
      "type": "ORGANIZATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 367,
            "content": "Washington"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/081yw",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Washington_(state)"
      },
      "name": "Washington",
      "salience": 0.0034445196,
      "type": "LOCATION"
    }
  ],
  "language": "en"
}
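Because each mention carries its own "beginOffset", this response can be flattened into a list of mentions in text order, which is the form a display annotator or proximity search would typically consume. A minimal sketch over a hand-copied excerpt of the response above; `mentions_by_offset` is a hypothetical helper.

```python
def mentions_by_offset(response):
    """Flatten the response into (offset, mention text, entity name, URL)
    tuples sorted by offset. URL is None when the entity has no Wikipedia link."""
    rows = []
    for entity in response["entities"]:
        url = entity.get("metadata", {}).get("wikipedia_url")
        for mention in entity["mentions"]:
            text = mention["text"]
            rows.append((text["beginOffset"], text["content"], entity["name"], url))
    return sorted(rows, key=lambda row: row[0])

# Excerpt of the response above (three of the entities).
response = {
    "entities": [
        {
            "mentions": [{"text": {"beginOffset": 367, "content": "Washington"}, "type": "PROPER"}],
            "metadata": {"mid": "/m/081yw", "wikipedia_url": "https://en.wikipedia.org/wiki/Washington_(state)"},
            "name": "Washington",
            "type": "LOCATION",
        },
        {
            "mentions": [
                {"text": {"beginOffset": 100, "content": "Washington, D.C."}, "type": "PROPER"},
                {"text": {"beginOffset": 176, "content": "DC"}, "type": "PROPER"},
            ],
            "metadata": {"mid": "/m/0rh6k", "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C."},
            "name": "Washington, D.C.",
            "type": "LOCATION",
        },
        {
            "mentions": [{"text": {"beginOffset": 479, "content": "PRC"}, "type": "PROPER"}],
            "metadata": {},
            "name": "PRC",
            "type": "ORGANIZATION",
        },
    ]
}

for row in mentions_by_offset(response):
    print(row)
```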

What if we modify the PaLM prompt to request that it output all mentions? We'll add the instruction "Include every mention of each, even if it appears multiple times in the text." to our prompt:

time curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
  "instances": [
    { "prompt": "Extract a list of the people, organizations, locations and other entities in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article. Classify the Type of each as one of the following: \\"Organization,Person,Location,Event,Other\\". Include every mention of each, even if it appears multiple times in the text. Output the results in JSON in the format \\"{name, wikipedia url, type}\\" :  Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius and NATO\'s continuing work to strengthen Eastern Flank defense, counter Russian disinformation, and diplomatically isolate Russia at the last Washington meeting. They also discussed ways to deepen U.S.-Latvia coordination in the Indo-Pacific and counter PRC economic coercion. "}
  ],
  "parameters": {
    "temperature": 0.0,
    "maxOutputTokens": 700,
    "topK": 1,
    "topP": 0.0
  }
}'  | jq -r '.predictions[0].content'

This unfortunately does not change our output at all: the model still skips over subsequent mentions in the text. At first glance this might not seem a big issue, but it means that if we want to connect mentions back to their character offsets in the text (for visual display annotation, proximity search, etc.), we will miss these subsequent mentions, since our postprocessing engine won't know to extract "DC" and connect it back to "Washington, DC".
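The gap is easy to see with a naive postprocessor that locates each extracted name by exact string search. This is only a sketch of one possible approach: `find_offsets` is a hypothetical helper and the text is the opening of the sample passage.

```python
def find_offsets(text, name):
    """Return the start offset of every exact occurrence of name in text."""
    offsets = []
    start = text.find(name)
    while start != -1:
        offsets.append(start)
        start = text.find(name, start + 1)
    return offsets

text = (
    "Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister "
    "Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and "
    "Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius"
)

# The only form the model returned for the city.
print(find_offsets(text, "Washington, D.C."))
# The later mention is found only if we already know to look for this alias.
print(find_offsets(text, "DC"))
```

Searching for the one surface form the model returned finds the mention at offset 100 but not the "DC" mention at offset 176, which matches only when the alias is supplied separately.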

Let's add four more sentences to our input. Remember, the only change we are making is to add four sentences:

time curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]/locations/us-central1/publishers/google/models/text-bison:predict -d \
$'{
"instances": [
{ "prompt": "Extract a list of the people, organizations, locations and other entities in the following text, disambiguating and normalizing each name to the title of its corresponding Wikipedia article. Classify the Type of each as one of the following: \\"Organization,Person,Location,Event,Other\\". Output the results in JSON in the format \\"{name, wikipedia url, type}\\" : Deputy Secretary of State Wendy Sherman met with Latvian Foreign Minister Edgars Rinkevics today in Washington, D.C. Deputy Secretary Sherman and Foreign Minister Rinkevics in DC discussed the upcoming NATO Summit in Vilnius and NATO\'s continuing work to strengthen Eastern Flank defense, counter Russian disinformation, and diplomatically isolate Russia at the last Washington meeting. They also discussed ways to deepen U.S.-Latvia coordination in the Indo-Pacific and counter PRC economic coercion. President Biden announced today a new policy towards China. Biden announced today a new policy towards the Chinese government. Joseph Biden announced today a new policy towards China\'s government. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections. "}
],
"parameters": {
"temperature": 0.0,
"maxOutputTokens": 700,
"topK": 1,
"topP": 0.0
}
}' | jq -r '.predictions[0].content'

This time the output is truncated and is missing the final bracket.

...
{
    "name": "China",
    "wikipedia_url": "https://en.wikipedia.org/wiki/China",
    "type": "Country"
  },
  {
    "name": "Chinese government",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Government_of_China",
    "type": "Organization"
  },
  {
    "name": "Joseph Biden",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden",
    "type": "Person"
  },

If we look at the output statistics we can see we've hit our output cap of 700 tokens, explaining the truncation:

"metadata": {
    "tokenMetadata": {
      "outputTokenCount": {
        "totalBillableCharacters": 1557,
        "totalTokens": 700
      },
      "inputTokenCount": {
        "totalBillableCharacters": 985,
        "totalTokens": 228
      }
    }
  }
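A pipeline can flag this condition automatically instead of discovering it through a JSON parse error downstream. The sketch below checks the reported output token count against the requested cap and, separately, whether the content parses at all. Both helper names are hypothetical, and reaching the cap is treated here only as a signal that truncation is likely.

```python
import json

def hit_output_cap(metadata, max_output_tokens):
    """True if the reported output token count reached the requested cap."""
    used = metadata["tokenMetadata"]["outputTokenCount"]["totalTokens"]
    return used >= max_output_tokens

def parses_as_json(content):
    """True if content is a complete, valid JSON document."""
    try:
        json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True

# Metadata from the earlier run (594 tokens) and from this run (700 tokens).
earlier = {"tokenMetadata": {"outputTokenCount": {"totalTokens": 594, "totalBillableCharacters": 1305}}}
this_run = {"tokenMetadata": {"outputTokenCount": {"totalTokens": 700, "totalBillableCharacters": 1557}}}

print(hit_output_cap(earlier, 700))
print(hit_output_cap(this_run, 700))
print(parses_as_json('[{"name": "Joseph Biden", "type": "Person"},'))
```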

We'll rerun with maxOutputTokens set to 1024, which yields the following. This time "U.S.-Latvia_relations" becomes "United_States%E2%80%93Latvia_relations". While both reference the same Wikipedia entry, the latter contains "%E2%80%93", the URI (percent) encoding of the UTF-8 bytes of the EN DASH (Unicode codepoint 8211), which means that end applications will need to perform URI and UTF-8 decoding of entries.
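The decoding step itself is small. As an illustration, Python's standard `urllib.parse.unquote` reverses the percent-encoding and interprets the resulting bytes as UTF-8 by default:

```python
from urllib.parse import unquote

encoded = "United_States%E2%80%93Latvia_relations"
decoded = unquote(encoded)  # percent-decode, then decode the bytes as UTF-8

# The character right after "United_States" is the decoded dash.
dash = decoded[len("United_States")]

print(decoded)
print(ord(dash))  # 8211, i.e. U+2013 EN DASH
print(unquote("People%27s_Republic_of_China"))  # %27 is an apostrophe
```

Decoding makes equivalent URLs comparable character for character, but it does not by itself reconcile genuinely different titles.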

While it still fails to extract the standalone "Washington" and "DC" mentions, it does extract "Chinese government" and "China's government" and resolves them both to "Government_of_China".

It once again hallucinates "Eastern_Flank_of_NATO" and this time replaces its earlier "Chinese_economic_coercion" hallucination with "Economic_coercion_by_the_People%27s_Republic_of_China". Simply by appending a few sentences, earlier entities are changed.

Unfortunately, it also incorrectly resolves Santiago Peña's name to the hallucinated "Santiago_P%C3%A9a", which decodes to "Santiago_Péa": the ñ is dropped and the e gains an accent.

[
  {
    "name": "Deputy Secretary of State Wendy Sherman",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman",
    "type": "Person"
  },
  {
    "name": "Latvian Foreign Minister Edgars Rinkevics",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Edgars_Rinkevicis",
    "type": "Person"
  },
  {
    "name": "Washington, D.C.",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C.",
    "type": "Location"
  },
  {
    "name": "NATO",
    "wikipedia_url": "https://en.wikipedia.org/wiki/NATO",
    "type": "Organization"
  },
  {
    "name": "Vilnius",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Vilnius",
    "type": "Location"
  },
  {
    "name": "Eastern Flank",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Eastern_Flank_of_NATO",
    "type": "Other"
  },
  {
    "name": "Russian disinformation",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Russian_disinformation",
    "type": "Other"
  },
  {
    "name": "Russia",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Russia",
    "type": "Country"
  },
  {
    "name": "U.S.-Latvia coordination",
    "wikipedia_url": "https://en.wikipedia.org/wiki/United_States%E2%80%93Latvia_relations",
    "type": "Other"
  },
  {
    "name": "PRC economic coercion",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Economic_coercion_by_the_People%27s_Republic_of_China",
    "type": "Other"
  },
  {
    "name": "President Biden",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden",
    "type": "Person"
  },
  {
    "name": "China",
    "wikipedia_url": "https://en.wikipedia.org/wiki/China",
    "type": "Country"
  },
  {
    "name": "Chinese government",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Government_of_China",
    "type": "Organization"
  },
  {
    "name": "Joseph Biden",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden",
    "type": "Person"
  },
  {
    "name": "China's government",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Government_of_China",
    "type": "Organization"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/30_April",
    "type": "Date"
  },
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9a",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

Given that it correctly resolved his name earlier, what changed this time?

Let's try a few different permutations of our prompt:

On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

This yields the correct results:

[
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

Let's prepend with a single sentence:

Wendy Sherman visited DC today. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

This yields the following. Note how DC is correctly recognized this time, even though it was not in our earlier examples. However, the Wikipedia entry for Peña's name is still wrong.

[
  {
    "name": "Wendy Sherman",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman",
    "type": "Person"
  },
  {
    "name": "DC",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C.",
    "type": "Location"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/30_April",
    "type": "Date"
  },
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9a",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

What about just mentioning Sherman's name?

Wendy Sherman. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

Peña's entry is still wrong, but strangely this causes all of the line breaks and indentation to be removed from the output (though it is still valid JSON and passes verification):

[{"name": "Wendy Sherman", "wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman", "type": "Person"}, {"name": "30 April", "wikipedia_url": "https://en.wikipedia.org/wiki/30_April", "type": "Date"}, {"name": "Santiago Peña", "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9a", "type": "Person"}, {"name": "Colorado Party", "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)", "type": "Organization"}, {"name": "presidential elections", "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election", "type": "Event"}]
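
Since the compact and indented forms differ only in whitespace, a downstream application can simply parse whatever comes back and re-indent it if needed. A minimal Python sketch, using the first two entries of the output above as an abridged sample:

```python
import json

# First two entries of the single-line output above (abridged).
compact = (
    '[{"name": "Wendy Sherman", '
    '"wikipedia_url": "https://en.wikipedia.org/wiki/Wendy_Sherman", '
    '"type": "Person"}, '
    '{"name": "30 April", '
    '"wikipedia_url": "https://en.wikipedia.org/wiki/30_April", '
    '"type": "Date"}]'
)

entities = json.loads(compact)

# Re-indent for display; ensure_ascii=False keeps characters like ñ readable.
pretty = json.dumps(entities, indent=2, ensure_ascii=False)
print(pretty)

# Parsing the indented form gives back exactly the same data.
print(json.loads(pretty) == entities)
```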

What about Joe Biden?

Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

This time we get the correct Peña entry:

[
  {
    "name": "Joe Biden",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden",
    "type": "Person"
  },
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

And President Biden?

President Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

Still correct:

[
  {
    "name": "Joe Biden",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden",
    "type": "Person"
  },
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/April_30",
    "type": "Date"
  },
  {
    "name": "Presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

What about POTUS Joe Biden?

President of the United States Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

This time it hallucinates his name as "Santiago_Péña" (adding an accent over the e as well), in addition to stripping line breaks and indentation again:

[{"name": "Joe Biden", "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden", "type": "Person"}, {"name": "United States", "wikipedia_url": "https://en.wikipedia.org/wiki/United_States", "type": "Country"}, {"name": "30 April", "wikipedia_url": "https://en.wikipedia.org/wiki/30_April", "type": "Date"}, {"name": "Santiago Peña", "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a", "type": "Person"}, {"name": "Colorado Party", "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)", "type": "Organization"}, {"name": "Presidential elections", "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election", "type": "Event"}]

What if we make Biden visit Paraguay?

President of the United States Joe Biden visited Paraguay. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

The results are the same:

[{"name": "Joe Biden", "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden", "type": "Person"}, {"name": "Paraguay", "wikipedia_url": "https://en.wikipedia.org/wiki/Paraguay", "type": "Location"}, {"name": "Santiago Peña", "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a", "type": "Person"}, {"name": "Colorado Party", "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)", "type": "Organization"}, {"name": "30 April", "wikipedia_url": "https://en.wikipedia.org/wiki/30_April", "type": "Date"}, {"name": "Presidential elections", "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election", "type": "Event"}]

What about just Paraguay?

Paraguay. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

We get the same:

[{"name": "Paraguay", "wikipedia_url": "https://en.wikipedia.org/wiki/Paraguay", "type": "Location"}, {"name": "Santiago Peña", "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a", "type": "Person"}, {"name": "Colorado Party", "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)", "type": "Organization"}, {"name": "30 April", "wikipedia_url": "https://en.wikipedia.org/wiki/30_April", "type": "Date"}, {"name": "presidential elections", "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election", "type": "Event"}]

What if we actually make it into a real sentence?

In Paraguay on 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

This restores line breaks and indentation, but still gives us the hallucinated Wikipedia entry:

[
  {
    "name": "Paraguay",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Paraguay",
    "type": "Location"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/30_April",
    "type": "Date"
  },
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_P%C3%A9%C3%B1a",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "Presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

Even if we reword slightly, we get the same results:

On 30 April in Paraguay, Santiago Peña, from the ruling Colorado Party, won the presidential elections.

However, if we put Paraguay AFTER his name like this:

On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections in Paraguay.

Suddenly we get the correct results again:

[
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
    "type": "Person"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "Paraguay",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Paraguay",
    "type": "Location"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/April_30",
    "type": "Date"
  },
  {
    "name": "Presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]

In fact, if we put Paraguay anywhere after his name, as in:

On 30 April, Santiago Peña, from Paraguay\'s ruling Colorado Party, won the presidential elections.

We get the correct results again:

[
  {
    "name": "Santiago Peña",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
    "type": "Person"
  },
  {
    "name": "Paraguay",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Paraguay",
    "type": "Location"
  },
  {
    "name": "Colorado Party",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Colorado_Party_(Paraguay)",
    "type": "Organization"
  },
  {
    "name": "30 April",
    "wikipedia_url": "https://en.wikipedia.org/wiki/April_30",
    "type": "Date"
  },
  {
    "name": "Presidential elections",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Presidential_election",
    "type": "Event"
  }
]
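
To see the pattern across these permutations at a glance, we can tally the Wikipedia URLs returned for Peña in each run. The Python sketch below is only a bookkeeping aid: the helper title_of and the short labels are our own, and the URLs are copied from the outputs shown above (the "On 30 April in Paraguay" variant is omitted because its output was not printed).

```python
import unicodedata
from collections import Counter
from urllib.parse import unquote

BASE = "https://en.wikipedia.org/wiki/"

def title_of(url):
    """Hypothetical helper: decode the last URL segment into a readable title."""
    segment = url.rsplit("/", 1)[-1]
    return unicodedata.normalize("NFC", unquote(segment)).replace("_", " ")

# (text placed around the Peña sentence, URL returned for Peña)
observed = [
    ("(sentence alone)", BASE + "Santiago_Peña"),
    ("Wendy Sherman visited DC today.", BASE + "Santiago_P%C3%A9a"),
    ("Wendy Sherman.", BASE + "Santiago_P%C3%A9a"),
    ("Joe Biden.", BASE + "Santiago_Peña"),
    ("President Joe Biden.", BASE + "Santiago_Peña"),
    ("President of the United States Joe Biden.", BASE + "Santiago_P%C3%A9%C3%B1a"),
    ("... Joe Biden visited Paraguay.", BASE + "Santiago_P%C3%A9%C3%B1a"),
    ("Paraguay.", BASE + "Santiago_P%C3%A9%C3%B1a"),
    ("In Paraguay on 30 April, ...", BASE + "Santiago_P%C3%A9%C3%B1a"),
    ("... elections in Paraguay.", BASE + "Santiago_Peña"),
    ("... from Paraguay's ruling Colorado Party ...", BASE + "Santiago_Peña"),
]

counts = Counter(title_of(url) for _, url in observed)
for title, n in counts.most_common():
    print(f"{n:2d}  {title}")
```

Decoding and NFC-normalizing before counting means a percent-encoded URL and a raw UTF-8 URL for the same title land in the same bucket.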

How does sentence ordering affect GCP's NLP API? The MID code for Santiago Peña is /g/11cls8sck7, but the API also outputs the Wikipedia URL for each entity so we can trivially check whether he is correctly disambiguated and resolved in each case:

gcloud ml language analyze-entities --content="On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections." | grep /Santiago_Peña
gcloud ml language analyze-entities --content="Wendy Sherman visited DC today. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="Wendy Sherman. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="President Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="President of the United States Joe Biden. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="President of the United States Joe Biden visited Paraguay. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="Paraguay. On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="In Paraguay on 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="On 30 April in Paraguay, Santiago Peña, from the ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="On 30 April, Santiago Peña, from the ruling Colorado Party, won the presidential elections in Paraguay."| grep /Santiago_Peña
gcloud ml language analyze-entities --content="On 30 April, Santiago Peña, from Paraguay's ruling Colorado Party, won the presidential elections."| grep /Santiago_Peña

The API correctly identifies and resolves his name no matter how the sentence is ordered.
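
For anything beyond a quick check, grep can be swapped for a structured test on the JSON that gcloud prints. The Python sketch below assumes the response carries an "entities" list whose items hold a "metadata" object with a "wikipedia_url" field; the helper name resolves_to is our own, and the sample response is a hand-built, abridged illustration rather than captured API output.

```python
import json
from urllib.parse import unquote

def resolves_to(response_json, title):
    """Hypothetical helper: True if any entity links to the given Wikipedia title."""
    data = json.loads(response_json)
    for entity in data.get("entities", []):
        url = entity.get("metadata", {}).get("wikipedia_url", "")
        # Decode first so raw and percent-encoded URLs compare equally.
        if unquote(url).endswith("/" + title):
            return True
    return False

# Illustrative, abridged response shape (not real captured output).
sample = json.dumps({
    "entities": [
        {
            "name": "Santiago Peña",
            "type": "PERSON",
            "metadata": {
                "mid": "/g/11cls8sck7",
                "wikipedia_url": "https://en.wikipedia.org/wiki/Santiago_Peña",
            },
        },
        {"name": "Colorado Party", "type": "ORGANIZATION", "metadata": {}},
    ]
})

print(resolves_to(sample, "Santiago_Peña"))
```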
