Using Google's Cloud Inference API To Explore The Natural Language API's Annotations Of 100 Million News Articles

Last month we unveiled the Global Entity Graph (GEG), a massive dataset of more than 11 billion entity annotations over 100 million English-language global news articles from 2016 to 2019, annotated by Google's Cloud Natural Language API. We've previously shown how the co-occurrences of this enormous dataset can be visualized as an immense 24.5-billion-edge graph structure, but beyond macro-level visualizations, how can these 11 billion annotations be interactively explored to answer real-world questions?

Enter Google's Cloud Inference API.

Last year we showed how the Inference API could be used to explore a month of Cloud Vision API annotations, illustrating how it was able to surface nuanced and complex trends not visible in traditional histogram analyses. What might it look like to apply the Inference API to a massive textual entity dataset?

The Inference API today supports loading data directly from BigQuery, making the process of analyzing large datasets trivially easy.

The Global Entity Graph is already available in BigQuery, so importing it into the Inference API requires nothing more than reformatting it into the Inference API's required structure and deciding which features from the GEG to use.

While the Natural Language API identifies all entities found in each document, for a smaller subset of entities it also assigns a unique ID code (a "MID" code) that groups together all name variants, allowing normalization. For example, mentions of the "U.S. Federal Reserve," "Federal Reserve," "Federal Reserve Board," "New York Fed," "Atlanta Fed," "St. Louis Fed" and even just "The Fed" and "Fed" all resolve to the unique entity ID "/m/02xmb", making it trivial to perform name normalization.

In all, the Natural Language API has identified 13,892,261 distinct MID codes for the 100 million news articles annotated as part of the GEG.

The Inference API requires a unique ID for each "group" (in this case a news article). While the GEG uses the article URL as its unique key, the Inference API requires a key of INTEGER type. Happily, BigQuery provides an implementation of FarmHash with FARM_FINGERPRINT() that can translate URLs into the unique INTEGER IDs required by the Inference API.

Creating the BigQuery import table for the Inference API thus required just two SQL queries.

The first created the initial table and populated it with the domain name of each article:

SELECT FARM_FINGERPRINT(url) group_id, 'PageDomain' data_name, NET.REG_DOMAIN(url) data_value, date start_time FROM `gdelt-bq.gdeltv2.geg_gcnlapi`

The query uses FARM_FINGERPRINT() to convert URLs to numeric IDs and BigQuery's NET.REG_DOMAIN() to extract the root domain of each URL. Subdomains like "arabic.cnn.com" are thus converted to their root "cnn.com". This was preferable for the purposes of this analysis since the GEG is a small sampled dataset, but NET.REG_DOMAIN() could easily be swapped for any of BigQuery's other domain parsing functions.

The real workhorse of the translation process involves transforming the nested entity arrays of the GEG into the flattened records required by the Inference API, while normalizing the names of each entity. This is accomplished with a single query, whose output was appended to the table from above:

select group_id, data_name, a.entity data_value, start_time from (
  SELECT FARM_FINGERPRINT(url) group_id, CONCAT('Entity',entity.type) data_name, date start_time, entity.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entity WHERE entity.mid is not null
) b JOIN (
  SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities where entities.mid is not null group by entities.mid
) a USING(mid)

The first half of this query flattens the nested entity arrays using UNNEST() and filters out entities which do not have an associated MID code. It concatenates the word "Entity" with the actual entity type returned by the Natural Language API, storing the entities in the Inference API by type, allowing for type-based querying.
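As a rough illustration of what this flattening step does, the Python sketch below turns one nested article record into one row per MID-bearing entity. The record layout and the helper names are assumptions made for illustration. In particular, `stand_in_fingerprint` is a SHA-256-based stand-in and does not reproduce the values BigQuery's FARM_FINGERPRINT() would return.

```python
import hashlib
import struct


def stand_in_fingerprint(url: str) -> int:
    """Map a URL to a signed 64-bit integer.

    Assumption: this is only a stand-in for BigQuery's FARM_FINGERPRINT();
    it uses SHA-256, so its integers will NOT match BigQuery's output.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def flatten_article(article: dict) -> list:
    """Emit one row per entity that carries a MID (UNNEST plus the WHERE filter)."""
    rows = []
    for entity in article["entities"]:
        if entity.get("mid") is None:
            continue  # entities without a MID code are dropped
        rows.append({
            "group_id": stand_in_fingerprint(article["url"]),
            "data_name": "Entity" + entity["type"],
            "start_time": article["date"],
            "mid": entity["mid"],
        })
    return rows


# Hypothetical record shaped loosely like one GEG row.
article = {
    "url": "https://example.com/story",
    "date": "2019-03-24T12:00:00Z",
    "entities": [
        {"name": "The Fed", "type": "ORGANIZATION", "mid": "/m/02xmb"},
        {"name": "interest rates", "type": "OTHER", "mid": None},
    ],
}
rows = flatten_article(article)
```

Only the first entity survives the filter, and its type becomes the data name "EntityORGANIZATION".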

The second half of the query computes the most common name associated with each MID. While MID "/m/02xmb" might refer to "The Fed" and "Atlanta Fed" and "New York Fed" and many other names, this query returns that the most common name it appears with is "Federal Reserve." The join combines these two so that every entity in the dataset with MID "/m/02xmb" is renamed to "Federal Reserve," normalizing the names. Despite this large join, the query took just 2.5 minutes to complete.
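The normalization logic can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, using made-up mention counts. It counts names exactly with `collections.Counter`, whereas APPROX_TOP_COUNT() in BigQuery is an approximate aggregate, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def most_common_names(mentions):
    """Return {mid: most frequent surface name} (the APPROX_TOP_COUNT half)."""
    counts = defaultdict(Counter)
    for name, mid in mentions:
        counts[mid][name] += 1
    return {mid: counter.most_common(1)[0][0] for mid, counter in counts.items()}


def normalize(mentions):
    """Replace each mention's name with its MID's most common name (the JOIN)."""
    canonical = most_common_names(mentions)
    return [canonical[mid] for _, mid in mentions]


# Hypothetical (name, mid) pairs; the frequencies are invented for illustration.
mentions = [
    ("Federal Reserve", "/m/02xmb"),
    ("The Fed", "/m/02xmb"),
    ("Federal Reserve", "/m/02xmb"),
    ("Atlanta Fed", "/m/02xmb"),
]
normalized = normalize(mentions)
```

Every mention sharing the MID ends up with the same name, here "Federal Reserve", because it is the most frequent variant in the toy input.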

The end result of the two queries was a 72GB temporary table with 1,688,068,234 records.

Next, sign up for the Cloud Inference API and follow the instructions to get it set up for your account. Note that for those familiar with earlier Google APIs for machine learning, the Cloud Inference API uses Google's modern authentication flow.

When loading a BigQuery table into the Inference API, the load command must specify the list of unique "data_name" values in the dataset. A final SQL query was used to compile this list:

SELECT data_name, count(1) cnt FROM `[TEMPORARYTABLE]` group by data_name order by data_name asc

Using this list of field names, it takes only a single curl command to load the temporary BigQuery table into the Inference API (replace "[TEMPORARYTABLE]" in the "uri" field with the name of the temporary table created above and "[YOURPROJECTID]" with the numeric ID of your GCP project):

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"gdeltbq_geg_v1",
  "data_names": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART",
    "PageDomain"
  ],
  "data_sources": [
    { "uri":"[TEMPORARYTABLE]" }
  ]
}'
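When the list of data names comes from a query result rather than being typed by hand, it can be convenient to generate the request body programmatically. The following is a minimal Python sketch that assumes the same field names as the command above. The helper name `build_dataset_request` is hypothetical, and the sketch only builds the JSON body without calling the API.

```python
import json


def build_dataset_request(dataset_name, data_names, table_uri):
    """Serialize the dataset-creation body; json.dumps always emits valid JSON."""
    body = {
        "name": dataset_name,
        "data_names": sorted(set(data_names)),  # de-duplicated, alphabetical
        "data_sources": [{"uri": table_uri}],
    }
    return json.dumps(body, indent=2)


# The data_name values returned by the GROUP BY query above.
data_names = [
    "PageDomain", "EntityPERSON", "EntityORGANIZATION", "EntityLOCATION",
    "EntityEVENT", "EntityWORK_OF_ART", "EntityCONSUMER_GOOD",
    "EntityOTHER", "EntityUNKNOWN",
]
# "[TEMPORARYTABLE]" is left as the same placeholder used in the article.
payload = build_dataset_request("gdeltbq_geg_v1", data_names, "[TEMPORARYTABLE]")
```

The resulting string can be saved to a file and passed to curl with `-d @filename` in place of the inline body.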

This command will cause the Inference API to begin loading the data in the background into a new Inference API dataset called "gdeltbq_geg_v1". You can check the status of the load by running:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets

Once the data finishes loading in around 30 minutes or so, we're ready to start interactively exploring the underlying trends of 1.6 billion entity annotations over 100 million news articles!

Given the amount of media coverage of Special Counsel Robert Mueller over the time period covered by the GEG, an easy first question might be what other people are most closely associated with him? In other words, not what names appear the most in articles mentioning Mueller (which may also appear in many non-Mueller articles), but rather what names appear more in articles mentioning Mueller than they do in articles that don't mention Mueller.
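To make the distinction between raw frequency and association concrete, here is a small Python sketch over a toy corpus. It scores each name by how often it appears in matching articles relative to how often it appears overall. This is one simple way to formalize the idea and is not claimed to be the Inference API's actual scoring; the `bg_exp` parameter is a hypothetical knob, and any correspondence to the API's "bgprobExp" setting is an assumption.

```python
def association_scores(articles, query_name, bg_exp=1.0):
    """Score co-occurring names by P(name | query) / P(name) ** bg_exp."""
    total = len(articles)
    matching = [names for names in articles if query_name in names]
    if not matching:
        return {}
    scores = {}
    candidates = set().union(*matching) - {query_name}
    for name in candidates:
        # Foreground: share of matching articles that mention the name.
        fg = sum(name in names for names in matching) / len(matching)
        # Background: share of all articles that mention the name.
        bg = sum(name in names for names in articles) / total
        scores[name] = fg / bg ** bg_exp
    return scores


# Hypothetical toy corpus: each article is the set of person names it mentions.
articles = [
    {"Robert Mueller", "Paul Manafort", "Donald Trump"},
    {"Robert Mueller", "Paul Manafort", "Donald Trump"},
    {"Donald Trump", "Theresa May"},
    {"Donald Trump", "Angela Merkel"},
]
scores = association_scores(articles, "Robert Mueller")
```

In this toy corpus both names appear in every Mueller article, but the name that also appears in every other article scores 1.0 while the one confined to Mueller coverage scores 2.0.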

Accomplishing this with the Inference API is trivial:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityPERSON",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

The query above searches for all articles that contain "Robert Mueller" as an "EntityPERSON" and then returns the other "EntityPERSON" values that are most closely associated with it.

The resulting names include Paul Manafort, Rod Rosenstein, William Barr, Rick Gates, Jerrold Nadler, James Comey, Don McGahn and Michael Cohen, which represent a cross-section of the biggest names in the Mueller story of the past three years.

Most importantly, despite searching the correlations of more than 1.6 billion entities across 100 million groupings, the API took less than a second to return these results, permitting real-time ad hoc analysis.

What about the top organizations most closely associated with Mueller? Simply switching out "EntityPERSON" for "EntityORGANIZATION" is all that's required:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityORGANIZATION",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields House Judiciary Committee, Justice Department, House Intelligence Committee, Special Counsel, FBI, Senate Intelligence Committee and the Internet Research Agency, again a perfectly representative cross-section. Note that the presence of the Internet Research Agency on this list is a reminder of the Inference API's powerful ability to surface co-occurrences that would not be caught by a traditional semantic relatedness graph.

Identifying the top geographic locations again requires just a single parameter change:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityLOCATION",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This time yielding Trump Tower, Russia, White House, Watergate, Washington, Capitol Hill and Ukraine.
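Since these queries differ only in the "dataName" value, the request bodies can also be generated rather than edited by hand. The Python sketch below mirrors the JSON structure used in the commands above. The function name and its defaults are hypothetical, and it only produces the body text without sending a request.

```python
import json


def build_query_body(dataset, term_name, term_value, data_name,
                     bgprob_exp=0.7, max_entries=50):
    """Build the JSON body for a single-term query with one distribution."""
    body = {
        "name": dataset,
        "queries": [{
            "query": {
                "type": 1,
                "term": {"name": term_name, "value": term_value},
            },
            "distributionConfigs": {
                "bgprobExp": bgprob_exp,
                "dataName": data_name,
                "maxResultEntries": max_entries,
            },
        }],
    }
    return json.dumps(body, indent=2)


# One body per entity type, all anchored on the same search term.
bodies = {
    data_name: build_query_body(
        "gdeltbq_geg_v1", "EntityPERSON", "Robert Mueller", data_name)
    for data_name in ("EntityPERSON", "EntityORGANIZATION", "EntityLOCATION")
}
```

Each generated body can be written to a file and supplied to curl with `-d @filename`.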

Repeating again for the top "events":

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityEVENT",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This offers Witch Hunt, Saturday Night Massacre, Watergate, Republican National Convention and Russiagate.

And again for "works of art":

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityWORK_OF_ART",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields the Mueller Report, Foreign Intelligence Surveillance Act, Foreign Agents Registration Act, Steele Dossier, 25th Amendment, A Higher Loyalty (a book by former FBI Director James Comey) and the Fifth Amendment.

What were the top news outlets that covered the story more than others? Once again, a single query yields the answer in less than a second:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "PageDomain",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This results in thehill.com, theweek.com, rawstory.com, alternet.org, politico.com and axios.com.

The time period over which terms are accumulated can also be adjusted, allowing one to inquire about the top names during a specific time period rather than overall. For example, instead of asking for the top names associated with Mueller over the past three years, what about the top names during the day of September 25, 2019, when the Ukraine story broke?

Doing so requires simply adding "restrictStartTime" and "restrictEndTime" to the query:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityPERSON",
      "maxResultEntries": 50
    },
    "restrictStartTime": "2019-09-25T00:00:00Z",
    "restrictEndTime": "2019-09-25T23:59:59Z"
  }]
}' > RESULTS

Whereas the original three-year query yielded Paul Manafort, Rod Rosenstein, William Barr, Rick Gates, Jerrold Nadler, James Comey, Don McGahn and Michael Cohen, this single-day query instead yields Volodymyr Zelenskiy, Hunter Biden, Joe Biden, John Lewis and Nancy Pelosi, showing how the Inference API can be used to explore both broad and narrow timespans with ease.
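The two timestamps for a single-day window are easy to get wrong by hand, so here is a small Python sketch that derives them from a calendar date. It assumes the window is expressed in UTC with second precision, as in the query above; the helper name `day_window` is hypothetical.

```python
from datetime import date, datetime, time, timezone


def day_window(day: date) -> dict:
    """Return start/end timestamps covering one UTC calendar day."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    end = datetime.combine(day, time(23, 59, 59), tzinfo=timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "restrictStartTime": start.strftime(fmt),
        "restrictEndTime": end.strftime(fmt),
    }


window = day_window(date(2019, 9, 25))
```

The returned dictionary can be merged into a query entry alongside "query" and "distributionConfigs".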

Similarly, the query below filters by news outlet, asking for the top names mentioned in cnn.com's coverage on March 24, 2019, the day Attorney General William Barr released his summary of the Mueller Report.

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": "TYPE_AND",
        "children": [
          {
            "type": "TYPE_TERM",
            "term": {
              "name": "PageDomain",
              "value": "cnn.com"
            }
          },
          {
            "type": "TYPE_TERM",
            "term": {
              "name": "ted",
              "value": "17980"
            }
          }
        ]
     },
    "distributionConfigs": {
      "dataName": "EntityPERSON",
      "maxResultEntries": 50,
      "bgprobExp": 0.3
    }
  }]
}' > RESULTS

This yields William Barr, Robert Mueller, Frank Pallotta (a CNN media reporter), Jerrold Nadler, Roger Rosner (Apple VP making several announcements at Apple's press event) and Rod Rosenstein. The inclusion of Rosner's name reminds us that despite its outsized media coverage, the release of the Mueller Report summary was far from the only news that day.

Switching out "cnn.com" for "politico.com" allows a comparison of their coverage of the day:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": "TYPE_AND",
        "children": [
          {
            "type": "TYPE_TERM",
            "term": {
              "name": "PageDomain",
              "value": "politico.com"
            }
          },
          {
            "type": "TYPE_TERM",
            "term": {
              "name": "ted",
              "value": "17980"
            }
          }
        ]
     },
    "distributionConfigs": {
      "dataName": "EntityPERSON",
      "maxResultEntries": 50,
      "bgprobExp": 0.3
    }
  }]
}' > RESULTS

Politico's politics-focused coverage is apparent in its lack of Apple-related names, with William Barr, Robert Mueller, Darren Samuelsohn (Politico senior White House reporter), Nancy Pelosi and Jerrold Nadler. As with CNN, Politico also saw one of its reporters in the most-associated list for the day.
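The two comparison queries above share the same compound structure, which can be assembled from small building blocks. The Python sketch below reproduces that "TYPE_AND" and "TYPE_TERM" nesting as plain dictionaries. All function names are hypothetical, and the sketch only builds the request body.

```python
import json


def term(name, value):
    """A single TYPE_TERM condition on one data name."""
    return {"type": "TYPE_TERM", "term": {"name": name, "value": value}}


def and_query(*children):
    """Combine conditions so that all of them must match."""
    return {"type": "TYPE_AND", "children": list(children)}


def build_body(dataset, query, data_name, bgprob_exp=0.3, max_entries=50):
    """Wrap a query tree and one distribution config into a request body."""
    return json.dumps({
        "name": dataset,
        "queries": [{
            "query": query,
            "distributionConfigs": {
                "dataName": data_name,
                "maxResultEntries": max_entries,
                "bgprobExp": bgprob_exp,
            },
        }],
    }, indent=2)


# Same conditions as the queries above, once per outlet.
bodies = {
    domain: build_body(
        "gdeltbq_geg_v1",
        and_query(term("PageDomain", domain), term("ted", "17980")),
        "EntityPERSON",
    )
    for domain in ("cnn.com", "politico.com")
}
```

Comparing further outlets then only requires adding their domains to the tuple.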

What about other names in the news beyond Mueller? The query below finds the top people most closely associated with Uber:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityORGANIZATION",
      "value": "Uber"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityPERSON",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields Travis Kalanick, Dara Khosrowshahi, Garrett Camp, Elaine Herzberg, Anthony Levandowski and Emil Michael.

Similarly, the top organizations related to Elon Musk:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
        "name": "EntityPERSON",
        "value": "Elon Musk"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityORGANIZATION",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields SpaceX, Tesla, Boring Company, SolarCity and Blue Origin. The inclusion of Blue Origin is intriguing: it captures that quite a bit of the news coverage of Blue Origin references Musk's SpaceX, showing how rival companies are being described in the news in the context of his own efforts.

Similarly, other people associated with Musk:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
        "name": "EntityPERSON",
        "value": "Elon Musk"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityPERSON",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields Gwynne Shotwell, Yusaku Maezawa, Robyn Denholm and Deepak Ahuja.

And the top locations associated with Musk:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
        "name": "EntityPERSON",
        "value": "Elon Musk"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "EntityLOCATION",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields Hawthorne (the location of Boring Company's tunnel), Kennedy Space Center, Gigafactory, Cape Canaveral Air Force Station, Mars and the International Space Station. That Mars is now so strongly associated with Musk stands testament to his outsized influence in societal conversation about space today.

While all of these are powerful examples in their own right, one of the most exciting capabilities of the Inference API lies in its ability to guide unexpected exploration. Instead of asking what names are most associated with Mueller, what might it look like to ask what days were most associated with him?

In short, we give the Inference API a name (Mueller in this case) and ask it to find the dates most closely associated with that name. As with all things Inference API-related, this takes just a single query:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "EntityPERSON",
      "value": "Robert Mueller"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ted",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

The first three dates returned are March 24, 2019 (Attorney General William Barr's summary of the Mueller Report), April 18, 2019 (release of the full Mueller Report) and May 17, 2017 (Mueller appointed special counsel).

What about the dates most closely associated with an organization, like SpaceX?

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
        "name": "EntityORGANIZATION",
        "value": "SpaceX"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ted",
      "maxResultEntries": 50
    }
  }]
}' > RESULTS

This yields top dates of August 13, 2017 (Dragon ISS resupply mission press kit release prior to launch), February 6, 2018 (Falcon Heavy launch) and September 17, 2018 (announcement of first lunar mission passenger).

As with people, given an organization name, the Inference API can find the most important dates associated with it, demonstrating its ability to examine temporal trends in large document archives.

Finally, what about asking about a given day by itself, without any other search parameters? For example, what were the top person names associated with March 24, 2019 across all worldwide online news coverage on all topics contained in the GEG?

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_geg_v1:query \
  -d'{
  "name": "gdeltbq_geg_v1",
  "queries": [{
    "query": {
      "type": "TYPE_AND",
        "children": [
          {
            "type": "TYPE_TERM",
            "term": {
              "name": "ted",
              "value": "17980"
            }
          }
        ]
     },
    "distributionConfigs": {
      "dataName": "EntityPERSON",
      "maxResultEntries": 50,
      "bgprobExp": 0.3
    }
  }]
}' > RESULTS

The end result includes William Barr, Robert Mueller, Jerrold Nadler and Rod Rosenstein, reflecting how globally significant the release of Barr's summary of the Mueller Report was that day.

Putting this all together, we started with a sample of 100 million worldwide English-language news articles spanning three years and used the Cloud Natural Language API to compile a list of the 11 billion entities they mention. With just two SQL queries we translated this dataset into the Inference API's format and imported it from BigQuery. We were then able to interactively explore this enormous dataset, teasing apart the underlying topical patterns of three years of global news coverage. We identified the names, organizations, locations, events, works of art and dates most closely associated with people like Robert Mueller and Elon Musk, companies like Uber and SpaceX, and specific news outlets like CNN and Politico, and even had the Inference API point us to the most meaningful dates associated with each entity.

In the end, these examples not only showcase the incredible capabilities of the Inference API, but demonstrate how it can be used as a powerful tool to understand the deeper patterns of vast content archives when coupled with tools like the Cloud Vision API or Cloud Natural Language API.