The Top Entities By Day Using BigQuery And The Cloud Natural Language API Through The GEG

The GDELT Global Entity Graph (GEG) dataset of 11 billion entities, extracted by Google's Cloud Natural Language API from 103 million English-language online news articles published July 17, 2016 through October 4, 2019, offers an ideal semantic lens through which to understand the patterns of global news.
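
Before diving into the full analysis, it helps to see how the table is laid out: each row of `gdelt-bq.gdeltv2.geg_gcnlapi` is one article, with its entity annotations stored in a nested "entities" array. As a minimal exploratory sketch (the 10-row limit is just for illustration), the array can be flattened with UNNEST to count entity mentions per day:

SELECT DATE(date) day, COUNT(*) entity_mentions
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
GROUP BY day
ORDER BY day DESC
LIMIT 10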

This past July, Felipe Hoffa showed how a single BigQuery query could process the new TV News Ngram Dataset to generate a list of the top terms by day according to their TFIDF scores, performing the entire analysis inside BigQuery.

With just a few changes, the exact same approach can be used to analyze the 11 billion entities extracted by the Cloud Natural Language API over the last three years to surface the top entities of each type by day.
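
As a quick refresher on the scoring involved, the query below treats each (day, entity type) pair as a "document" and applies the standard TFIDF definition to entity mentions:

$$\mathrm{tf}(w,d)=\frac{\text{mentions of } w \text{ in } d}{\text{total mentions in } d},\qquad \mathrm{idf}(w)=\log\frac{N}{n_w},\qquad \mathrm{tfidf}(w,d)=\mathrm{tf}(w,d)\times\mathrm{idf}(w)$$

where N is the total number of (day, type) documents and n_w is the number of documents containing entity w. An entity therefore scores highly on a given day when it is mentioned heavily that day but appears on comparatively few other days.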

To demonstrate, we've created three CSV files listing, for each day, the top 10 entities of each type extracted by the API, ranked by their TFIDF scores.

Creating each of these three datasets required only a single query and around 160 seconds with BigQuery.

The full query can be seen below.

WITH nested AS (

WITH data AS (
# flatten the nested entities array: one row per entity mention per article
SELECT
  FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date,
  entities.name word,
  entities.numMentions count,
  entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
)
, word_day_type AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, type
FROM data
GROUP BY 1, 3, 4
)
, day_type AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, type
FROM data
GROUP BY 2, 3
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, type, a.counts/b.counts tf
FROM word_day_type a
JOIN day_type b
USING(date, type)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%s %s', date, type)) indocs
FROM word_day_type
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%s %s', date, type)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

SELECT date,
ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words
FROM (
SELECT date, type, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words
FROM (
SELECT word, date, type, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, type
)
GROUP BY date
ORDER BY date DESC

)

SELECT date, type, wordlist
FROM nested, UNNEST(top_words)
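
The same flattening pattern also lends itself to quicker one-off questions. For example, here is a minimal sketch (the PERSON filter and sample day are illustrative choices, not part of the analysis above, and assume the type field stores the API's enum names) that pulls one day's most-mentioned people by raw mention count rather than TFIDF:

# top 10 PERSON entities on a sample day, ranked by raw mention count
SELECT entities.name, SUM(entities.numMentions) mentions
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE entities.type = 'PERSON'
AND DATE(date) = '2019-10-01'
GROUP BY 1
ORDER BY mentions DESC
LIMIT 10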

To consider only capitalized entities (a rough proxy for proper names), the only change needed to the query above is to add a WHERE clause to the very first SELECT statement requiring the entity name to begin with an uppercase letter:

SELECT
  FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date,
  entities.name word,
  entities.numMentions count,
  entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE REGEXP_CONTAINS(entities.name, r'^\p{Lu}')
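
As a quick, hypothetical sanity check of what that filter keeps (the test strings below are invented, not drawn from the dataset):

SELECT name, REGEXP_CONTAINS(name, r'^\p{Lu}') kept
FROM UNNEST(['Donald Trump', 'climate change', 'Brexit']) name

Only "climate change" is dropped, since it does not begin with an uppercase letter.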

To consider only entities with non-null MIDs (machine IDs tying the entity to a specific Google Knowledge Graph entry, meaning the API was able to disambiguate it), the only change needed is again to the very first SELECT statement:

SELECT
  FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date,
  entities.name word,
  entities.numMentions count,
  entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE entities.mid IS NOT NULL
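
To get a feel for how much data survives that filter, a hedged sketch (the sample day is arbitrary) that measures the fraction of entity records carrying a MID on a single day:

SELECT COUNTIF(entities.mid IS NOT NULL) / COUNT(*) mid_coverage
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE DATE(date) = '2019-10-01'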

Hopefully this gives you all sorts of new ideas for how the new GEG can be analyzed with just a single query, leveraging the power of BigQuery!