The GDELT Global Entity Graph (GEG) dataset of 11 billion entities, extracted by Google's Cloud Natural Language API from 103 million English-language online news articles spanning July 17, 2016 through October 4, 2019, offers an ideal semantic lens through which to understand the patterns of global news.
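To get a sense of the table's nested structure before diving in, a quick exploratory query (a minimal sketch using only the fields referenced later in this post) looks like this:

# peek at a few rows to see the nested entities array
# (note that LIMIT reduces the rows returned, not the data scanned)
SELECT date, entities.name, entities.type, entities.numMentions, entities.mid
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
LIMIT 10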
This past July, Felipe Hoffa showed how a single BigQuery query could process the new TV News Ngram Dataset to generate a list of the top terms by day according to their TFIDF scores, performing the entire analysis inside of BigQuery.
With just a few changes, the exact same approach can be used to analyze the 11 billion entities extracted by the Cloud Natural Language API over the last three years to surface the top entities of each type by day.
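As a quick refresher on the scoring, the query treats each day+type combination as a "document" and ranks each entity w within document d by:

\mathrm{TFIDF}(w, d) = \frac{c(w, d)}{\sum_{w'} c(w', d)} \times \log\left(\frac{N}{n_w}\right)

where c(w, d) is the number of mentions of w in d, N is the total number of documents and n_w is the number of documents containing w; these correspond to the tf, idf and tfidf terms computed step by step in the full query below.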
To demonstrate, we've created three CSV files showing the top 10 entities of each type by day, ranked by their TFIDF scores: one covering all entities, one restricted to capitalized entities and one restricted to entities with non-null MID IDs.
Creating each of these three datasets required only a single query and around 160 seconds with BigQuery.
The full query can be seen below.
WITH nested AS (
  WITH data AS (
    SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type
    FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
  ), word_day_type AS (
    # how many times a word is mentioned in each "document"
    SELECT word, SUM(count) counts, date, type
    FROM data
    GROUP BY 1, 3, 4
  ), day_type AS (
    # total # of words in each "document"
    SELECT SUM(count) counts, date, type
    FROM data
    GROUP BY 2, 3
  ), tf AS (
    # TF for a word in a "document"
    SELECT word, date, type, a.counts/b.counts tf
    FROM word_day_type a
    JOIN day_type b
    USING(date, type)
  ), word_in_docs AS (
    # how many "documents" have a word
    SELECT word, COUNT(DISTINCT FORMAT('%s %s', date, type)) indocs
    FROM word_day_type
    GROUP BY 1
  ), total_docs AS (
    # total # of docs
    SELECT COUNT(DISTINCT FORMAT('%s %s', date, type)) total_docs
    FROM data
  ), idf AS (
    # IDF for a word
    SELECT word, LOG(total_docs.total_docs/indocs) idf
    FROM word_in_docs
    CROSS JOIN total_docs
  )
  SELECT date, ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words
  FROM (
    SELECT date, type, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words
    FROM (
      SELECT word, date, type, tf.tf * idf.idf tfidf
      FROM tf
      JOIN idf
      USING(word)
    )
    GROUP BY date, type
  )
  GROUP BY date
  ORDER BY date DESC
)
SELECT date, type, wordlist
FROM nested, UNNEST(top_words)
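As a side note, if you'd like to experiment on a smaller slice before running over the full three years, one option is to add a date filter to the inner "data" CTE (a sketch; the cutoff below is arbitrary, and whether it reduces the bytes scanned depends on how the table is partitioned):

SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
# illustrative cutoff: keep only articles from September 2019 onward
WHERE date >= TIMESTAMP("2019-09-01")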
To consider only capitalized entities (those whose names begin with an uppercase letter), the only change needed is to replace the very first SELECT statement in the query above with the following:
SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE REGEXP_CONTAINS(entities.name, r'^\p{Upper}')
To consider only entities with non-null MID IDs (the machine IDs the API assigns to entities it can resolve to Google's Knowledge Graph), the only change needed is to replace the very first SELECT statement with the following:
SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE entities.mid IS NOT NULL
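And while not one of the three datasets above, the two filters can of course be combined to look only at capitalized entities that also resolve to a MID:

SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entities
WHERE entities.mid IS NOT NULL
AND REGEXP_CONTAINS(entities.name, r'^\p{Upper}')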
Hopefully this gives you all sorts of new ideas for how the new GEG can be analyzed with just a single query, leveraging the power of BigQuery!