Querying The New Global Entity Graph (GEG) Datasets In BigQuery: Example Queries

Both the neural GEG (based on Google's Cloud Natural Language API) and classical GEG (based on a traditional HMM POS tagger + grammar) are available in BigQuery. The GEG datasets use JSON arrays to store the entity lists, which requires the use of the "UNNEST()" operator in BigQuery to access. The examples below showcase how to query the two datasets.

Here is a simple example of requesting all entities from the neural Cloud Natural Language API-powered GEG that were disambiguated by the API to a unique MID ID, ranked by salience.

SELECT url, entities.type, entities.name, entities.mid, entities.wikipediaUrl,entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
where entities.mid is not null and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00" order by entities.avgSalience desc limit 10000

Here is the same query, but returning all entities identified by the API, not just those it was able to resolve to a specific MID ID.

SELECT url, entities.type, entities.name, entities.mid, entities.wikipediaUrl,entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
where date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00" order by entities.avgSalience desc limit 10000

Here is the same query applied to the classical G1 Baseline GEG.

SELECT url, entities.type, entities.name, entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_g1`, unnest(entities) entities
where date >= "2019-02-06 00:00:00" AND date < "2019-02-07 00:00:00" order by entities.avgSalience desc limit 10000

Just proper names from the G1 Baseline GEG, ordered by salience.

SELECT url, entities.type, entities.name, entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_g1`, unnest(entities) entities
where entities.type='PROPER' and date >= "2019-02-06 00:00:00" AND date < "2019-02-07 00:00:00" order by entities.avgSalience desc limit 10000

Just numbers from the G1 Baseline GEG, ordered by salience and adding in the "context" field to the returned results.

SELECT url, entities.type, entities.name, entities.context, entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_g1`, unnest(entities) entities
where entities.type='NUMBER' and date >= "2019-02-06 00:00:00" AND date < "2019-02-07 00:00:00" order by entities.avgSalience desc limit 10000

Note that this returns values like "100 homes evacuated" rather than just flagging that "100" was found in the article, reflecting how powerful context is in helping to understand numeric counts.

Here is an example of finding all articles mentioning Ebola, using the neural GEG.

SELECT url, entities.name, entities.type, entities.numMentions, entities.mid, entities.wikipediaUrl,entities.avgSalience FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
where LOWER(entities.name)='ebola' and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00" order by entities.avgSalience desc limit 10000

These examples should help get you started with using these incredibly powerful new datasets!