The GDELT Project

New GKG 2.0 Article Metadata Fields

GDELT 2.0: The Planet in Realtime in 65 Languages and 2,300 Emotions and Themes

With the debut this week of our massive new document extraction platform, we are tremendously excited to unveil several new metadata fields added to the Global Knowledge Graph (GKG 2.0). As of yesterday, all GKG 2.0 records include several new extracted fields in their "XMLExtras" field (the last column of each GKG record). Because these metadata fields are extracted from each document rather than computed, a given document may not include every field. We believe these new metadata fields will open up a number of new analytic and interface capabilities, and we look forward to seeing what you are able to accomplish with them!
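For readers working with raw GKG records rather than BigQuery, the following minimal Python sketch shows one way to pull a single tagged field out of the last column of a record. It assumes the record is tab-delimited and that each field is wrapped in a matching pair of tags, as in the PAGE_ALTURL_AMP example used in this post. The function name extract_extras_field and the three-column sample record are illustrative inventions; real GKG records contain many more columns.

```python
import re
from typing import Optional


def extract_extras_field(gkg_line: str, tag: str) -> Optional[str]:
    """Return the text between <tag> and </tag> in the last column, or None.

    Assumes a tab-delimited record whose final column holds the
    XML-style extras block described above.
    """
    # The extras block is the last column of the record.
    extras = gkg_line.rstrip("\r\n").split("\t")[-1]
    safe_tag = re.escape(tag)
    # Non-greedy match, mirroring the (.*?) pattern used with REGEXP_EXTRACT.
    match = re.search(rf"<{safe_tag}>(.*?)</{safe_tag}>", extras)
    # Fields are extracted, not computed, so a record may simply lack the tag.
    return match.group(1) if match else None


# Made-up, heavily shortened record for illustration only.
sample = (
    "20190908120000\thttps://example.com/story\t"
    "<PAGE_ALTURL_AMP>https://example.com/amp/story</PAGE_ALTURL_AMP>"
)
print(extract_extras_field(sample, "PAGE_ALTURL_AMP"))
print(extract_extras_field(sample, "PAGE_LINKS"))
```

Returning None for a missing tag keeps the "not all documents include all fields" case explicit, so callers can filter those records out instead of treating them as errors.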

EXTRACTING FIELDS USING BIGQUERY

To extract one of these new fields using BigQuery, use REGEXP_EXTRACT. For example, to retrieve the AMP URL for articles with an AMP version available, use the following Legacy SQL query:

SELECT REGEXP_EXTRACT(Extras, r'<PAGE_ALTURL_AMP>(.*?)</PAGE_ALTURL_AMP>') AS amp_url
FROM [gdelt-bq:gdeltv2.gkg]
WHERE Extras LIKE '%<PAGE_ALTURL_AMP>%'
ORDER BY DATE DESC

Using Standard SQL, a similar query for the PAGE_LINKS field, restricted to a single day of the partitioned table, would be:

SELECT DATE, DocumentIdentifier, REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>') AS links
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE _PARTITIONTIME >= "2019-09-08 00:00:00"
  AND _PARTITIONTIME < "2019-09-09 00:00:00"
  AND Extras LIKE '%<PAGE_LINKS>%'
LIMIT 10000
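The Standard SQL query above follows a pattern that works for any of the new tags: extract the tag with REGEXP_EXTRACT, filter on its presence with LIKE, and bound the scan to one day of partitions. As a rough sketch, the hypothetical Python helper below (build_extras_query is our own name, not part of any GDELT or BigQuery library) generates that query text for a given tag and day. It only builds the string; actually running it requires a BigQuery client or the web console, which is not shown here.

```python
import re
from datetime import date, timedelta


def build_extras_query(tag: str, day: date, limit: int = 10000) -> str:
    """Build a Standard SQL query that extracts one tag for one day."""
    # Accept only plain tag names so the tag cannot alter the SQL text.
    if not re.fullmatch(r"[A-Z0-9_]+", tag):
        raise ValueError(f"unexpected tag name: {tag!r}")
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return (
        "SELECT DATE, DocumentIdentifier, "
        f"REGEXP_EXTRACT(Extras, r'<{tag}>(.*?)</{tag}>') AS value\n"
        "FROM `gdelt-bq.gdeltv2.gkg_partitioned`\n"
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start}')\n"
        f"  AND _PARTITIONTIME < TIMESTAMP('{end}')\n"
        f"  AND Extras LIKE '%<{tag}>%'\n"
        f"LIMIT {int(limit)}"
    )


print(build_extras_query("PAGE_LINKS", date(2019, 9, 8)))
```

Restricting the partition range to a single day keeps the amount of data scanned small, which matters because the Extras column is large.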