Advanced users frequently ask for a list of all of the themes found in the Themes and V2Themes fields of the GDELT Global Knowledge Graph (GKG), both to look for themes related to their topics of interest and to plan large-scale analyses. The last lookup file was generated two years ago, so today we have updated the GKG Themes lookup file.
Using a single SQL query in BigQuery, it took just 21 seconds to process 433GB of Themes data and its 21.9 billion theme-document pairings to generate the final lookup. Applied to the V2Themes field instead, it takes 33 seconds to process 898GB of thematic data totaling 36.6 billion theme-document pairings.
The final lookup is available for download below:
For those interested in the details of how this lookup was created, the following SQL query was used in BigQuery, with the results exported in CSV format and reformatted on a GCE instance to the TSV file above. As written, the query covers a single day; remove the two _PARTITIONTIME conditions from the WHERE clause to run it over the entire dataset.
WITH nested AS (
  SELECT SPLIT(RTRIM(Themes, ';'), ';') themes
  FROM `gdelt-bq.gdeltv2.gkg_partitioned`
  WHERE _PARTITIONTIME >= "2019-09-04 00:00:00"
    AND _PARTITIONTIME < "2019-09-05 00:00:00"
    AND LENGTH(Themes) > 1
)
SELECT theme, COUNT(1) cnt
FROM nested, UNNEST(themes) AS theme
GROUP BY theme
ORDER BY cnt DESC
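For readers who want to sanity-check the logic on a small sample before running it at scale, here is a minimal Python sketch of what the query does, plus one possible version of the CSV-to-TSV step. The function names count_themes and csv_to_tsv are hypothetical, the sample theme strings are illustrative only, and the assumption that the CSV export has a header row which the TSV drops is ours; the original conversion script is not shown in this post.

```python
from collections import Counter
import csv
import io


def count_themes(theme_fields):
    """Tally themes across documents, mirroring the SQL query above.

    `theme_fields` is an iterable of raw Themes strings, one per document,
    each a semicolon-delimited list such as "A;B;". None stands in for NULL.
    """
    counts = Counter()
    for field in theme_fields:
        # Same filter as `LENGTH(Themes) > 1` (NULL rows are dropped too).
        if field is None or len(field) <= 1:
            continue
        # RTRIM(Themes, ';') followed by SPLIT(..., ';'), then count each
        # element once per occurrence, like UNNEST + GROUP BY + COUNT(1).
        counts.update(field.rstrip(";").split(";"))
    return counts


def csv_to_tsv(csv_text):
    """Convert a `theme,cnt` CSV export to TSV.

    Assumption: the export starts with a header row, which is skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    sample = ["TAX_FNCACT;EPU_POLICY;", "TAX_FNCACT;", ""]
    # most_common() returns pairs sorted by count, like ORDER BY cnt DESC.
    for theme, cnt in count_themes(sample).most_common():
        print(f"{theme}\t{cnt}")
```

Like the query, the sketch counts one pairing per occurrence of a theme in a document's field, so the totals correspond to theme-document pairings rather than distinct documents.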