New November 2021 GKG 2.0 Themes Lookup

Advanced users are frequently interested in a list of all of the themes found in the GDELT Global Knowledge Graph (GKG)'s Themes and V2Themes fields in order to look for themes related to their topics of interest and to plan large-scale analyses. The last lookup file was generated two years ago, so today we have updated the GKG Themes lookup file.

Using a single SQL query in BigQuery, it took just 29 seconds to process 569GB of Themes data and its 28.6 billion theme-document pairings to generate the final lookup. Applied to the V2Themes field, the same approach takes 38 seconds to process 1.2TB of thematic data totaling 48.1 billion theme-document pairings.

The final lookup is available for download below:

Computing this histogram is as simple as using the query below:

WITH nested AS (
  SELECT SPLIT(RTRIM(Themes, ';'), ';') AS themes
  FROM `gdelt-bq.gdeltv2.gkg_partitioned`
  WHERE LENGTH(Themes) > 1
)
SELECT theme, COUNT(1) AS cnt FROM nested, UNNEST(themes) AS theme GROUP BY theme ORDER BY cnt DESC
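
For readers who want to see what the query does to each row without running BigQuery, here is a minimal Python sketch of the same steps: trim the trailing semicolon, split on semicolons, and count each theme. The function name `theme_histogram` and the sample values are made up for illustration and are not taken from the GKG itself.

```python
from collections import Counter


def theme_histogram(rows):
    """Count theme occurrences across semicolon-delimited Themes values."""
    counts = Counter()
    for themes in rows:
        # Mirror the query's length filter: skip NULL and empty values.
        if themes is None or len(themes) <= 1:
            continue
        # Mirror SPLIT(RTRIM(Themes, ';'), ';') followed by UNNEST.
        counts.update(themes.rstrip(";").split(";"))
    # Mirror ORDER BY cnt DESC.
    return counts.most_common()


# Hypothetical sample rows; real values come from the Themes field.
sample_rows = ["TAX_FNCACT;EPU_POLICY;", "TAX_FNCACT;", "", None]
print(theme_histogram(sample_rows))
```

Each returned pair is a theme and the number of times it was counted, which is the same shape as the `theme`, `cnt` rows produced by the query.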

Computing a histogram over the V2Themes field is as simple as:

WITH nested AS (
  SELECT SPLIT(RTRIM(REGEXP_REPLACE(V2Themes, r',\d+;', ';'), ';'), ';') AS themes
  FROM `gdelt-bq.gdeltv2.gkg_partitioned`
  WHERE LENGTH(V2Themes) > 1
)
SELECT theme, COUNT(1) AS cnt FROM nested, UNNEST(themes) AS theme GROUP BY theme ORDER BY cnt DESC
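
The only difference in the V2Themes query is the `REGEXP_REPLACE` step, which removes the `,<digits>` suffix attached to each theme before splitting. The Python sketch below applies the same pattern to a single made-up value so the effect is easy to inspect. The function name `strip_offsets` and the sample string are assumptions for illustration only.

```python
import re


def strip_offsets(v2themes):
    """Drop the ',<digits>' suffix from each entry and return the theme names."""
    # Mirror REGEXP_REPLACE(V2Themes, r',\d+;', ';').
    without_offsets = re.sub(r",\d+;", ";", v2themes)
    # Mirror SPLIT(RTRIM(..., ';'), ';').
    return without_offsets.rstrip(";").split(";")


# Hypothetical sample value in the 'THEME,number;' layout the regex expects.
sample = "TAX_FNCACT,120;EPU_POLICY,345;TAX_FNCACT,980;"
print(strip_offsets(sample))
```

If a theme is listed more than once in a value, as in this sample, each occurrence is kept and would therefore be counted separately by the histogram.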