Earlier today we showed how to use Web NGrams 3.0 and BigQuery to track mentions of "shortages of" across English-language news coverage and display the resulting KWIC contextual snippets. While showcasing the exceptional power of GDELT's data in tracking the global risk landscape, a true horizon scanning system must be able to distill this information further, transforming a spreadsheet of thousands of entries into a histogram of the most-discussed shortages per day over time. It turns out that with just a few extra lines of SQL, we can do exactly that!
What if instead of displaying the snippets accompanying each mention of "shortages of" in the ngrams dataset, we instead used a regular expression to extract the following word and constructed a daily histogram, displaying the top 10 entries per day over time? We can do this in a single SQL query:
SELECT date, ARRAY_TO_STRING(entries, ',') shortages FROM (
  SELECT date, ARRAY_AGG(obj ORDER BY cnt DESC LIMIT 10) entries FROM (
    SELECT DATE(date) date,
           LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj,
           count(1) cnt
    FROM `gdelt-bq.gdeltv2.webngrams`
    WHERE DATE(date) >= "2022-09-15" AND lang='en'
      AND (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages')
      AND LOWER(post) LIKE 'of %'
    GROUP BY date, obj
    HAVING cnt > 1 AND obj IS NOT NULL
    ORDER BY cnt DESC
  ) GROUP BY date ORDER BY date ASC
)
The innermost query uses REGEXP_EXTRACT() to pull out the word following "shortage(s) of" from English-language coverage and collapses each article's precise publication timestamp to daily resolution with DATE(). Rather than terminating the match at any non-letter character, the regular expression stops only at a manually curated list of punctuation, so hyphenated words like "f-16" and similar edge cases are captured whole. The middle query then aggregates these words by day and returns the top ten for each day using ARRAY_AGG(). This yields the desired timeline, but as a nested array, which ARRAY_TO_STRING() in the outermost query collapses into a comma-delimited list.
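The same extract-count-rank logic can be sketched locally in Python. This is purely illustrative: the sample snippets below are made up, standing in for webngrams rows whose ngram matched "shortage(s)" and whose post begins with "of ", but the regular expression and the top-N-per-day aggregation mirror the SQL above.

```python
import re
from collections import Counter, defaultdict

# Made-up (date, post) pairs standing in for matching webngrams rows.
rows = [
    ("2022-09-15", "of workers in the hospitality sector."),
    ("2022-09-15", "of workers, the report found."),
    ("2022-09-15", "of food and water in the region!"),
    ("2022-09-15", "of f-16 spare parts, officials said."),
]

# Same pattern as the SQL: capture the word after "of ", stopping at a
# manually curated list of punctuation (hyphens pass through, so "f-16"
# is captured whole).
PATTERN = re.compile(r"of ([a-zA-Z].*?)[ !,.:;?]")

daily = defaultdict(Counter)
for date, post in rows:
    m = PATTERN.search(post)
    if m:
        daily[date][m.group(1).lower()] += 1

# Top-10 per day with the cnt > 1 filter, mirroring the SQL's
# ARRAY_AGG(... ORDER BY cnt DESC LIMIT 10), HAVING cnt > 1, and the
# ARRAY_TO_STRING() comma join.
for date, counts in sorted(daily.items()):
    top = [obj for obj, cnt in counts.most_common(10) if cnt > 1]
    print(date, ",".join(top))
```

On this toy sample only "workers" clears the cnt > 1 threshold, just as a single-mention word would be dropped by the query's HAVING clause.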
You can see the full list in the table below. A quick skim through this table shows childcare, worker, food, water and gasoline shortages receiving the most attention on September 15th. Worker, mental health, affordable housing, fuel and microchip shortages dominate on the 16th. Notably, the Moderna vaccine shortage appears in the top ten shortages beginning on the 22nd.
Obviously a production application would use an entity extraction library rather than simply grabbing the following word with a regular expression, and would employ a far more flexible set of extraction rules, such as dependency parses or neural information extraction, in place of the naive grammatical pattern used here. Despite these limitations, the results showcase just how powerful even a trivial extraction pipeline can be.
In some cases, knowing the number of times each entity was mentioned would be useful, allowing for prioritization and stratification. Using the same query, the ARRAY_AGG() can simply be modified to wrap a STRUCT(), with an outermost UNNEST() flattening the results for CSV download, yielding the revised SQL query:
SELECT date, entry.obj object, entry.cnt count FROM (
  SELECT date, ARRAY_AGG(STRUCT(obj, cnt) ORDER BY cnt DESC LIMIT 10) entries FROM (
    SELECT DATE(date) date,
           LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj,
           count(1) cnt
    FROM `gdelt-bq.gdeltv2.webngrams`
    WHERE DATE(date) >= "2022-09-15" AND lang='en'
      AND (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages')
      AND LOWER(post) LIKE 'of %'
    GROUP BY date, obj
    HAVING cnt > 1 AND obj IS NOT NULL
    ORDER BY cnt DESC
  ) GROUP BY date ORDER BY date ASC
), UNNEST(entries) entry
This produces the following table, containing the same results as above but with one word per row alongside the number of times it was mentioned.
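The STRUCT() plus UNNEST() flattening can likewise be sketched in Python. Again the sample snippets are invented stand-ins for webngrams rows; the point is the shape of the output, one (date, object, count) tuple per row rather than a comma-joined list per day.

```python
import re
from collections import Counter, defaultdict

# Made-up (date, post) pairs standing in for matching webngrams rows.
rows = [
    ("2022-09-16", "of fuel across the country."),
    ("2022-09-16", "of fuel, the ministry warned."),
    ("2022-09-16", "of microchips for carmakers."),
]

PATTERN = re.compile(r"of ([a-zA-Z].*?)[ !,.:;?]")

daily = defaultdict(Counter)
for date, post in rows:
    m = PATTERN.search(post)
    if m:
        daily[date][m.group(1).lower()] += 1

# One (date, object, count) tuple per top-10 entry, mirroring the
# STRUCT(obj, cnt) array being flattened by UNNEST().
flat = [
    (date, obj, cnt)
    for date, counts in sorted(daily.items())
    for obj, cnt in counts.most_common(10)
]
for row in flat:
    print(row)
```

Each tuple corresponds to one row of the CSV the revised query produces.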