Creating A Daily Global Shortage Timeline Using Web NGrams 3.0 & BigQuery In One SQL Query

Earlier today we showed how to use Web NGrams 3.0 and BigQuery to track mentions of "shortages of" across English language news coverage and display the resulting KWIC contextual snippets. While showcasing the exceptional power of GDELT's data in tracking the global risk landscape, a true horizon scanning system must be able to distill this information down, transforming a spreadsheet of thousands of entries into a histogram of the top most-discussed shortages by day over time. It turns out that with just a few extra lines of SQL, we can do exactly that!

What if instead of displaying the snippets accompanying each mention of "shortages of" in the ngrams dataset, we instead used a regular expression to extract the following word and constructed a daily histogram, displaying the top 10 entries per day over time? We can do this in a single SQL query:

select date, ARRAY_TO_STRING(entries, ',') shortages from (
  select date, ARRAY_AGG(obj ORDER BY cnt DESC LIMIT 10) entries from (
       SELECT DATE(date) date, LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj, count(1) cnt FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) >= "2022-09-15" and lang='en' and 
       (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages') and LOWER(post) like 'of %' group by date, obj having cnt > 1 and obj is not null order by cnt desc
  ) group by date order by date asc
)

The inner-most query uses REGEXP_EXTRACT() to extract the word following "shortage(s)" from English-language coverage and collapses the precise article publication timestamp to daily resolution. To simplify matching hyphenated words like "f-16" and other edge cases, we use a manually-curated list of punctuation. This list of words is then aggregated by day and the top ten results returned for each day using ARRAY_AGG(). This actually generates the desired timeline, but in nested JSON format, which the outermost ARRAY_TO_STRING collapses into a comma-delimited list.

You can see the full list in the table below. A quick skim through this table shows childcare, worker, food, water and gasoline shortages receiving the most attention on September 15th. Worker, mental health, affordable housing, fuel and microchip shortages dominate on the 16th. Notably, the Moderna vaccine shortage appears in the top ten shortages beginning on the 22nd.

date shortages
9/15/2022 such,childcare,the,workers,bizarre,gasoline,staff,food,public,water
9/16/2022 workers,the,mental,food,both,affordable,microchips,foreign,fuel,raw
9/17/2022 food,midwives,wonder,fertiliser,affordable,teachers,natural,the,things,foreign
9/18/2022 physicians,potential,water,affordable,other,essential,food,fuel,the,teachers
9/19/2022 childcare,food,foreign,staff,skilled,fueling,teachers,affordable,chips,water
9/20/2022 family,housing,specialists,computer,fueling,antiretroviral,food,skilled,difficult,chips
9/21/2022 difficult,computer,supply,specialists,manpower,the,workers,deep,infant,housing
9/22/2022 food,difficult,housekeeping,carers,computer,housing,foodgrains,the,moderna,dealer/casino
9/23/2022 fuel,moderna,food,critical,the,legal,gas,workers,female,housing
9/24/2022 fuel,troops,activewear,essential,food,labor,moderna,essentials,high-tech,the

Obviously a production application would use an entity extraction library rather than simply extracting the following word by regex. Importantly, a production application would also use a much more flexible set of extraction rules, including dependency parses or neural IE, rather than the naive grammatical extraction used here. Despite these limitations, this showcases just how powerful the results can be from even a trivial extractor pipeline.

In some cases, knowing the number of times each entity was mentioned would be useful, allowing for prioritization and stratification. Using the same query, the ARRAY_AGG() can simply be modified to wrap around a STRUCT(), with an outer-most UNNEST() flattening for CSV download, yielding the revised SQL query:

select date, entry.obj object, entry.cnt count from (
  select date, ARRAY_AGG(STRUCT(obj, cnt) ORDER BY cnt DESC LIMIT 10) entries from (
       SELECT DATE(date) date, LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj, count(1) cnt FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) >= "2022-09-15" and lang='en' and 
       (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages') and LOWER(post) like 'of %' group by date, obj having cnt > 1 and obj is not null order by cnt desc
  ) group by date order by date asc
), UNNEST(entries) entry

This results in the following table, containing the same results as above, but with one word per row and the number of times it was mentioned.

date object count
9/15/2022 such 81
9/15/2022 childcare 40
9/15/2022 the 33
9/15/2022 workers 30
9/15/2022 bizarre 20
9/15/2022 staff 19
9/15/2022 gasoline 19
9/15/2022 food 18
9/15/2022 water 17
9/15/2022 public 17
9/16/2022 workers 107
9/16/2022 mental 26
9/16/2022 the 26
9/16/2022 food 23
9/16/2022 both 20
9/16/2022 affordable 20
9/16/2022 microchips 19
9/16/2022 foreign 18
9/16/2022 fuel 14
9/16/2022 skilled 13
9/17/2022 food 20
9/17/2022 midwives 18
9/17/2022 wonder 16
9/17/2022 fertiliser 16
9/17/2022 affordable 14
9/17/2022 teachers 13
9/17/2022 the 11
9/17/2022 natural 11
9/17/2022 things 11
9/17/2022 foreign 11
9/18/2022 physicians 33
9/18/2022 potential 21
9/18/2022 water 18
9/18/2022 affordable 17
9/18/2022 other 16
9/18/2022 essential 15
9/18/2022 food 14
9/18/2022 fuel 12
9/18/2022 the 9
9/18/2022 teachers 8
9/19/2022 childcare 85
9/19/2022 food 28
9/19/2022 foreign 25
9/19/2022 staff 21
9/19/2022 skilled 20
9/19/2022 fueling 17
9/19/2022 teachers 17
9/19/2022 affordable 16
9/19/2022 chips 15
9/19/2022 the 13
9/20/2022 family 88
9/20/2022 housing 81
9/20/2022 specialists 56
9/20/2022 computer 39
9/20/2022 fueling 33
9/20/2022 antiretroviral 27
9/20/2022 food 26
9/20/2022 skilled 23
9/20/2022 difficult 22
9/20/2022 chips 21
9/21/2022 difficult 247
9/21/2022 computer 135
9/21/2022 supply 48
9/21/2022 specialists 41
9/21/2022 manpower 25
9/21/2022 the 24
9/21/2022 workers 23
9/21/2022 deep 22
9/21/2022 infant 21
9/21/2022 housing 20
9/22/2022 food 68
9/22/2022 difficult 53
9/22/2022 housekeeping 48
9/22/2022 carers 47
9/22/2022 computer 35
9/22/2022 housing 28
9/22/2022 foodgrains 26
9/22/2022 the 25
9/22/2022 moderna 23
9/22/2022 dealer/casino 22
9/23/2022 fuel 178
9/23/2022 moderna 149
9/23/2022 food 112
9/23/2022 critical 86
9/23/2022 the 30
9/23/2022 legal 29
9/23/2022 gas 23
9/23/2022 workers 17
9/23/2022 female 16
9/23/2022 housing 16
9/24/2022 fuel 39
9/24/2022 troops 29
9/24/2022 activewear 24
9/24/2022 essential 21
9/24/2022 food 16
9/24/2022 labor 15
9/24/2022 essentials 14
9/24/2022 moderna 14
9/24/2022 high-tech 10
9/24/2022 the 8