Creating A Daily Global Shortage Timeline Using Web NGrams 3.0 & BigQuery In One SQL Query

Earlier today we showed how to use Web NGrams 3.0 and BigQuery to track mentions of "shortages of" across English language news coverage and display the resulting KWIC contextual snippets. While showcasing the exceptional power of GDELT's data in tracking the global risk landscape, a true horizon scanning system must be able to distill this information down, transforming a spreadsheet of thousands of entries into a histogram of the top most-discussed shortages by day over time. It turns out that with just a few extra lines of SQL, we can do exactly that!

What if instead of displaying the snippets accompanying each mention of "shortages of" in the ngrams dataset, we instead used a regular expression to extract the following word and constructed a daily histogram, displaying the top 10 entries per day over time? We can do this in a single SQL query:

select date, ARRAY_TO_STRING(entries, ',') shortages from (
  select date, ARRAY_AGG(obj ORDER BY cnt DESC LIMIT 10) entries from (
       SELECT DATE(date) date, LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj, count(1) cnt FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) >= "2022-09-15" and lang='en' and 
       (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages') and LOWER(post) like 'of %' group by date, obj having cnt > 1 and obj is not null order by cnt desc
  ) group by date order by date asc
)

The inner-most query uses REGEXP_EXTRACT() to extract the word following "shortage(s)" from English-language coverage and collapses the precise article publication timestamp to daily resolution. To simplify matching hyphenated words like "f-16" and other edge cases, we use a manually-curated list of punctuation. This list of words is then aggregated by day and the top ten results returned for each day using ARRAY_AGG(). This actually generates the desired timeline, but in nested JSON format, which the outermost ARRAY_TO_STRING collapses into a comma-delimited list.

You can see the full list in the table below. A quick skim through this table shows childcare, worker, food, water and gasoline shortages receiving the most attention on September 15th. Worker, mental health, affordable housing, fuel and microchip shortages dominate on the 16th. Notably, the Moderna vaccine shortage appears in the top ten shortages beginning on the 22nd.

date	shortages
9/15/2022	such,childcare,the,workers,bizarre,gasoline,staff,food,public,water
9/16/2022	workers,the,mental,food,both,affordable,microchips,foreign,fuel,raw
9/17/2022	food,midwives,wonder,fertiliser,affordable,teachers,natural,the,things,foreign
9/18/2022	physicians,potential,water,affordable,other,essential,food,fuel,the,teachers
9/19/2022	childcare,food,foreign,staff,skilled,fueling,teachers,affordable,chips,water
9/20/2022	family,housing,specialists,computer,fueling,antiretroviral,food,skilled,difficult,chips
9/21/2022	difficult,computer,supply,specialists,manpower,the,workers,deep,infant,housing
9/22/2022	food,difficult,housekeeping,carers,computer,housing,foodgrains,the,moderna,dealer/casino
9/23/2022	fuel,moderna,food,critical,the,legal,gas,workers,female,housing
9/24/2022	fuel,troops,activewear,essential,food,labor,moderna,essentials,high-tech,the

Obviously a production application would use an entity extraction library rather than simply extracting the following word by regex. Importantly, a production application would also use a much more flexible set of extraction rules, including dependency parses or neural IE, rather than the naive grammatical extraction used here. Despite these limitations, this showcases just how powerful the results can be from even a trivial extractor pipeline.

In some cases, knowing the number of times each entity was mentioned would be useful, allowing for prioritization and stratification. Using the same query, the ARRAY_AGG() can simply be modified to wrap around a STRUCT(), with an outer-most UNNEST() flattening for CSV download, yielding the revised SQL query:

select date, entry.obj object, entry.cnt count from (
  select date, ARRAY_AGG(STRUCT(obj, cnt) ORDER BY cnt DESC LIMIT 10) entries from (
       SELECT DATE(date) date, LOWER(REGEXP_EXTRACT(post, r'of ([a-zA-Z].*?)[ !,.:;?]')) obj, count(1) cnt FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) >= "2022-09-15" and lang='en' and 
       (LOWER(ngram) = 'shortage' OR LOWER(ngram) = 'shortages') and LOWER(post) like 'of %' group by date, obj having cnt > 1 and obj is not null order by cnt desc
  ) group by date order by date asc
), UNNEST(entries) entry

This results in the following table, containing the same results as above, but with one word per row and the number of times it was mentioned.

date	object	count
9/15/2022	such	81
9/15/2022	childcare	40
9/15/2022	the	33
9/15/2022	workers	30
9/15/2022	bizarre	20
9/15/2022	staff	19
9/15/2022	gasoline	19
9/15/2022	food	18
9/15/2022	water	17
9/15/2022	public	17
9/16/2022	workers	107
9/16/2022	mental	26
9/16/2022	the	26
9/16/2022	food	23
9/16/2022	both	20
9/16/2022	affordable	20
9/16/2022	microchips	19
9/16/2022	foreign	18
9/16/2022	fuel	14
9/16/2022	skilled	13
9/17/2022	food	20
9/17/2022	midwives	18
9/17/2022	wonder	16
9/17/2022	fertiliser	16
9/17/2022	affordable	14
9/17/2022	teachers	13
9/17/2022	the	11
9/17/2022	natural	11
9/17/2022	things	11
9/17/2022	foreign	11
9/18/2022	physicians	33
9/18/2022	potential	21
9/18/2022	water	18
9/18/2022	affordable	17
9/18/2022	other	16
9/18/2022	essential	15
9/18/2022	food	14
9/18/2022	fuel	12
9/18/2022	the	9
9/18/2022	teachers	8
9/19/2022	childcare	85
9/19/2022	food	28
9/19/2022	foreign	25
9/19/2022	staff	21
9/19/2022	skilled	20
9/19/2022	fueling	17
9/19/2022	teachers	17
9/19/2022	affordable	16
9/19/2022	chips	15
9/19/2022	the	13
9/20/2022	family	88
9/20/2022	housing	81
9/20/2022	specialists	56
9/20/2022	computer	39
9/20/2022	fueling	33
9/20/2022	antiretroviral	27
9/20/2022	food	26
9/20/2022	skilled	23
9/20/2022	difficult	22
9/20/2022	chips	21
9/21/2022	difficult	247
9/21/2022	computer	135
9/21/2022	supply	48
9/21/2022	specialists	41
9/21/2022	manpower	25
9/21/2022	the	24
9/21/2022	workers	23
9/21/2022	deep	22
9/21/2022	infant	21
9/21/2022	housing	20
9/22/2022	food	68
9/22/2022	difficult	53
9/22/2022	housekeeping	48
9/22/2022	carers	47
9/22/2022	computer	35
9/22/2022	housing	28
9/22/2022	foodgrains	26
9/22/2022	the	25
9/22/2022	moderna	23
9/22/2022	dealer/casino	22
9/23/2022	fuel	178
9/23/2022	moderna	149
9/23/2022	food	112
9/23/2022	critical	86
9/23/2022	the	30
9/23/2022	legal	29
9/23/2022	gas	23
9/23/2022	workers	17
9/23/2022	female	16
9/23/2022	housing	16
9/24/2022	fuel	39
9/24/2022	troops	29
9/24/2022	activewear	24
9/24/2022	essential	21
9/24/2022	food	16
9/24/2022	labor	15
9/24/2022	essentials	14
9/24/2022	moderna	14
9/24/2022	high-tech	10
9/24/2022	the	8

The GDELT Project

Creating A Daily Global Shortage Timeline Using Web NGrams 3.0 & BigQuery In One SQL Query

Archives