With the Web NGrams 3.0 dataset's new historical backfile, covering January 1, 2020 to present, you can now perform retroactive keyword search and thematic scanning over everything GDELT has monitored over the last two years!
You can query this dataset in two primary ways: by processing the raw JSON files or through BigQuery. Note that due to the way the NGrams 3.0 dataset was computed, it contains an elevated number of duplicate records, which you can filter out by simply keeping only the first instance of each URL (an example of one way to do this in BigQuery appears later in this post).
Raw JSON File Processing
You can simply keyword search the raw JSON files of the Web News NGrams 3.0 dataset and join the results against the GDELT Article List (GAL) file to get the metadata for each article.
NOTE that this approach can require considerable bandwidth. Searching just a few dates can consume tens or even hundreds of gigabytes, so it should be performed only from a VM in the cloud or from a university or other institutional network, not over a home internet connection.
First, follow the directions for setting up realtime keyword scanning to download the "demo-keywordalerts.pl" Perl script, set up your keywords file and install the necessary libraries.
You can then use the exact same script to process these historical files. Simply generate a list of the specific minute date/times you wish to process and invoke "demo-keywordalerts.pl" for each minute. To make this easy, we've created a simple Perl script that generates all of the minute files for a given date range. Download "demo-makedaterangelist.pl" and make it executable:
wget http://data.gdeltproject.org/blog/2022-ngramsv3-demos/demo-makedaterangelist.pl
chmod 755 ./demo-makedaterangelist.pl
Then run it for a given date range, with both start and end dates expressed as YYYYMMDD:
./demo-makedaterangelist.pl 20211231 20220101
This will generate a new file called "DATES.TXT" that lists all of the minute files in this range. Remember that not all minutes will have data associated with them, but this script generates the full list of minutes across the given date range.
Now install GNU parallel:
apt-get -y install parallel
And then invoke "demo-keywordalerts.pl" for each minute in DATES.TXT, using all available processors on your computer:
time cat DATES.TXT | parallel --eta './demo-keywordalerts.pl {}'
When it eventually completes, you'll have a list of all of the matching results in the "./RESULTS/" subdirectory of the directory you ran it in.
BigQuery
A far simpler and more scalable approach is to use the BigQuery copy of the ngrams dataset, which you can analyze through simple SQL queries.
NOTE that due to the size of the ngrams dataset, you can quickly exceed your monthly BigQuery free search quota, so you should pay close attention to the estimated amount of data each query will scan (and thus its estimated cost) before running it.
Here is a simple query that searches for all articles GDELT saw on January 10, 2022 containing the English words "covid", "coronavirus", "omicron" or "delta variant" (using a case-insensitive match), then merges that list of matching URLs against the GDELT Article List to get the full metadata record for each article:
select * from `gdelt-bq.gdeltv2.gal` where DATE(date) = "2022-01-10" and url in (
  select distinct(url) from (
    SELECT url, IF(type=1, CONCAT(pre, ' ', ngram, ' ', post), CONCAT(pre, ngram, post)) search
    FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-10"
  ) where LOWER(search) like '%covid%' or LOWER(search) like '%coronavirus%'
    or LOWER(search) like '%omicron%' or LOWER(search) like '%delta variant%'
)
Note the use of "where DATE(date)" in both the GAL and NGrams queries to limit the search to that single day and minimize query costs. If you want a case-sensitive match, simply change "LOWER(search)" to "search". The "IF()" statement concatenates the pre+ngram+post strings correctly, placing spaces between them for space-segmented languages and joining them without spaces for scriptio continua languages.
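As noted at the start of this post, the duplicate records in the ngrams dataset mean you should keep only the first instance of each URL. The "distinct(url)" in the inner query already collapses duplicate URL matches, but if your final result set still contains more than one row per URL, the sketch below shows one way (not the only way) to keep just the first GAL record for each URL using BigQuery's ROW_NUMBER() window function; the single "omicron" keyword is used only to keep the example short:

-- Keep only the earliest GAL record for each matching URL, filtering out
-- the duplicate records noted earlier (one possible approach).
SELECT * EXCEPT(rownum) FROM (
  SELECT g.*, ROW_NUMBER() OVER (PARTITION BY url ORDER BY date) rownum
  FROM `gdelt-bq.gdeltv2.gal` g
  WHERE DATE(date) = "2022-01-10" AND url IN (
    SELECT DISTINCT(url) FROM (
      SELECT url, IF(type=1, CONCAT(pre, ' ', ngram, ' ', post), CONCAT(pre, ngram, post)) search
      FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-10"
    ) WHERE LOWER(search) LIKE '%omicron%'
  )
) WHERE rownum = 1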
Remember that ngrams are stored in each article's original language, so to search for "covid" across other languages you would need to translate the term into each language you are interested in.
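For example, here is a sketch of the same query extended with a couple of non-English terms. The specific translations used here ("коронавирус" for Russian, "新冠" for Chinese) are illustrative assumptions, so confirm the right terms for each language you care about:

-- Same structure as the query above, with illustrative non-English terms added;
-- verify the translations for the languages you are interested in.
select * from `gdelt-bq.gdeltv2.gal` where DATE(date) = "2022-01-10" and url in (
  select distinct(url) from (
    SELECT url, IF(type=1, CONCAT(pre, ' ', ngram, ' ', post), CONCAT(pre, ngram, post)) search
    FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-10"
  ) where LOWER(search) like '%covid%' or LOWER(search) like '%коронавирус%' or LOWER(search) like '%新冠%'
)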
Alternatively, if you want to perform more complex searches that require more sophisticated patterns, you can use "REGEXP_CONTAINS()" and express your search as a regular expression. This also yields a more compact query with the full range of regular-expression expressiveness:
select * from `gdelt-bq.gdeltv2.gal` where DATE(date) = "2022-01-10" and url in (
  select distinct(url) from (
    SELECT url, IF(type=1, CONCAT(pre, ' ', ngram, ' ', post), CONCAT(pre, ngram, post)) search
    FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-10"
  ) where REGEXP_CONTAINS(LOWER(search), r'(covid|coronavirus|omicron|delta variant)')
)
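Beyond retrieving individual articles, the same pattern can drive simple thematic timelines. The sketch below, which assumes only the url and date columns used in the queries above, counts distinct matching URLs per day over the first week of January 2022; remember that scanning a full week processes correspondingly more data than a single day, so keep the cost caveat above in mind:

-- Count distinct matching article URLs per day, yielding a simple daily
-- volume timeline for the search terms.
SELECT DATE(date) day, COUNT(DISTINCT url) articles
FROM (
  SELECT url, date, IF(type=1, CONCAT(pre, ' ', ngram, ' ', post), CONCAT(pre, ngram, post)) search
  FROM `gdelt-bq.gdeltv2.webngrams`
  WHERE DATE(date) BETWEEN "2022-01-01" AND "2022-01-07"
)
WHERE REGEXP_CONTAINS(LOWER(search), r'(covid|coronavirus|omicron|delta variant)')
GROUP BY day
ORDER BY day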
Hopefully this shows how easy it is to use the new NGrams 3.0 historical backfile to conduct historical keyword querying and thematic analysis of media coverage since the start of the pandemic!