Realtime GKG-Like Custom Thematic Scanning Of The News Using Web NGrams 3.0

One of the most common requests we receive from researchers is the ability to create custom GKG themes. Using the new Web News NGrams 3.0 dataset, it is now possible for you to perform your own GKG-like thematic scanning of the web using custom keywords in realtime!

First, create a new tab-delimited text file called "KEYWORDS.TXT" with three columns. The first is the keyphrase to search for; this can be in any language and can be a single word or a multi-word phrase. The second column specifies whether the keyphrase should be searched for in case sensitive or case insensitive fashion: a value of 0 indicates a case sensitive search (such as for a proper name or acronym), while a value of 1 indicates a case insensitive search. The third column is a comma-separated list of themes to associate with this keyword.

You can download a sample KEYWORDS.TXT file to see the proper formatting. You can have as many rows as you want here. Themes can be in any format you want, as long as they don't contain commas or tabs inside the theme name. You might choose to adopt the same format as used by the GKG, which is to use all-uppercase theme names with underscores in place of spaces. The only restriction is that themes must use ASCII characters at this time, but you could easily modify the script to allow for Unicode theme names.
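For illustration, a KEYWORDS.TXT might contain rows like the following (the keyphrases and theme names are invented for this example, with the three columns separated by tabs):

Vladimir Putin	0	LEADER_PUTIN
NATO	0	ORG_NATO,SECURITY_MILITARY
climate change	1	ENV_CLIMATECHANGE

Note that the proper name and the acronym use a value of 0 for case sensitive matching, while the generic phrase uses 1 to match regardless of capitalization.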

Then install "pigz" and "curl" on your system and ensure you have a recent version of PERL and JSON::XS. On Debian systems you can do so via:

apt-get -y install pigz
apt-get -y install curl
apt-get -y install libjson-xs-perl
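To verify that JSON::XS installed correctly, one quick check is a PERL one-liner that prints the module's version:

perl -MJSON::XS -e 'print $JSON::XS::VERSION, "\n"'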

Then download the "demo-keywordalerts.pl" PERL script and put it in the same directory as your KEYWORDS.TXT file. Make it executable:

chmod 755 ./demo-keywordalerts.pl

Now you can simply run "./demo-keywordalerts.pl" every minute to scan the Web NGrams 3.0 dataset for your keywords! The script downloads the Web NGrams 3.0 file from two minutes ago and scans it for all of your keywords. If there were any matches, it downloads the corresponding GDELT Article List file to get the publication metadata for each article (title, language, author, etc), merges the two together and writes the results to a "./RESULTS/" subdirectory as "YYYYMMDDHHMMSS.results.json", where the timestamp is that of the file from two minutes ago. Each row is the GDELT Article List record for the URL, with a single field called "themes" appended to the end of the record that contains the comma-separated list of matching themes. Only articles that matched your keywords will be included.
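As an example of what to expect, a single matched row in the output file might look something like the following (the metadata field names here are purely illustrative, not the exact Article List schema; the actual fields are whatever the Article List record contains, with "themes" appended):

{"url":"https://example.com/news/story.html","title":"Example Headline","lang":"ENGLISH","themes":"ENV_CLIMATECHANGE,ORG_NATO"}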

For the time being, until GDELT 3.0 launches, there will be a high level of article duplication in the Web NGrams 3.0 dataset, so you may see the same URL appear multiple times and should filter out duplicates if needed.
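If duplicates matter for your application, one simple approach is a small filter that remembers every URL it has already emitted in a flat file. The sketch below is our own illustration, not part of the demo script, and assumes each result row is a JSON object with a "url" field:

#!/usr/bin/perl
# dedup.pl - illustrative sketch: drop result rows whose URL was already
# seen in a previous run, using a flat file of previously seen URLs.
use strict;
use warnings;
use JSON::XS;

my %SEEN;
if (open(my $in, '<', './seen_urls.txt')) {
    while (my $url = <$in>) { chomp($url); $SEEN{$url} = 1; }
    close($in);
}

open(my $log, '>>', './seen_urls.txt') or die "seen_urls.txt: $!";
while (my $line = <STDIN>) {
    my $rec = eval { decode_json($line) } or next;   # skip malformed rows
    next if !defined($rec->{url}) || $SEEN{$rec->{url}};
    $SEEN{$rec->{url}} = 1;
    print $log "$rec->{url}\n";   # remember this URL for future runs
    print $line;                  # pass new records through unchanged
}
close($log);

You would then pipe each results file through it, e.g. "./dedup.pl < ./RESULTS/YYYYMMDDHHMMSS.results.json".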

That's it! That's all it takes to perform realtime custom thematic scanning of the news!

Of course, since the Web NGrams 3.0 dataset contains each article in its original language, rather than a machine translation, if you want to search for a given keyword across languages you'll need to translate it into each of the languages you wish to search.
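For instance, to track flood coverage across several languages you might add one KEYWORDS.TXT row per translation, all mapping to the same theme (the rows here are illustrative):

flood	1	NATURAL_DISASTER_FLOOD
inondation	1	NATURAL_DISASTER_FLOOD
inundación	1	NATURAL_DISASTER_FLOOD
洪水	1	NATURAL_DISASTER_FLOOD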

Note that because this is a demonstration script, it uses a pathologically inefficient search method that simply loops over each keyword and searches for it in turn. On a typical single-CPU system this should return results within a few seconds for a small set of keywords, but as the number of keywords grows its performance will degrade and you will likely want a more scalable search algorithm. However, a simple for loop wrapped around an index() call makes the code trivial to understand and sidesteps the complex logic required by more efficient searching algorithms.
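In PERL, the core of that naive scan amounts to something like the following sketch (the data structures and variable names are our own illustration, not the demo script's actual internals):

use strict;
use warnings;

# Each keyphrase maps to [ caseflag, comma-separated themes ].
my %KEYWORDS = (
    'NATO'           => [ 0, 'ORG_NATO' ],
    'climate change' => [ 1, 'ENV_CLIMATECHANGE' ],
);
my $text = 'Leaders discussed Climate Change at the summit.';

my %matchedthemes;
foreach my $kw (keys %KEYWORDS) {
    my ($caseflag, $themes) = @{ $KEYWORDS{$kw} };
    my $found = ($caseflag == 0)
        ? index($text, $kw) > -1            # case sensitive match
        : index(lc($text), lc($kw)) > -1;   # case insensitive match
    if ($found) { $matchedthemes{$_} = 1 foreach split(/,/, $themes); }
}
print join(',', sort keys %matchedthemes), "\n";   # prints: ENV_CLIMATECHANGE

Once the keyword list grows into the hundreds or thousands, replacing this loop with a single precompiled regular expression alternation or an Aho-Corasick automaton is the usual next step.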

Automatically running this script every minute takes just a few more steps!

Create a new script called "run.sh" in the same directory and set its contents to:

#!/bin/sh
cd /YOURDIRECTORY/
./demo-keywordalerts.pl

Then create a crontab file called "CRONJOB.TXT" that runs this script every minute:

0-59 0-23 * * * /YOURDIRECTORY/run.sh

And enable it:

crontab /YOURDIRECTORY/CRONJOB.TXT

That's it! Every minute the script will run and, if it finds any matches, will output a new file to the ./RESULTS/ subdirectory. Given that GDELT 2.0 operates on a 15 minute heartbeat, you'll typically see files cluster for a few minutes every 15 minutes, with a gap of no results in the minutes in between.

You can easily extend this PERL script with far more advanced and efficient behaviors, or simply use it as a template to reimplement in the language of your choice, such as Python.

Happy searching!