Realtime Global Massively Multilingual Horizon Scanning Using GDELT's New Web News NGrams 3.0 Dataset

How can GDELT's new Web News NGrams 3.0 dataset be used for horizon scanning? With the release of this new dataset last week, it is now almost trivial to perform global-scale massively multilingual horizon scanning across the world's news landscape. To help get you started with this massive new dataset, we've put together a basic template of how you can rapidly build a global horizon scanning system using this new dataset and layer rich new analytic capabilities on it over time.

Initial Deployment: Keyword Scanning

Perhaps the most basic application is keyword scanning in which you have a predefined list of keywords and phrases you are interested in and you want to scan global news coverage for articles that mention those terms. These could be topical terms like "corruption" or "drought" or the names of specific entities like "United Nations" or "World Bank."

Using a single 2 or 4-core VM, you can write a simple Python script that runs by cronjob every minute and downloads the latest Web News NGrams 3.0 file from 2 minutes ago (typically there is a rolling 2 minute lag before new files become available). As of this writing (December 2021), files arrive in bursts every 15 minutes (:00, :01, :02, :03, :04, then :15, :16, :17, :18, :19, then :30 and so on), with no files in the minutes in between, since GDELT still internally operates on a 15 minute heartbeat. However, you should design your pipeline to expect a file every minute, as this spring GDELT will be moving to an internal one minute heartbeat. For the next month or so a number of duplicate URLs will appear in the dataset because of how it is generated, so simply filter those out.

In short, create a small VM that runs a cronjob every minute and downloads the ngram file from two minutes ago. For example, when the cronjob runs on December 15, 2021 at 00:03 UTC, it might use curl and pigz to pipe the latest ngram file into your processing script, which searches the ngrams for matches against your keyword list:

curl -s http://data.gdeltproject.org/gdeltv3/webngrams/20211215000100.webngrams.json.gz | pigz -d | ./myscript.py
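For reference, here is a minimal Python sketch of how your cron wrapper might compute that URL; the filename pattern follows the example above, and the two-minute offset simply mirrors the publishing lag described earlier:

#!/usr/bin/env python3
# Minimal sketch: compute the URL of the ngram file from two minutes ago (UTC).
# The filename pattern (YYYYMMDDHHMM00.webngrams.json.gz) follows the example above.
from datetime import datetime, timedelta

def latest_ngram_url(now=None):
    now = now or datetime.utcnow()
    stamp = (now - timedelta(minutes=2)).strftime("%Y%m%d%H%M") + "00"
    return f"http://data.gdeltproject.org/gdeltv3/webngrams/{stamp}.webngrams.json.gz"

if __name__ == "__main__":
    print(latest_ngram_url())  # feed this to curl in your cron wrapper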

Alternatively, you can use the BigQuery version of the dataset (make absolutely certain to use date limits to take advantage of the table partitioning and minimize the amount of data scanned).
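If you take the BigQuery route, a hedged sketch of a keyword query might look like the following; note that the table name (gdelt-bq.gdeltv2.webngrams) and the "date" column used here for partition pruning are assumptions to confirm against the documentation before running:

# Hedged BigQuery sketch: the table and column names are assumptions -- verify them
# against the Web News NGrams 3.0 documentation. The date filter limits the scan
# to recent partitions, which is what keeps query costs down.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT url, ngram, pre, post
FROM `gdelt-bq.gdeltv2.webngrams`
WHERE date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND ngram IN UNNEST(@keywords)
"""
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("keywords", "STRING", ["drought", "corruption"])
        ]
    ),
)
for row in job:
    print(row.url, row.ngram)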

Following the Web News NGrams 3.0 documentation, your script would then take your list of keywords and scan each row of the ngram file for matches, searching the "ngram" field for one-word keywords and the "ngram" field together with the "pre" and "post" fields to scan for 2-5+ word names. Instructions for searching scriptio continua languages (both ideographic like Chinese and Japanese and alphabetic like Myanmar and Thai) can also be found in the documentation.

You can use the "pre" and "post" fields for additional relevancy filtering, perhaps searching for the presence or absence of additional keywords to filter matches.

This gives you a list of matching URLs that contain your keywords of interest. To get the title, author, date, outlet name, image, summary and other details for each matching article, you should have a second Python script, also run via cronjob every minute, that downloads the latest GDELT Article List file from two minutes ago, which contains one row per article listing all of the metadata about that article. Simply take all of the matching URLs from the ngrams file and cross them against this file to look up their details. This gives you all of the metadata needed to display rich graphical links for each matching article.
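A minimal sketch of that second lookup script follows; it assumes the Article List file is piped in the same curl | pigz fashion as the ngram file and that each JSON row carries "url" and "title" fields, so check the Article List documentation for the exact file path and field names:

#!/usr/bin/env python3
# lookup_metadata.py -- join matched URLs against the Article List metadata file.
# Usage: curl -s <article-list-url> | pigz -d | ./lookup_metadata.py matches.txt
import json
import sys

# matches.txt: one URL per line, written by the keyword scanner above
with open(sys.argv[1], encoding="utf-8") as f:
    wanted = {line.strip() for line in f if line.strip()}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    article = json.loads(line)
    if article.get("url") in wanted:
        # title, image, outlet, etc. power the rich graphical links described above
        print(json.dumps({"url": article.get("url"), "title": article.get("title")}, ensure_ascii=False))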

It's literally that simple! Two cronjobs running on a single VM every minute: one downloading the ngram file and feeding it to a script that scans it for keywords of interest, and a second downloading the metadata for each matching article.

Note that since the ngrams dataset covers more than 150 languages and the ngrams are in their native languages, if you want to scan for Arabic-language coverage of "drought" you will need to compile a list of Arabic-language terms used to describe drought. (This is different from GDELT's GKG 2.0 themes, which rely on machine translation.) Thus, you will need to decide which languages you wish to scan and translate your terms into each of those languages yourself. For example, if you want to scan for mentions of "climate change" across global media, you would need to work with subject matter experts to compile a lexicon of the various terms for climatic change in each language you are interested in. You could also try bootstrapping your monitoring system by using Google Translate to translate your English terms into each language of interest, either by hand for a small number of terms using Google Translate's web interface or at scale using the Google Translation AI API, and then over time work with SMEs in each language to compile an expansive list of all of the keywords and phrases used in that language to describe climatic change.
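As an illustration of that bootstrapping step, a hedged sketch using the Cloud Translation API's basic (v2) Python client follows; the term list and target languages are placeholders, and the machine translations are only a starting point for SME review:

# Hedged bootstrap sketch: seed per-language keyword lexicons from English terms
# using the Cloud Translation API, then have SMEs expand and correct each list.
from google.cloud import translate_v2 as translate

client = translate.Client()
english_terms = ["climate change", "global warming", "drought"]
target_languages = ["ar", "es", "fr"]  # placeholder language codes

lexicon = {}
for lang in target_languages:
    results = client.translate(english_terms, target_language=lang)
    lexicon[lang] = [r["translatedText"] for r in results]

print(lexicon)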

The end result is that with a single small VM, two cronjobs and two scripts you can scan global news media in realtime in over 150 languages!

Enhancement 1: Interactive Keyword Search

The workflow above is designed for batch scanning using a predefined list of keywords, rather than interactive user-driven search. To enable interactive search, you can take the workflow above and feed the ngram dataset into an Elasticsearch cluster. Rather than building a traditional fulltext Elasticsearch index, you build an index of ngrams, where each record is a single 1-5 gram along with the article metadata from the Article List dataset, and user searches consist of an exact match against this set of ngrams. Alternatively, you can concatenate the "pre", "ngram" and "post" fields together into a short snippet to be indexed.

For interactive keyword search, you simply perform an exact match of each input word/phrase against this ngram field or a search of the contextual snippet and return the results to the user.
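A hedged sketch of what that Elasticsearch index and query might look like follows; the index name, field names and sample documents are illustrative, and the mapping mirrors the two options above (an exact-match "ngram" keyword field plus a fulltext "snippet" field):

# Hedged Elasticsearch sketch (elasticsearch-py 8.x style): one document per ngram record.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run once: exact-match keyword field for the ngram, fulltext field for pre+ngram+post.
es.indices.create(
    index="webngrams",
    mappings={
        "properties": {
            "ngram": {"type": "keyword"},
            "snippet": {"type": "text"},
            "url": {"type": "keyword"},
            "lang": {"type": "keyword"},
        }
    },
)

# Index one record (in practice, bulk-index each minute's file).
es.index(index="webngrams", document={
    "ngram": "drought",
    "snippet": "worst drought in decades",
    "url": "https://example.com/article",
    "lang": "en",
})

# Interactive search: exact ngram match OR phrase match against the snippet.
hits = es.search(index="webngrams", query={
    "bool": {"should": [
        {"term": {"ngram": "drought"}},
        {"match_phrase": {"snippet": "worst drought"}},
    ]}
})
print(hits["hits"]["total"])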

Enhancement 2: Keyword Context Relevancy Filtering

For either fixed keyword scanning or interactive search, you can use the "pre" and "post" context snippets to perform relevancy filtering of results. For example, a mention of "demonstrations" could refer either to citizen uprisings or the latest technology demo conference. Similarly, "Delta" could refer to "Delta Airlines" or the "Delta Covid-19 variant." How best to filter these results?

Relevancy filtering can be as simple as a set of additional manually-curated keywords that should or should not appear near the given term. For example, requiring that "variant" appear near "Delta" or that "airline" not appear near it to distinguish the Delta Covid-19 variant from Delta Airlines.
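In code, that kind of filter can be a few lines; the required and excluded term lists below are illustrative:

# Minimal sketch of context-based relevancy filtering for the "Delta" example:
# require or exclude extra terms in the pre/post context around each match.
def is_relevant(pre: str, post: str,
                require=("variant", "covid"),
                exclude=("airline", "airlines", "flight")) -> bool:
    context = f"{pre} {post}".lower()
    return any(w in context for w in require) and not any(w in context for w in exclude)

# is_relevant("the highly contagious", "variant is spreading")  -> True
# is_relevant("booked a", "Airlines flight to Atlanta")         -> False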

Alternatively, for more complex relevancy filtering, you could manually skim through the ngram dataset and select a set of relevant and irrelevant mentions and train an ML model (Bayesian or neural) that evaluates the "pre" and "post" contexts for the ultimate in relevancy filtering. This could be done using everything from a simple CPU-based Naïve Bayesian model to a handcrafted model in TensorFlow on a GPU VM to a Vertex AI hosted model to automated construction using AutoML.
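At the simple end of that spectrum, a hedged sketch of a CPU-based Naive Bayes relevancy model trained on hand-labeled pre+post contexts might look like this (the training examples and labels are toy placeholders):

# Hedged sketch: Naive Bayes relevancy model over hand-labeled pre+post contexts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = [
    "the highly contagious variant is spreading across",   # relevant
    "cases of the new variant reported in hospitals",      # relevant
    "booked a flight on the airline to atlanta",           # irrelevant
    "the airline announced new routes and baggage fees",   # irrelevant
]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(contexts, labels)

# Should output [1] on this toy data; real deployments need far more labeled contexts.
print(model.predict(["the variant drove a surge in hospitalizations"]))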

Enhancement 3: Article-Level Categorization & Relevancy

What about article-level categorization? Many applications are interested more in article-level categorization than keyword search. Perhaps the simplest out-of-the-box method is to use the Global Similarity Graph (GSG). This dataset consists of USEv4 document-level embeddings computed over the English machine translations of the 65 languages GDELT currently live-translates. To perform document-level categorization, you can simply find a sample article or articles that define each category of interest, then download the latest GSG file every minute, compute the cosine similarity of each new article's GSG embedding against the embeddings of your selected articles, and keep those above a chosen similarity threshold.
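A hedged sketch of that similarity check follows; the threshold value and the notion of "seed" embeddings for each category are assumptions to tune on your own data:

# Hedged sketch: keep articles whose GSG embedding is sufficiently similar to any
# of the seed articles defining a category.
import numpy as np

THRESHOLD = 0.6  # tune on a held-out sample

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_category(article_embedding: np.ndarray, seed_embeddings: list) -> bool:
    return any(cosine_similarity(article_embedding, s) >= THRESHOLD for s in seed_embeddings)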

Alternatively, for more generalized article categorization, simply concatenate together all of the pre+ngram+post snippets for each article into a single unordered text blob and run it through your categorization model. While this text will be heavily out of order, most models, such as BERT, should be able to look past this and generate a robust classification for the article.
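A hedged sketch of that snippet-blob approach follows, using a zero-shot classifier purely as a stand-in for whatever categorization model you actually deploy; the sample records and candidate labels are placeholders:

# Hedged sketch: group each article's pre+ngram+post snippets into one unordered blob,
# then classify the blob. The zero-shot model here is a stand-in for your own model.
from collections import defaultdict
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ngram_records = [  # toy records; in practice, parse these from the minute's ngram file
    {"url": "https://example.com/a", "pre": "rollout of the", "ngram": "vaccine", "post": "continued this week"},
    {"url": "https://example.com/a", "pre": "health officials said the", "ngram": "booster", "post": "campaign"},
]

blobs = defaultdict(list)
for rec in ngram_records:
    blobs[rec["url"]].append(f'{rec["pre"]} {rec["ngram"]} {rec["post"]}')

for url, pieces in blobs.items():
    result = classifier(" ".join(pieces)[:2000], candidate_labels=["public health", "conflict", "economy"])
    print(url, result["labels"][0])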

Enhancement 4: Entity Extraction

GDELT already runs a small random sample of articles each day through Google's Natural Language API to extract all of the entities mentioned within and makes this available as the Global Entity Graph (GEG). While this covers only a small random sample of articles each day, and only in the 11 languages supported by the Natural Language API, it offers a turnkey approach to accessing news as a collection of entities.

Alternatively, for each ngram record, you could concatenate the pre, ngram and post fields together and run the resulting snippet through the entity extraction tool of your choice, such as spaCy. Just remember that since ngrams represent a rolling window through an article's text, any given snippet might truncate an entity. For example, the final pre+ngram+post snippet for a given ngram might be "…Commissioner for Human Rights voiced concern that…", from which spaCy might recognize "Commissioner for Human Rights" as an entity. In reality, however, the full sentence reads "The United Nations Commissioner for Human Rights voiced concern that…," meaning that the recognized entity is incorrect since it truncates the actual entity. Thus, for maximal accuracy you may choose to exclude extracted entities that begin on the first word of the snippet or do not finish before the last word (keeping in mind edge cases where a connective word is the final word, such as "…Commissioner for…", where "for" is the last word in the snippet but is actually part of the entity name).
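A hedged spaCy sketch of that boundary filter follows; the model name is a placeholder (swap in the appropriate model for each language), and the filter simply drops entities touching either end of the snippet:

# Hedged sketch: extract entities from a pre+ngram+post snippet but drop any entity
# that starts on the first token or ends on the last token, since it may be truncated.
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder; use the appropriate model per language

def safe_entities(snippet: str):
    doc = nlp(snippet)
    return [
        ent for ent in doc.ents
        if ent.start > 0 and ent.end < len(doc)
    ]

# If spaCy tags "Commissioner for Human Rights" in the example snippet above, it starts
# on the first token and is therefore excluded by this filter.
print(safe_entities("Commissioner for Human Rights voiced concern that"))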

Enhancement 5: Semantic Search

Traditional keyword search can be highly limiting, since it requires an exact wording match. A search for "vaccines" and "microchips" won't match "nano chips" in "vials", for example. Moreover, it requires that users reduce their information-seeking needs down to a handful of keywords. Instead, the Global Similarity Graph (GSG) can be used to perform semantic search in which the query input can be an entire article. In short, rather than inputting keywords, users can input entire articles and find "more like this", or input keywords which the system interprets as general semantic concepts rather than specific words.

Performing large-scale similarity matching across large collections of embeddings is an active area of research and is immensely computationally intensive. To address this, you can feed the Global Similarity Graph into Google's Vertex AI Matching Engine to perform realtime semantic search. You can see a demo of this using GDELT's Global Similarity Graph to enable semantic search over the news (live demo) (GitHub repository).
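Before wiring up Matching Engine, a brute-force nearest-neighbor search over a day or two of GSG embeddings is often enough for prototyping; a minimal numpy sketch (with illustrative function and variable names) follows:

# Hedged sketch: brute-force "more like this" search over a matrix of GSG embeddings.
import numpy as np

def top_k_similar(query_embedding: np.ndarray, corpus: np.ndarray, urls: list, k: int = 10):
    """corpus: (N, D) matrix of GSG embeddings; urls: the N URLs in the same row order."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    scores = corpus_norm @ query_norm
    best = np.argsort(-scores)[:k]
    return [(urls[i], float(scores[i])) for i in best]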

Enhancement 6: Visualizing The Narrative Landscape Of A Topical Space

Oftentimes the most challenging aspect of a query is not finding matching articles, but rather distilling a massive set of results down into something understandable by a human analyst. For example, a search for news coverage about Covid-19 vaccines in a given day will yield thousands or even tens of thousands of articles that no human would have time to read through in their entirety.

Instead, use the ngrams dataset to first compile a list of coverage relevant to a given query. Then look up the Global Similarity Graph (GSG) embedding for each matching URL and feed that list of embeddings into the TensorFlow Embedding Projector, UMAP or a similar tool, which will cluster them together to create a visualization of the underlying narrative landscape, grouping together articles that are similar, even across languages and word choices.
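A hedged UMAP sketch of that visualization step follows; the input file name is a placeholder for however you assemble the matching articles' embeddings, and the UMAP parameters are common defaults rather than tuned values:

# Hedged sketch: project matching GSG embeddings to 2-D and plot the narrative landscape.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.load("matching_gsg_embeddings.npy")  # placeholder (N, 512) array you assembled
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("Narrative landscape of matching coverage")
plt.savefig("narrative_landscape.png", dpi=150)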

You can see a simple example for a day of Covid-19 vaccine coverage below.

We hope this basic template inspires you with ideas of how you can use GDELT's new ngram dataset to perform realtime global massively multilingual horizon scanning.

If you have any questions, ideas or examples we'd love to hear from you!