Identifying Breaking Online News Stories Across The World In Realtime With The GCP Timeseries Insights API & Natural Language API

Captured within GDELT's vast realtime archives are the patterns of Planet Earth. Last June we explored how Google Cloud’s new Timeseries Insights API, coupled with Google Cloud’s Natural Language API, could be used to retroactively tease out the moment that major stories first broke into the global news cycle. Yet the real power of such advanced temporal analysis lies in the ability to detect emerging stories in realtime, flagging their earliest glimmers. Today we are tremendously excited to announce a new experimental dataset that does exactly that, coupling the Timeseries Insights API with the Natural Language API to provide a first glimpse at what realtime entity-level trend analysis and anomaly detection looks like over the global news media.

We'll be fine-tuning this dataset over time, but for the moment, every 15 minutes we ask the Timeseries Insights API to examine the past hour of global English language online news coverage as annotated by the Natural Language API, comparing the trajectory of mentions of every entity over that hour against the previous three days and compiling a list of all entities whose mentions appear to be increasing sharply in the last 60 minutes. In essence, we are compiling an entity-level list of the people, organizations, locations, events and other major topics that are suddenly trending.

It is important to remember that at present GDELT runs only a small random sample of the English language coverage it monitors through the Natural Language API, so this dataset represents just a small fraction of what GDELT is monitoring across the world. The Natural Language API supports only a small number of languages at present, and GDELT annotates a much smaller percentage of coverage in those languages than in English, so for the moment this demonstration is limited to trend detection across English language online news coverage. Despite these limitations, it offers a powerful glimpse at what realtime trend detection across the news can look like!

Pipeline Overview

Below is a simplified version of the underlying pipeline:

  • Entity Annotation. Every minute, GDELT takes a small random sample of online news coverage it monitored over the previous minute in the 11 languages supported by Google's Cloud Natural Language API and asks the API to annotate all of the entities it recognizes in each article. Typically around 100K-200K articles a day are annotated. The resulting list of entities is compiled into GDELT's Global Entity Graph dataset.
  • Timeseries Ingest. Every minute, GDELT takes all of the entities extracted by the Natural Language API the previous minute and appends them to the Timeseries Insights API.
  • Timeseries Detection. Every 15 minutes, GDELT asks the Timeseries Insights API to identify all anomalous entities that appear to be sharply trending in the past 60 minutes (using a rolling window) compared with the previous three days. For the moment we're computing this every 15 minutes since GDELT 2.0 operates on a 15 minute heartbeat, but once GDELT 3.0 comes online we'll be producing these every minute.

Downloading The Dataset

If you want to just dive into the resulting dataset without worrying about the details, download the following URLs every 15 minutes (at :00, :15, :30 and :45 after each hour). Note that given the very small sample of articles GDELT annotates through the Natural Language API, not all hours may have enough annotated entities to yield strong anomaly results, so some 15-minute periods may be missing one or more files or the files may contain no entities.

For more information about each of the entity types below, see the Natural Language API's documentation. Only the entity types below are currently included, not the full list of entities the API recognizes. The date format is "YYYYMMDDHHMMSS" and since the first available date/time is February 15, 2022 at 2:15 UTC, the first files are:

  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityCONSUMER_GOOD.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityEVENT.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityLOCATION.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityORGANIZATION.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityPERSON.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi/20220215021500.EntityWORK_OF_ART.anomalies.json.gz
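
If it helps to see the fetch loop in code, here is a minimal Python sketch (standard library only) of how a client might construct and download the six files for one 15-minute period. The URL pattern and entity-type list mirror those above, while the fetch_period() helper and its timestamp argument are simply illustrative names, not part of the dataset itself:

import gzip, json, urllib.error, urllib.request

BASE = "http://data.gdeltproject.org/gdeltv3/timeseriesinsights/web_nlpapi"
ENTITY_TYPES = ["CONSUMER_GOOD", "EVENT", "LOCATION",
                "ORGANIZATION", "PERSON", "WORK_OF_ART"]

def fetch_period(timestamp):
    """Fetch all six anomaly files for one 15-minute period (YYYYMMDDHHMMSS)."""
    results = {}
    for etype in ENTITY_TYPES:
        url = f"{BASE}/{timestamp}.Entity{etype}.anomalies.json.gz"
        try:
            with urllib.request.urlopen(url) as resp:
                results[etype] = json.loads(gzip.decompress(resp.read()))
        except urllib.error.HTTPError:
            results[etype] = None   # some periods may be missing files (see note above)
    return results

anomalies = fetch_period("20220215021500")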

To understand the format of the files above and what all of the various fields mean, see the Timeseries Insights API "ForecastResult" documentation.

Creating The Dataset

The first step in creating this dataset was to create an initial static "seed" dataset containing a week of annotations from the Natural Language API. We used BigQuery to create an extract from GDELT's Global Entity Graph dataset. You can see the extract is quite simple: we compile all of the entities from English language articles published from January 15 to 23, 2022, titlecase them, organize them by type and collapse them into an array under each article's URL. Since the Timeseries Insights API requires that events be organized under a numeric groupId, we use the FarmHash of the URL to generate the necessary 64-bit numeric ID. This static dataset will act as the initial historical comparison period for the dataset.

WITH data AS (
  SELECT
    FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime,
    FARM_FINGERPRINT(url) groupId,
    STRUCT(CONCAT('Entity', entity.type) AS name, INITCAP(entity.name) AS stringVal) AS dimensions
  FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entity
  WHERE DATE(date) >= "2022-01-15" AND DATE(date) <= "2022-01-23" AND lang='en'
)
SELECT eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

BigQuery automatically shards exported data into multiple files, so we use GSUTIL's "cat" feature to concatenate the sharded export back into a single file:

gsutil -q cat gs://[BUCKET]/timeapi-nlpapi-*.json | gsutil cp - gs://[BUCKET]/geg_nlpapi-production-20220115-20220123.json

We then create the new dataset in the Timeseries Insights API and seed it using this initial static historical dataset. The command below tells the API to create the new dataset, tells it what fields to expect, tells it that we expect to stream new data into it after it is created and gives it the GCS path of the static seed dataset:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets -d '{
  "name":"geg_nlpapi-production",
  "dataNames": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART",
  ],
  "dataSources": [
      { "uri":"gs://[BUCKET]/geg_nlpapi-production-20220115-20220123.json" }
  ]
}'

The API will now create the dataset. We can check its status to see when it has finished loading:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets

Once the dataset has been created, we can now append to it in realtime!

Every minute, we have a PERL script that essentially repeats the BigQuery query above, but narrowed to search just the previous minute for new articles that have been freshly crawled and annotated through the Natural Language API. While FarmHash is readily available in PERL, it is not easily accessible in every scripting language, so to make a more universal solution, we replace the "FARM_FINGERPRINT(url)" of the BigQuery query with a bit of creative PERL that converts the URL to UTF8, then takes its MD5. Since an MD5 is 128 bits and the API requires a 64-bit groupId, we truncate the MD5 to its first 8 bytes and use PERL's "unpack" function to interpret them as a signed quad, also known as a long long or a signed 64-bit integer ("int64"):

use Encode qw(encode_utf8);
use Digest::MD5 qw(md5);
my $md5 = md5(encode_utf8($url));
my $groupId = unpack('q', substr($md5, 0, 8)) . '';

This does mean that there is a slight mismatch between the groupIds used in the initial static dataset and the truncated MD5-based groupIds used by the live-appended dataset, but there should be minimal issues for this application.

Note carefully that we take the resulting 64-bit numeric value and convert it to a string. The reason for this is that int64 values cannot be safely exchanged as numeric datatypes in JSON (many JSON parsers store numbers as double-precision floats and lose precision beyond 53 bits), so Google recommends that they be cast into strings.
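
For reference, the same truncated-MD5 groupId and string cast can be reproduced in most languages. Here is a hedged Python equivalent of the PERL snippet above; it assumes the same little-endian machine, since PERL's "q" unpack uses native byte order:

import hashlib, struct

def group_id(url):
    """Truncated-MD5 groupId for a URL, returned as a string per the int64 guidance above."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # First 8 bytes of the MD5 interpreted as a native-order signed 64-bit integer,
    # matching PERL's unpack('q', ...) on the same little-endian machine.
    return str(struct.unpack("=q", digest[:8])[0])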

How do we then append these new articles to our Timeseries Insights API? It turns out that the API makes this trivial!

Every minute, we compile a list of all of the new articles crawled in the past minute that have been annotated by the Natural Language API. We compute a groupId for each URL using the code above and compile all of its entities into an array. We then create one JSON record per article (each unique groupId has to be its own separate file) containing an "events" array whose single entry records the article's "groupId" and publication "eventTime", along with a "dimensions" array listing all of the entities found in the article.

Thus, if in the last minute we have 1,000 freshly crawled articles annotated by the Natural Language API, we write each one to a separate file on disk that looks like this:

{ "events": [ { "groupId": "-719047665294405379", "eventTime": "2022-02-14T20:03:15+00:00", "dimensions": [ {"name": "EntityPERSON", "stringVal": "Expert"},{"name": "EntityORGANIZATION", "stringVal": "University Of Western Australia"},{"name": "EntityORGANIZATION", "stringVal": "Lead World Motorsport Safety Body"},{"name": "EntityPERSON", "stringVal": "Sam Michael"},{"name": "EntityORGANIZATION", "stringVal": "Federation Internationale De L'Automobile"},{"name": "EntityOTHER", "stringVal": "News"},{"name": "EntityLOCATION", "stringVal": "Australia"},{"name": "EntityOTHER", "stringVal": "Maintenance"},{"name": "EntityPERSON", "stringVal": "President"},{"name": "EntityORGANIZATION", "stringVal": "Safety Commission"},{"name": "EntityPERSON", "stringVal": "Tnsper"},{"name": "EntityLOCATION", "stringVal": "Perth"},{"name":"EntityORGANIZATION", "stringVal": "Business School"},{"name": "EntityORGANIZATION", "stringVal": "Bhp Fellow Of Leadership"},{"name": "EntityOTHER", "stringVal": "Leadership"},{"name": "EntityOTHER", "stringVal": "Maintenance Unit"},{"name": "EntityOTHER", "stringVal": "Information"},{"name": "EntityCONSUMER_GOOD", "stringVal": "Products"},{"name": "EntityPERSON", "stringVal": "Veteran"},{"name": "EntityORGANIZATION", "stringVal": "Australian Formula 1"},{"name": "EntityLOCATION", "stringVal": "Geraldton"} ] }] }

The end result is a directory with 1,000 JSON files, each one representing one article.
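
As a hedged illustration of that record format, the sketch below assembles and writes one such per-article file in Python. The build_event_record() helper, the example URL and the (type, name) entity pairs are placeholders rather than part of the actual PERL pipeline:

import hashlib, json, struct

def build_event_record(url, publish_time, entities):
    """Build one appendEvents payload for a single article.
    entities is assumed to be a list of (type, name) pairs from the Natural Language API."""
    group_id = str(struct.unpack("=q", hashlib.md5(url.encode("utf-8")).digest()[:8])[0])
    dimensions = [{"name": "Entity" + etype, "stringVal": name.title()}
                  for etype, name in entities]
    return {"events": [{"groupId": group_id,
                        "eventTime": publish_time,   # e.g. "2022-02-14T20:03:15+00:00"
                        "dimensions": dimensions}]}

# Each article becomes its own small JSON file, numbered sequentially for the append step below.
record = build_event_record("https://example.com/article",
                            "2022-02-14T20:03:15+00:00",
                            [("PERSON", "Sam Michael"), ("LOCATION", "Perth")])
with open("0.json", "w") as f:
    json.dump(record, f)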

Next, we get our bearer token for authenticating to the API:

my $bearertoken = `gcloud auth application-default print-access-token`; $bearertoken=~s/\s+$//;

From the same PERL script, we then shell out to CURL via GNU Parallel to submit each file to the API in parallel (this assumes the files are numbered sequentially from 0 to X):

system("seq 0 $NUMBERFILES | parallel -j 10 'curl -s -H \"Content-Type: application/json\" -H \"Authorization: Bearer $bearertoken\" -X POST -d \@$CACHE_TMP/{}.json https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets/geg_nlpapi-production:appendEvents -o $CACHE_TMP/{}.json.ret'");

Note that the code above does not check the return code or response from the API. Thus, it does not automatically respond to rate limiting or API errors. A production-grade application would include appropriate handling for all of the various kinds of potential API error codes, but for the sake of simplicity, the pipeline presented in this demonstration avoids that error handling logic to focus just on the basic API interaction.
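
As a hedged sketch of what that error handling might look like (it is not part of the pipeline above), a production append loop could wrap each call in retry logic along these lines, backing off on rate-limit and transient server errors; the append_with_retry() helper and its parameters are illustrative only, using Python's requests library:

import time
import requests

def append_with_retry(url, payload, token, max_attempts=5):
    """POST one appendEvents payload, retrying on 429/5xx responses with exponential backoff."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {token}"})
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt)   # back off and retry transient failures
            continue
        resp.raise_for_status()        # other errors are not retryable
    raise RuntimeError(f"appendEvents failed after {max_attempts} attempts")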

That's it! The API ingests each appended record and automatically updates the underlying dataset!

Finally, it is time to query the dataset!

The Timeseries Insights API offers a wealth of parameters that can be adjusted to fit any application and the specific settings that are best for any given application will vary depending on the characteristics of the underlying dataset. In this case, after extensive experimentation, the following query was found to yield the best overall results:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets/geg_nlpapi-production:query -d '{
  detectionTime: "2022-02-14T15:00:00Z",
  slicingParams: {
    dimensionNames: ["EntityLOCATION"]
  },
  timeseriesParams: {
    forecastHistory: "259200s",
    granularity: "3600s",
    minDensity: 0
  },
  forecastParams: {
    sensitivity: 0.25,
    noiseThreshold: 0,
    seasonalityHint: "DAILY"
  },
  returnNonAnomalies: false,
  returnTimeseries: true
}' > RESULTS.TXT; grep stringVal RESULTS.TXT | head -20;

You can see more technical detail about each of the parameters above in the API's documentation. The "detectionTime" and "granularity" parameters tell the API the period of time to search for anomalies. In this case, since the detectionTime is 15:00 UTC on February 14, 2022 and the granularity is 3600 seconds (one hour), the API will search from 15:00 to 16:00 UTC for anomalies. The "forecastHistory" parameter tells the API how far into the past (from the detectionTime) to look to determine a baseline against which to compare the search period when looking for anomalies. In this case we are telling the API to look 259,200 seconds (3 days) prior to the detectionTime. The API also uses the granularity to bin the historical comparison period, so in this case it will bin the previous three days into hour-long periods.

Put more simply, the API will search from 15:00 to 16:00 UTC on February 14, 2022 for anomalies by comparing that hour against the 72 hours from February 11, 2022 at 15:00 UTC until February 14, 2022 at 15:00 UTC. The API will compile a list of every entity it finds from 15:00 to 16:00 UTC on February 14th, then build each entity's hourly timeline of appearances from February 11th at 15:00 UTC through February 14th at 15:00 UTC and use that timeline to determine whether the entity's total number of mentions from 15:00 to 16:00 UTC on February 14th is highly unexpected given the previous 72 hours.

The other parameters, like minDensity (which lets you filter out rare entities that aren't mentioned very often), sensitivity (just how anomalous an entity must be to be flagged) and noiseThreshold (which filters out jitter), are left at their default settings here and allow you to fine-tune what the API sees as an "anomaly." We also set the seasonalityHint to suggest that there are likely macro-level daily trends. The returnTimeseries parameter tells the API that for each anomaly it should return the full 72-hour historical timeline that was used to evaluate it, making it easy for end applications to display its trending timeline. For debugging purposes, returnNonAnomalies can be set to true to tell the API to output every single entity it examined, even those it ultimately determined were not anomalies, in order to understand the total universe of entities it is considering for each query.

In the case of the dataset above, we run this query every 15 minutes, replacing "EntityLOCATION" with each of the entity types listed earlier, and then output the results to GCS for you to download!
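
For readers who prefer a script to the raw CURL command, here is a hedged Python sketch of that per-type query loop. The project ID, dataset name, detectionTime and output filenames are placeholders, and the token is fetched the same way the PERL script does above:

import json, subprocess
import requests

PROJECT = "[PROJECTID]"
DATASET = "geg_nlpapi-production"
ENTITY_TYPES = ["EntityCONSUMER_GOOD", "EntityEVENT", "EntityLOCATION",
                "EntityORGANIZATION", "EntityPERSON", "EntityWORK_OF_ART"]

token = subprocess.run(["gcloud", "auth", "application-default", "print-access-token"],
                       capture_output=True, text=True).stdout.strip()

for etype in ENTITY_TYPES:
    body = {
        "detectionTime": "2022-02-14T15:00:00Z",   # in production this advances every 15 minutes
        "slicingParams": {"dimensionNames": [etype]},
        "timeseriesParams": {"forecastHistory": "259200s", "granularity": "3600s", "minDensity": 0},
        "forecastParams": {"sensitivity": 0.25, "noiseThreshold": 0, "seasonalityHint": "DAILY"},
        "returnNonAnomalies": False,
        "returnTimeseries": True,
    }
    resp = requests.post(
        f"https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT}/datasets/{DATASET}:query",
        headers={"Authorization": f"Bearer {token}"}, json=body)
    with open(f"{etype}.anomalies.json", "w") as f:   # simplified output name for this sketch
        json.dump(resp.json(), f)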

Below is a sample entry that shows what the API output looks like. Each file is wrapped with an outer "anomalyDetectionResult" that, in turn, contains an array of "anomalies". Each anomaly is a JSON object that contains a wealth of information about it. The "ForecastResult" API documentation provides more information about the meaning of each of these fields. In this case, the entity is the location "World" and the "history" array contains the hourly timeline of how many times it was seen per hour over the previous three days. You can immediately see that most hours it is seen around 200-300 times, but in the final hour it was seen 945 times, marking a huge surge in mentions that resulted in the API flagging it as an anomaly. A short parsing sketch follows the sample.

{
   "anomalyDetectionResult" : {
      "anomalies" : [
         {
            "result" : {
               "label" : "ANOMALY",
               "holdoutErrors" : {
                  "mdape" : 1,
                  "rmd" : 1
               },
               "history" : {
                  "point" : [
                     {
                        "value" : 216,
                        "time" : "2022-02-12T01:00:00Z"
                     },
                     {
                        "time" : "2022-02-12T02:00:00Z",
                        "value" : 202
                     },
                     {
                        "time" : "2022-02-12T03:00:00Z",
                        "value" : 174
                     },
                     {
                        "time" : "2022-02-12T04:00:00Z",
                        "value" : 200
                     },
                     {
                        "value" : 210,
                        "time" : "2022-02-12T05:00:00Z"
                     },
                     {
                        "value" : 198,
                        "time" : "2022-02-12T06:00:00Z"
                     },
                     {
                        "value" : 204,
                        "time" : "2022-02-12T07:00:00Z"
                     },
                     {
                        "time" : "2022-02-12T08:00:00Z",
                        "value" : 248
                     },
                     {
                        "value" : 182,
                        "time" : "2022-02-12T09:00:00Z"
                     },
                     {
                        "time" : "2022-02-12T10:00:00Z",
                        "value" : 246
                     },
                     {
                        "time" : "2022-02-12T11:00:00Z",
                        "value" : 288
                     },
                     {
                        "value" : 256,
                        "time" : "2022-02-12T12:00:00Z"
                     },
                     {
                        "value" : 198,
                        "time" : "2022-02-12T13:00:00Z"
                     },
                     {
                        "time" : "2022-02-12T14:00:00Z",
                        "value" : 182
                     },
                     {
                        "time" : "2022-02-12T15:00:00Z",
                        "value" : 210
                     },
                     {
                        "value" : 206,
                        "time" : "2022-02-12T16:00:00Z"
                     },
                     {
                        "time" : "2022-02-12T17:00:00Z",
                        "value" : 260
                     },
                     {
                        "value" : 242,
                        "time" : "2022-02-12T18:00:00Z"
                     },
                     {
                        "value" : 252,
                        "time" : "2022-02-12T19:00:00Z"
                     },
                     {
                        "value" : 232,
                        "time" : "2022-02-12T20:00:00Z"
                     },
                     {
                        "value" : 222,
                        "time" : "2022-02-12T21:00:00Z"
                     },
                     {
                        "value" : 216,
                        "time" : "2022-02-12T22:00:00Z"
                     },
                     {
                        "value" : 158,
                        "time" : "2022-02-12T23:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T00:00:00Z",
                        "value" : 152
                     },
                     {
                        "value" : 168,
                        "time" : "2022-02-13T01:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T02:00:00Z",
                        "value" : 182
                     },
                     {
                        "value" : 160,
                        "time" : "2022-02-13T03:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T04:00:00Z",
                        "value" : 188
                     },
                     {
                        "value" : 208,
                        "time" : "2022-02-13T05:00:00Z"
                     },
                     {
                        "value" : 206,
                        "time" : "2022-02-13T06:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T07:00:00Z",
                        "value" : 238
                     },
                     {
                        "time" : "2022-02-13T08:00:00Z",
                        "value" : 218
                     },
                     {
                        "time" : "2022-02-13T09:00:00Z",
                        "value" : 202
                     },
                     {
                        "value" : 240,
                        "time" : "2022-02-13T10:00:00Z"
                     },
                     {
                        "value" : 216,
                        "time" : "2022-02-13T11:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T12:00:00Z",
                        "value" : 202
                     },
                     {
                        "value" : 242,
                        "time" : "2022-02-13T13:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T14:00:00Z",
                        "value" : 220
                     },
                     {
                        "value" : 262,
                        "time" : "2022-02-13T15:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T16:00:00Z",
                        "value" : 268
                     },
                     {
                        "value" : 292,
                        "time" : "2022-02-13T17:00:00Z"
                     },
                     {
                        "value" : 162,
                        "time" : "2022-02-13T18:00:00Z"
                     },
                     {
                        "value" : 945,
                        "time" : "2022-02-15T01:00:00Z"
                     }
                  ]
               },
               "detectionPointForecastLowerBound" : -3.65238479828758,
               "trainingErrors" : {
                  "rmd" : 0.0679901299906175,
                  "mdape" : 0.0901999508208943
               },
               "forecastStats" : {
                  "density" : "58"
               },
               "detectionPointActual" : 945,
               "forecast" : {
                  "point" : [
                     {
                        "time" : "2022-02-15T01:00:00Z",
                        "value" : -1.82619239914379
                     }
                  ]
               },
               "detectionPointForecast" : -1.82619239914379
            },
            "dimensions" : [
               {
                  "stringVal" : "World",
                  "name" : "EntityLOCATION"
               }
            ],
            "status" : {}
         }
......
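
To give a sense of how an application might consume one of these files, here is a hedged Python sketch that loads an anomalies file and prints each flagged entity alongside its detection-hour count and its average hourly count over the history window. The field names follow the sample above and the "ForecastResult" documentation, while the summarize() helper and the filename are illustrative:

import gzip, json

def summarize(path):
    """Print each anomaly's entity, its detection-hour count and its historical hourly average."""
    with gzip.open(path, "rt") as f:
        result = json.load(f)
    for anomaly in result["anomalyDetectionResult"].get("anomalies", []):
        entity = anomaly["dimensions"][0]["stringVal"]
        actual = anomaly["result"].get("detectionPointActual", 0)
        history = anomaly["result"].get("history", {}).get("point", [])
        avg = sum(p["value"] for p in history) / len(history) if history else 0
        print(f"{entity}: {actual} mentions this hour vs ~{avg:.0f}/hour over the prior 72 hours")

summarize("20220215021500.EntityLOCATION.anomalies.json.gz")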

We're tremendously excited to see what kinds of new applications you're able to create with this incredible new dataset! Look for announcements in the coming weeks as we integrate this data into some of our interfaces. Remember that this is an experimental dataset, so we may adjust some of the parameters over time and the API may add or remove fields in the JSON.