Identifying Breaking News Stories Across The World With Google’s Timeseries Insights API

GDELT today encompasses more than 8.4 trillion datapoints spanning global events and narratives in 152 languages across text, television, radio and still imagery news dating back 220 years and updating every minute. Within this immense realtime archive lie the patterns of Planet Earth: its realtime updates capture the earliest glimmers of tomorrow’s biggest stories, while its historical trends contextualize their impact. From Bluedot sending one of the earliest alerts of the impending Covid-19 pandemic in December 2019 to the earliest external warnings of 2014’s Ebola outbreak to the first signs of tomorrow’s conflicts and humanitarian disasters, GDELT’s realtime firehose records the earliest traces of the world’s stories big and small each day. How might we sift through this vast realtime data stream to identify those events in their earliest stages?

One of the challenges with analyzing datasets of GDELT’s scale is not only their sheer size and velocity but, most importantly, the need to incorporate historical data to take into consideration seasonality and the chaotic messiness and inadvertent trends of the real world. Each morning in each country there is a rush of news as the country wakes up, causing a clockwork surge of events. Mondays typically hold more breaking news than Fridays or weekends. Elections and holidays arrive at regular intervals. These journalistic and societal trends mean we can’t just look for tomorrow’s news within a brief rolling window; we have to combine realtime analysis with the context of the vast historical archive. That means we need a platform that can detect trends with historical context like seasonality and do so across immense datasets with realtime responsiveness.

Enter Google Cloud’s new Timeseries Insights API. The Timeseries Insights API is a time series analysis tool designed to examine extremely large datasets and identify seasonality-adjusted anomalies in realtime with response times measured in seconds over billions of time series. How might we apply this API to GDELT to find tomorrow’s stories?

To showcase how the API works, it is useful to start with a known event that occurred suddenly and attracted a worldwide surge of coverage: the April 15, 2019 Notre-Dame de Paris fire. To search for the burst of coverage, we’ll use the Global Entity Graph, which analyzes a small random sample of online news coverage each day through Google Cloud’s Natural Language API and compiles the list of entities the NLP API identifies in each article.

The Timeseries API supports both batch historical loading of data and live streaming updates. For the sake of simplicity for these examples, we will focus on batch loading. To batch load data, the API uses a very simple JSON-based file format, with input files stored in GCS. Each record represents an “event,” which can be anything from a sensor reading to a news article. The exact meaning of an event will vary from use case to use case, but can be thought of as essentially a sampling in time of the things you are interested in. For example, if you are looking for temperature anomalies in a data center, an “event” would be a given temperature recording from a sensor. If you are looking for network traffic anomalies, an “event” might be the incoming/outgoing bandwidth and packet counts to/from a VM at a given instant in time. In our case, an “event” is the publication of a news article, since the news in effect “samples” the state of society.

Each “event” has a timestamp that specifies when it occurred (“eventTime”) and a list of attributes being measured in the dataset called “dimensions.” You can think of the dataset as a spreadsheet, with each row being an event and each column being a “dimension.” Several datatypes are supported for dimensions, but here we will stick with string values (“stringVal”). Events can optionally be given a unique “groupId” that groups together related events. For example, an application looking for temperature anomalies might associate all readings from a specific room in a given day with a groupId referring to that room. In our case, we treat each news article as its own groupId.
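
To make the format concrete, here is a minimal Python sketch (illustration only; the URL and entities are invented) of assembling one such event record. The hash used for the groupId is just a stand-in for the BigQuery FARM_FINGERPRINT() call used later in this post:

import hashlib
import json
from datetime import datetime, timezone

def make_event(url, entities, published):
    """Build one event record in the JSON format described above.

    `entities` is a list of (dimension_name, value) pairs, e.g.
    [("EntityLOCATION", "Notre Dame")].
    """
    # Stand-in for BigQuery's FARM_FINGERPRINT(): any stable
    # string -> int64 hash will do for an illustrative groupId.
    group_id = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8],
                              "big", signed=True)
    return {
        "groupId": str(group_id),
        "eventTime": published.astimezone(timezone.utc)
                              .strftime("%Y-%m-%dT%H:%M:%S+00:00"),
        "dimensions": [{"name": n, "stringVal": v} for n, v in entities],
    }

print(json.dumps(make_event(
    "https://example.com/notre-dame-fire",                       # hypothetical URL
    [("EntityLOCATION", "Notre Dame"), ("EntityEVENT", "Fire")],
    datetime(2019, 4, 15, 19, 0, tzinfo=timezone.utc))))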

To extract the one-month period of April 2019 from the Global Entity Graph that contains the Notre Dame fire, we use the following BigQuery query. There are two inner queries joined together, with an outer query that reformats them into a struct array. The primary inner query searches for all English language news articles in the Global Entity Graph published in April 2019. Each URL is given its own unique groupId, but since URLs are strings and groupIds must be int64 numbers, we use BigQuery’s built-in FARM_FINGERPRINT() hashing function that converts any string into an int64. The FORMAT_TIMESTAMP() call is used to convert BigQuery’s TIMESTAMP data format into the RFC3339 UTC Zulu timestamp format required by the Timeseries API.

The Global Entity Graph lists each entity as it appeared in the article along with the unique identifier code of the underlying object, as provided by the Cloud Natural Language API. Thus, “Donald Trump,” “President Trump” and “Donald J. Trump” will all resolve to the same underlying unique identifier code. The second query joined to the first simply computes the most common name associated with each entity and uses that for all mentions. Thus, if one article refers to “Donald J. Trump” and five others mention “Donald Trump” it will see that “Donald Trump” is the most common name for that entity and use it for all six mentions. Finally, the “[STRUCT()]” groups all of the entities from an article together into a named array with the entity type and its most common name.

WITH data AS (select groupId, STRUCT(name, a.entity as stringVal) as dimensions, eventTime from (
    SELECT FARM_FINGERPRINT(url) groupId, CONCAT('Entity',entity.type) name, FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime, entity.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entity WHERE entity.mid is not null 
       and lang='en' and DATE(date) >= "2019-04-01" and DATE(date) <= "2019-04-30"
  ) b JOIN (
     SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities where entities.mid is not null and lang='en' and DATE(date) >= "2019-04-01" and DATE(date) <= "2019-04-30" group by entities.mid
  ) a USING(mid)
)
SELECT eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

The resulting table is exported from BigQuery as JSON, but due to its size it is sharded into multiple output files by BigQuery, so we merge them into a final output file:

time gsutil -q cat gs://TEMPBUCKET/output-*.json | gsutil cp - gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json

The end result is a list of records that looks like:

{"groupId":"-825716969267786339","dimensions":[{"name":"EntityEVENT","stringVal":"London Marathon"}],"eventTime":"2019-04-27T19:00:00+00:00"}
{"groupId":"-6875809976771089992","dimensions":[{"name":"EntityEVENT","stringVal":"Earth Day"}],"eventTime":"2019-04-20T11:30:00+00:00"}
{"groupId":"-2361415204314941271","dimensions":[{"name":"EntityLOCATION","stringVal":"River Thames"}],"eventTime":"2019-04-21T16:00:00+00:00"}
{"groupId":"-2129314452709554534","dimensions":[{"name":"EntityLOCATION","stringVal":"Indonesia"}],"eventTime":"2019-04-08T17:45:00+00:00"}

Note that all records with the same groupId are associated together by the API, so for simplicity’s sake the BigQuery query above can output each entity as a separate row, spreading the entities of a given article across multiple rows; the API will group them back together when it loads the data.

In all, the final dataset totals 44.5M rows (each entity is its own row in this case) covering 2.9M articles (an average of 15 entities per article).

To load this dataset into the API we use the CURL call below (replace the project ID with your own). You can see that it is extremely simple, specifying the name you want to give the dataset, the JSON file in GCS that contains the data, and a list of the dimensions in the dataset. The Cloud Natural Language API identified eight kinds of entities in these April 2019 articles, which we name with an “Entity” prefix (“EntityEVENT,” “EntityPERSON,” etc), so we list each of those here.

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"webnlp-201904",
  "dataNames": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART",
  ],
  "dataSources": [
    { "uri":"gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json" }
  ]
}'

Once submitted, the Timeseries API will load the dataset, taking anywhere from a few tens of minutes to an hour or two, or longer, depending on its size. To check on the status of the dataset and whether it has finished loading yet, run this CURL command:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets

It will return a list of all of your datasets and whether they are “LOADING” or “LOADED” or other status. Once a dataset says “LOADED” you are ready to begin querying it!
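
If you would rather script the wait than poll by hand, a rough Python sketch along these lines does the same status check (it assumes the listing response wraps your datasets in a “datasets” array and simply looks for the “LOADED” status string, so verify the exact field names against the API reference):

import json
import subprocess
import time
import urllib.request

PROJECT = "YOURPROJECTID"      # replace with your project ID
DATASET = "webnlp-201904"      # the dataset name used in the load request above

def access_token():
    # Reuse the same gcloud credentials the curl examples rely on.
    return subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True).strip()

def list_datasets():
    url = f"https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT}/datasets"
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {access_token()}"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Poll every five minutes until our dataset reports a LOADED status.
while True:
    ours = [d for d in list_datasets().get("datasets", [])
            if str(d.get("name", "")).endswith(DATASET)]
    print(json.dumps(ours, indent=2))
    if ours and "LOADED" in json.dumps(ours):
        break
    time.sleep(300)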

Querying a dataset for anomalies is trivial, requiring just a few parameters that can be adjusted to the unique needs of each dataset and use case. Using the API documentation, the query below searches the day of April 15, 2019 for locations that were mentioned dramatically more that day than in the preceding two weeks:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/webnlp-201904:query -d '{
  dimensionNames: ["EntityLOCATION"],
  testedInterval: {
    startTime: "2019-04-15T00:00:00Z",
    length: "86400s"
  },
  forecastParams: {
    holdout: 10,
    minDensity: 0,
    forecastHistory: "1209600s",
    maxPositiveRelativeChange: 1,
    maxNegativeRelativeChange: 1,
    forecastExtraWeight: 0,
    seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT

The first parameter is “dimensionNames” that specifies which dimensions to analyze for anomalies. In our case, our dataset contains eight different dimensions (“EntityLOCATION,” “EntityEVENT,” etc), but we can limit our analysis to just a single dimension, making it easy to examine just slices of the data at a time. For the Notre Dame fire, we’re primarily interested in location mentions, so we’ll use “EntityLOCATION.”

Each query specifies two separate time periods: one to examine for anomalies (“testedInterval”) and the second to use as a historical baseline to compare against (“forecastHistory” under “forecastParams”) to determine what “normal” looks like in order to identify what an “anomaly” looks like.

The “testedInterval” parameters tell the API what period of time should be examined for anomalies, specified as a start timestamp (“startTime” expressed in RFC3339 UTC Zulu format) and the length of time from that start timestamp forward to examine (“length” expressed in seconds, including fractional seconds up to nine decimals for high-temporal-precision analyses). Here we specify midnight April 15, 2019 for a period of 86,400 seconds (24 hours) to tell the API to examine the entirety of April 15th.

In order to identify anomalies in the “testedInterval” period, the API needs an idea of what “normal” looks like for the dataset and how to define an “anomaly.” These parameters are set in the “forecastParams” section.

The “holdout” parameter specifies what percentage of the dataset should be used for accuracy evaluation (set higher for greater accuracy or lower if your dataset is extremely small). The “minDensity” parameter determines how often a given time series (a “time series” is the timeline of appearances of a particular value in a particular dimension over time) must appear in the forecast time period to be eligible for anomaly analysis. Setting this to 0 tells the API to examine all values, including those that don’t appear at all during the forecast period and only appear during the testing period. This is useful for detecting surges in novel values. In contrast, setting it to a higher value can filter these out. The specific value you set for this parameter will be determined by your particular use case. For example, if you are looking for network anomalies in a large cluster and are constantly starting up new virtual machines each day, the biggest anomalies will be those new machines starting up, so you would set this to a higher value to only consider machines that were already running during the forecast period. In contrast, in our use case, we are especially interested in topics that were rarely mentioned in the news and then suddenly surge in mentions. For example, Notre Dame Cathedral would likely not be mentioned in the news very much at all in the week prior to the fire and then within the space of a few hours surged globally, so a higher “minDensity” setting would prevent it from matching, while setting it to 0 ensures it is returned.

The “forecastHistory” parameter tells the API how far back to look for the comparison period that defines what “normal” looks like. This is a length of time expressed in seconds that will be subtracted from the “startTime” of “testedInterval” and will become the comparison period. In our query above, we subtract 1,209,600 seconds (14 days * 86,400 seconds in a day) from our startTime of April 15, 2019. Thus, our query examines April 15, 2019 from 00:00:00 to 23:59:59 UTC for anomalies by comparing it to the period April 1, 2019 00:00:00 UTC to April 15, 2019 00:00:00 UTC. Make sure that this comparison period is larger than your test period and is large enough to represent the kinds of patterns that are “normal” in your dataset.
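
As a quick sanity check of that arithmetic, a few lines of Python (purely illustrative) reproduce the two windows described above:

from datetime import datetime, timedelta, timezone

DAY = 86400  # seconds in one day

tested_start = datetime(2019, 4, 15, tzinfo=timezone.utc)
tested_length = 1 * DAY        # examine all of April 15
forecast_history = 14 * DAY    # 1,209,600 seconds

params = {
    "testedInterval": {
        "startTime": tested_start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "length": f"{tested_length}s",
    },
    "forecastParams": {"forecastHistory": f"{forecast_history}s"},
}

comparison_start = tested_start - timedelta(seconds=forecast_history)
print(params)
print("comparison period:", comparison_start.date(), "->", tested_start.date())
# comparison period: 2019-04-01 -> 2019-04-15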

The “maxPositiveRelativeChange” and “maxNegativeRelativeChange” parameters adjust how an “anomaly” is defined, using the formula defined in the documentation. We’ll show the impact of changing these in a moment. Similarly, “forecastExtraWeight” can be used to reduce false positives by returning only stronger results, with the precise value being different for each use case and reflecting whether you want as many anomalies as possible or only the strongest. Given that many datasets exhibit marked seasonality/periodicity (such as a surge in mentions of “Monday” on Mondays), the “seasonalityHint” can be used to tell the API about any expected seasonality in the dataset and should be set to the finest-grained periodicity expected (in our case the data is expected to have regular intervals at the daily, weekly, monthly and yearly levels, so we set it to the most fine-grained, which is “DAILY”).

Finally, for debugging purposes, you can set “returnNonAnomalies” to “true” to get back a list of all values the API examined in the given testedInterval, including those it did not believe were anomalies using the given parameters. This can be especially useful for tuning the API, since you can see what scores were computed for each value.

So, with all of that, let’s see what results we get back!

The API returns 56,941 potential anomalies, with the top five being:

  • Ile de la Cite
  • Seine River
  • Notre Dame
  • Copley Square
  • Steamfitters Local 449

Looking through the raw JSON, each record looks like this (not all parameters may be present for all values):

{
  "dimensions": [
    {
      "name": "EntityLOCATION",
      "stringVal": "Seine River"
    }
  ],
  "result": {
    "holdoutErrors": {
      "mdape": 0.1428571428571429,
      "rmd": 0.1428571428571429
    },
    "trainingErrors": {
      "mdape": 0.84615384615384626,
      "rmd": 0.62459546925566334
    },
    "forecastStats": {
      "density": "85",
      "numAnomalies": 1
    },
    "testedIntervalActual": 586,
    "testedIntervalForecast": 9.3333333333333339,
    "testedIntervalForecastLowerBound": 8,
    "testedIntervalForecastUpperBound": 10.666666666666668
  },
  "status": {}
}

This gives us a variety of statistics about the value “Seine River,” but the most important is that, based on its mentions over the two weeks prior to the Notre Dame Cathedral fire, the API expected to see it 9.3 times on April 15th but instead saw it 586 times, making it a strong anomaly.
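
If you want to rank the returned slices yourself, a small post-processing sketch like the following can be pointed at the RESULTS.TXT file saved by the query above (it assumes the top-level “anomalies” array shown later in this post and tolerates slices that have no forecast value at all, like the one discussed next):

import json

# RESULTS.TXT is the file saved by the query CURL command above.
with open("RESULTS.TXT") as f:
    response = json.load(f)

ranked = []
for anomaly in response.get("anomalies", []):
    value = anomaly["dimensions"][0]["stringVal"]
    result = anomaly.get("result", {})
    actual = result.get("testedIntervalActual", 0)
    forecast = result.get("testedIntervalForecast")   # absent for novel values
    # Rank by how far the observed count exceeds the forecast; values with no
    # forecast at all are ranked by their raw count instead.
    score = actual / forecast if forecast else actual
    ranked.append((score, value, actual, forecast))

for score, value, actual, forecast in sorted(ranked, reverse=True)[:10]:
    print(f"{value}: actual={actual} forecast={forecast}")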

In contrast, we see that “Steamfitters Local 449” appeared zero times in the forecast period and 93 in the testedInterval, suggesting this is a novel entity. A closer inspection of news coverage from April 15 shows this surge was related to an Associated Press wire story that was republished by a large number of news outlets that quoted the business manager of Pittsburgh-based Steamfitters Local 449:

{
  "dimensions": [
    {
        "name": "EntityLOCATION",
        "stringVal": "Steamfitters Local 449"
    }
  ],
  "result": {
     "holdoutErrors": {},
     "trainingErrors": {},
     "forecastStats": {
         "numAnomalies": 1
     },
     "testedIntervalActual": 93
  },
  "status": {}
}

What if we want to filter out entries like Steamfitters Local 449 that did not appear in the forecast period? We can simply rerun our query with “minDensity” set to 5 to exclude values that did not appear at least a small number of times from April 1st through the end of the day on April 14th. This yields the following top five entries:

  • Ile de la Cite
  • Seine River
  • Notre Dame
  • Copley Square
  • Hyde Park

Note that Steamfitters Local 449 is no longer present, though Copley Square remains and Hyde Park takes the last spot. It turns out the 2019 Boston Marathon was run that day, with runners finishing in Copley Square, and there had been a steady stream of coverage in the preceding week about the upcoming race. Similarly, Hyde Park refers to the “Extinction Rebellion” protest that occurred there that day.

What if we keep “minDensity” at 5 and additionally change “forecastExtraWeight” to 100 to return only the strongest anomalies? The updated top five list is:

  • Ile de la Cite
  • Seine River
  • Notre Dame
  • Copley Square
  • Notre Dame Cathedral

This time we see that four of the most important locations are related to the Notre Dame fire, while Boston Marathon’s Copley Square still appears, reflecting the amount of media coverage it typically receives.

What if we keep “minDensity” at 5, “forecastExtraWeight” at 100 and change “maxPositiveRelativeChange” to 1.5? With all those changes we have the following query:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/webnlp-201904:query -d '{
  dimensionNames: ["EntityLOCATION"],
  testedInterval: {
     startTime: "2019-04-15T00:00:00Z",
     length: "86400s"
  },
  forecastParams: {
     holdout: 10,
     minDensity: 5,
     forecastHistory: "1209600s",
     maxPositiveRelativeChange: 1.5,
     maxNegativeRelativeChange: 1,
     forecastExtraWeight: 100,
     seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT

This yields just three anomalies for April 15, all related to the Notre Dame fire:

  • Ile de la Cite
  • Seine River
  • Notre Dame

What if we use the same parameters to look at the following day? Keeping everything the same, we just change “startTime” to “2019-04-16T00:00:00Z” which yields the following top seven locations:

  • Columbine
  • Al-Azhar
  • Sandy Hook
  • Columbine High School
  • Strasbourg
  • Littleton
  • Warsaw

Immediately clear are a number of locations related to Columbine High School, including nearby Littleton, Colorado. While largely forgotten today, it was on this day that a Florida teenager “infatuated” with the 1999 Columbine High School shooting flew to the area and purchased a gun in Littleton, amid fears she might attempt a mass shooting on the massacre’s 20th anniversary. The 2012 Sandy Hook shooting was mentioned heavily in coverage of the event, which recounted other school shootings.

Mentions of Al-Azhar relate to condolences by Sheikh Ahmed Al-Tayeb, the Grand Imam of Al-Azhar in Egypt, regarding the Notre Dame fire, which were widely covered. Mentions of Strasbourg refer to the ringing of its cathedral’s bell for 15 minutes to mark solidarity with Notre Dame, while Warsaw’s prominence relates to coverage of how it rebuilt after World War II.

Note that mentions of Notre Dame itself, Ile de la Cite, the Seine River, etc, do not appear as anomalies on April 16, despite the story receiving continued heavy coverage. This is because, based on our query parameters, the API is looking for only the strongest anomalies: since Notre Dame received heavy coverage on April 15, its continued heavy coverage on April 16 is not an anomaly under our criteria. This demonstrates the power of the Timeseries API to look past saturation coverage of yesterday’s events to find tomorrow’s stories.

Of course, analyzing a single day with just two weeks of history is a fairly trivial use case. Where the Timeseries Insights API’s real power lies is in its ability to scale up to massive datasets and sift out the temporal patterns in just seconds. What if we repeated the process above to surface the anomalies for each day from January 1, 2019 through April 30, 2021? While we will be running this analysis after the fact, the API only examines the dataset up to the day we’re examining, ignoring the rest of the dataset, so these are the results you would have gotten had you run the API that day.

Here we use the same BigQuery query as before, but export all entries from January 1, 2019 through April 30, 2021:

WITH data AS (select groupId, STRUCT(name, a.entity as stringVal) as dimensions, eventTime from (
    SELECT FARM_FINGERPRINT(url) groupId, CONCAT('Entity',entity.type) name, FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime, entity.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) entity WHERE entity.mid is not null 
       and lang='en' and DATE(date) >= "2019-01-01" and DATE(date) <= "2021-04-30"
  ) b JOIN (
     SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities where entities.mid is not null and lang='en' and DATE(date) >= "2019-01-01" and DATE(date) <= "2021-04-30" group by entities.mid
  ) a USING(mid)
)
SELECT eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

This yields a final dataset spanning 1.2 billion entity mentions across 76.8 million articles over 850 days. Using the SQL query “SELECT count(distinct(dimensions[OFFSET(0)].stringVal))” over the resulting temporary table, we see that there are 12.3 million distinct entity values. Since each entity is its own time series, that means our final dataset contains 12.3 million time series covering 850 days totaling 1.2 billion datapoints!

Once again we use the same request to load the results into the Timeseries API:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"webnlp-201901-202104",
  "dataNames": [
     "EntityCONSUMER_GOOD",
     "EntityEVENT",
     "EntityLOCATION",
     "EntityORGANIZATION",
     "EntityOTHER",
     "EntityPERSON",
     "EntityUNKNOWN",
     "EntityWORK_OF_ART",
  ],
  "dataSources": [
     { "uri":"gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201901-202104.json" }
  ]
}'

Once the data is loaded, we use the exact same query as above and repeat it for each day from January 1, 2019 through April 30, 2021 (just replace the “[YYYY-MM-DD]” with each day in succession; a small scripting sketch follows the query below):

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/webnlp-201901-202104:query -d '{
  dimensionNames: ["EntityLOCATION"],
  testedInterval: {
     startTime: "[YYYY-MM-DD]T00:00:00Z",
     length: "86400s"
  },
  forecastParams: {
     holdout: 10,
     minDensity: 5,
     forecastHistory: "1209600s",
     maxPositiveRelativeChange: 1.5,
     maxNegativeRelativeChange: 1,
     forecastExtraWeight: 100,
     seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT
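
One way to automate that day-by-day sweep is a small driver script along these lines (a sketch only: it reuses the exact request body from the CURL call above, while the output file naming is invented):

import json
import subprocess
import urllib.request
from datetime import date, timedelta

PROJECT = "YOURPROJECTID"   # replace with your project ID
URL = ("https://timeseriesinsights.googleapis.com/v1/projects/"
       f"{PROJECT}/datasets/webnlp-201901-202104:query")

def access_token():
    return subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True).strip()

def query_day(day):
    # Same request body as the CURL call above, with the day substituted in.
    body = {
        "dimensionNames": ["EntityLOCATION"],
        "testedInterval": {"startTime": f"{day}T00:00:00Z", "length": "86400s"},
        "forecastParams": {
            "holdout": 10, "minDensity": 5, "forecastHistory": "1209600s",
            "maxPositiveRelativeChange": 1.5, "maxNegativeRelativeChange": 1,
            "forecastExtraWeight": 100, "seasonalityHint": "DAILY",
        },
        "returnNonAnomalies": False,
    }
    req = urllib.request.Request(
        URL, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {access_token()}"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

day = date(2019, 1, 1)
while day <= date(2021, 4, 30):
    with open(f"results-{day}.json", "w") as f:   # hypothetical output naming
        json.dump(query_day(day), f)
    day += timedelta(days=1)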

You can download the complete day-by-day results, but here are a few examples:

2019-01-28 Trader Joe's,Shrine Auditorium,Ozark
2019-05-06 Frogmore Cottage,Red Sea,Arabian Peninsula,Mediterranean,Persian Gulf,St George's Chapel,Harry,Windsor,Windsor Castle,Kate Middleton,Arabian Sea,Trump Tower,Kensington Palace,Princess Charlotte,Buckingham Palace,Otisville
2020-01-24 Shanghai Disneyland,Shanghai Disney Resort,Elazig,Navarre,Jingmen,Death Valley,Jadriya,Buckeye,Qianjiang,Heilongjiang,Great Falls,Hebei,Xiantao,Forbidden City
2020-04-30 Andrew T. Cleckley Funeral Home,Hingham,Sir HN Reliance Foundation Hospital,Army Foundation College,Miami-Dade County,Bedfordshire,HMCS Fredericton,Keighley,Broward County,Palm Beach County,Marston Moretaine
2020-05-21 Westgate Entertainment District,AR-15,Chechnya,Kaliningrad,Hyogo,Glendale,Otisville,Corpus Christi
2021-01-06 Capitol Rotunda,Le Gavroche,Cannon House Office Building,Capitol,DeKalb County,Buckhead,Ebenezer Baptist Church,Gwinnett County,Pyongyang,Stevenage,Capitol Hill,Xingtai
2021-01-20 South Lawn,Park City,West Palm Beach,South Asia,Tomb of the Unknown Soldier,Blair House,Joint Base Andrews,Arlington National Cemetery,Mar-a-Lago,Oval Office,Palm Beach,Cathedral of St. Matthew the Apostle
2021-03-24 Rotterdam,Circle in the Square,Suez Canal,Great Bitter Lake,Insein Prison,Great Neck,Red Sea,Sinai Peninsula,Mediterranean,Panama,Bordeaux,Far East,Santa Rosa

The appearance of Trader Joe’s on January 28, 2019 is due to heavy media coverage of actor Geoffrey Owens’ inspiring story, while May 6, 2019’s density of British royalty-related locations is due to the birth of Meghan Markle and Prince Harry’s child. The shuttering of Shanghai Disney leads January 24, 2020, while April 30, 2020’s Andrew T. Cleckley Funeral Home refers to a pandemic-era story that received considerable attention. May 21, 2020’s Westgate Entertainment District refers to a mass shooting, while January 6, 2021’s density of US Capitol-related locations refers to the events of that day and January 20, 2021’s locations capture President Biden’s inauguration. March 24, 2021’s Suez Canal refers to the grounding of the container ship Ever Given.

Look closely and the entries for each day capture a cross-section of the major stories of the past two and a half years. Remember that the underlying dataset is only a small random sample of English language online coverage each day and that we have asked the Timeseries API for only the most significant anomalies for each day, so this list captures just a fraction of global events. Nevertheless, it showcases just how powerful even this most simplistic of approaches is in teasing out the patterns of daily society.

In reality, of course, you would be running this query in realtime, every few minutes to catch trends as they first emerge, rather than retrospectively like this. Making this possible is the fact that the API takes just a few seconds for each query and can run large numbers of queries in parallel for maximum scalability.

What about television news? The Media-Data Research Consortium (M-DRC), with whom GDELT has been working closely to analyze television news, was awarded a Google Cloud COVID-19 Research Grant last year to support “Quantifying the COVID-19 Public Health Media Narrative Through TV & Radio News Analysis,” including using Google’s Cloud Video AI API to non-consumptively analyze portions of the Internet Archive’s Television News Archive to create the Visual Global Entity Graph 2.0 (VGEG).

The Visual Global Entity Graph 2.0 includes the complete second-by-second onscreen text transcribed by Cloud Video AI’s multilingual OCR for BBC News London, CNN, MSNBC, Fox News and the ABC, CBS and NBC evening news broadcasts. What if we applied the same trending analysis from earlier to this OCR content, examining the daily onscreen text of CNN over the first seven months of the Covid-19 pandemic to chronicle the most significant words appearing onscreen each day?

The Visual Global Entity Graph analyzes the 24/7 airtime of CNN each day, which includes both news and advertising airtime. Using the closed captioning’s “Caption Mode” information, we are able to determine whether each line of captioning is advertising or news programming with 100% accuracy, since it reflects the channel’s own switching information of whether it is displaying a news or advertising signal. Since the time codes embedded in closed captioning lag the actual video signal by several seconds with a non-uniform delay, we combine it with the precision automated speech recognition (ASR) of Cloud Video AI, producing a precise catalog of each broadcast that indicates whether each individual second of airtime is news or advertising. Because advertising information is recorded in “caption time” and must be interpolated into “video time,” and because uncaptioned transitions between news and advertising are difficult to align using ASR, there is a small false positive rate of advertising being counted as news and vice-versa, but overall the dataset is extremely accurate.

The Visual Global Entity Graph tells us the OCR’d text for each second of airtime, while this advertisement catalog tells us whether that second is news or advertising programming, making it trivial to use a simple SQL “where date not in” clause to exclude almost all advertising from our analysis. While a small percentage of stray seconds of advertising airtime text will still squeak through due to this caption-to-video-time interpolation, this allows us to focus our analysis almost entirely on news airtime text. Using BigQuery’s built-in ML.NGRAMS() ngramming function, it is trivial to convert a given second of OCR’d text into a unigram table, while we use BigQuery’s built-in regular expression support to exclude any unigram that does not include at least one letter (this excludes all of the date and time-related onscreen text).

The end result is a simple SQL query that constructs a daily unigram table from the onscreen text of CNN from January 1, 2020 through July 31, 2020 in the input format needed by the Timeseries API:

with data as (
  SELECT
    FARM_FINGERPRINT(iaClipUrl) groupId,
    STRUCT('Ngram' as name, ngram as stringVal) as dimensions,
    FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime
  FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
    UNNEST(ML.NGRAMS(SPLIT(REGEXP_REPLACE(LOWER(OCRText), r'(\pP)', r' \1 '), ' '), [1,1], ' ')) as ngram
  WHERE length(ngram) > 2 and REGEXP_CONTAINS(ngram, r'[A-Za-z]+') and station='CNN'
    and DATE(date) >= "2020-01-01" and DATE(date) <= "2020-07-31"
    and date not in (
      SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime`
      WHERE DATE(date) >= "2020-01-01" and DATE(date) <= "2020-07-31" and station='CNN' and type!='NEWS'
    )
)
select eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

This yields a dataset of 333 million unique time series.

We then load the results of this query into the Timeseries API using:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"tvocr-cnn-202001-202007-1grams",
  "dataNames": [
     "Ngram"
  ],
  "dataSources": [
     { "uri":"gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/tvocr-cnn-202001-202007-1grams.json" }
  ]
}'

We can then query for anomalous words appearing in the onscreen text of a given day on CNN. Would the API have flagged the moment when CNN first began to cover the Covid-19 pandemic heavily? Using the TV AI Explorer to search for onscreen textual mentions of the word “virus” during January 2020, we see it first begins to surge on January 18, 2020. A manual review of the small number of mentions earlier in the month shows they were focused on US flu cases and a handful of drug advertisements (the TV AI Explorer doesn’t filter out advertising airtime). Mentions of “virus” begin trending on January 17, 2020, but only begin increasing dramatically on January 18th. We could easily adjust our query parameters to tell the Timeseries API to return even small anomalies like January 17, but for the purposes of this example we’ll largely reuse the same parameters as before in order to detect only the biggest anomalies each day. The most important change is reducing “forecastHistory” to narrow our comparison period to just the last seven days: onscreen text tends to repeat heavily over longer time horizons and to be highly bursty, so this smaller horizon ensures better sensitivity for OCR text.

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/tvocr-cnn-202001-202007-1grams:query -d '{
  dimensionNames: ["Ngram"],
  testedInterval: {
     startTime: "2020-01-18T00:00:00Z",
     length: "86400s"
  },
  forecastParams: {
     holdout: 10,
     minDensity: 5,
     forecastHistory: "604800s",
     maxPositiveRelativeChange: 1,
     maxNegativeRelativeChange: 1,
     forecastExtraWeight: 150,
     seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT

This yields the following list of anomalous onscreen words that day capturing a range of events drawing attention in mid-January last year:

news,breaking,texts,royal,newsroom,documents,harry,meghan,nunes,longer,highness,titles,from,release,images,dems,london,getty,women,use,family,indianapolis,per,suggest,delta,anderson,lancaster,ambassador,show,scene,police,give,cooper,prince,working,markle,members,spied,omaha,groups,city,aide,voice,san,secupp,memphis,communication,arson,erson,march,brief,virginia,reveals,cleveland,king,cupp,smerconish,sanders,denver,soleimani,hypocrisy,democrats,happening,results,raleigh,queen,unveils,weather,professor,likes,minneapolis,court,cincinnati,archives,louis,foster,judiciary,columbus,weinstein,portland,friday,between,couple,palace,book,dem,seattle,attorney,primary,west,bre,destruction,palm,details,haven,back,critical,cooper3,election,across,high,rights,pittsburgh,called,nashville,buckingham,sends,world,arbiter,sir,forward,max,response,she,francisco,kiev,feud,babies,dopes,alert,virus,governor,dump,duke,duchess,sussex,recounts,diego,emeritus,vegas,lake,warren,commentator,anti,sacramento,las,salt,unfiltered,lowa,devin,represent,top,5pm,deadlines,saint,crimes,fight,duties,brea,leading,albuquerque,quick,millions,australia,member,richmond,orleans,tep,climate,son,kansas,fuel,come,play,americans,say,lebanon,reporting,misdemeanors,bad,madison,puerto,rico,implicate,beach,made,death,snow,community,change,poper,key,spent,file,support,firearms,eaking,steyer,fired,beirut,bunch,message,launches,england,protesters,criminal,cani,person,generals,african,nderson,make,said,gas,wisconsin,protests,aking,tammy,ews,richard,violent,turn,baldwin,passengers,daily,tear,probes,victoria,corruption,florida,clash,god,party,barr,reaking,elizabeth,korea,dirty,quest,power,moines,des,airline,cdc,harvey,winter,tom,schools,vermont,struggles,human,debate,severe,face,pelosi,capital,taking,doug,alabama,enforcement,fears,break,caucuses,increased,coons,eric,infected,wuhan,stepping,role,allegations,means,testing,dirt,edge,looking,bob,town,wants,pick,tennessee,rift,study,seconds,deadly,includes,rivals,pause,percent,much,like,offer,hillary,springs,airports,kim,endorse,bushfires,recover,lerson,near,believes,jet,follows,impartiality,once,filtered,austin,republican,immigrants,ice,insider,think,jersey,storm,extremist,bans,view,hall,facebook,agent,definition,bush,hurts,together,unity,red,abused,congressman,speaks,trading,ing,jury,thousands,accuse,prison,came,colchester,photos,streets,watch,continue,temporarily,anchor,guidelines,releases,activists,fighting,derek,phone,worry,whitewater,focus,indicating,facing,outbreak,sean,nancy,representative,roll,reportedly,island,martin,building,sarah,dig,push,imminent,ered,trialcnn,civil,sbre,alerts,moments,fast,earthquakes,westwood,reveal,leaders,ponce,importance,same,ground,missiles,wendy,johnson,oklahoma,writer,ana,kline,daughter,ang,limbo,fate,michelle,hbo,admitted,oncn,thinks,ipsos,bair,fdic,hunter,12pm,luther,loved,patronages,fitzwilliams,milwaukee,watching,brian,less,unfilter,association,tered,kasparian,arriving,rebecca,buck,standards,speaker,sentinel,minute,billings,lawsuit,producer,turks,terry,mcauliffe,newspapers,estimates,limit,unf,haynes,ideas

Midway through the list is the word “virus” showing that the API successfully identified the surge of mentions that day. In a real-world application, rather than unigrams you might apply part of speech tagging to the OCR results to extract entire noun phrases, which would yield more intelligible and meaningful results than isolated words. Yet, even this simplistic approach of converting each second of airtime to a list of words was enough for the Timeseries API to tease out the first real mentions of the pandemic on CNN!
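
As a rough sketch of that noun-phrase idea, an off-the-shelf tagger such as spaCy (not something used anywhere in this post, so purely an illustration of the approach) could replace the unigramming step:

# Illustration only -- spaCy is not part of the pipeline described above.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def onscreen_noun_phrases(ocr_text):
    """Extract noun phrases from one second of OCR'd onscreen text."""
    doc = nlp(ocr_text.lower())
    return [chunk.text for chunk in doc.noun_chunks]

print(onscreen_noun_phrases("CORONAVIRUS OUTBREAK SPREADS IN WUHAN"))
# e.g. ['coronavirus outbreak', 'wuhan']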

Repeating this query for each individual day from January 1, 2020 through July 31, 2020 yields a daily list of the most anomalous words appearing onscreen each day. Masking and vaccination both first appear as anomalies on February 8th, showing how early they began to be heavily mentioned, while “Covid” first trends on March 19.

You can download the complete day-by-day results to see how the stories of the first half of last year are reflected in the data. Remember that the parameters used here are designed to search for just the most significant anomalies, but could be easily adjusted to filter for different kinds of outliers.

Of course, American television news was rather late in covering the growing pandemic, beginning its first serious coverage in late January and not ramping up significantly until late February.

Biosurveillance company Bluedot sent one of the earliest alerts of the impending Covid-19 pandemic on December 31, 2019 by scanning GDELT’s Global Knowledge Graph (GKG) for certain kinds of disease mentions. Had the Timeseries API been available back then, could it have similarly picked up the first glimmers of the outbreak in Wuhan simply by detecting a surge in local Chinese-language news coverage about an unknown disease outbreak?

When Covid-19 first began to be noticed by hospitals in Wuhan, it was described as an “unexplained pneumonia,” meaning we would want to search for sudden increases in news coverage of pneumonia outbreaks. Moreover, an automated system for identifying high-profile pneumonia outbreaks globally might be useful in general to detect future respiratory disease outbreaks.

The Global Knowledge Graph uses machine translation to examine all worldwide online news coverage in 65 of the 152 languages GDELT monitors and compiles a list of geographic locations and thematic topics mentioned in that coverage, along with their relative mention locations in each article. Combining these two datasets we can quickly compile the list of locations most closely associated with mentions of pneumonia by day over a given time period. Indeed, this approach clearly identifies the early glimmers of the outbreak in Wuhan around 10PM EST on December 30, 2019.

Using BigQuery’s User Defined Function (UDF) capability, we can write a custom JavaScript function to merge the location and thematic entries for each article in the GKG, outputting a list of locations and their associated topics and their source URLs. The Timeseries API’s scalability means we could load this entire dataset into the API and use its “pinnedDimensions” to examine a given topic like “pneumonia” for location anomalies.  Having the entire dataset loaded would allow us to rapidly examine any topic or combination of topics for maximal flexibility, making interactive “what if” exploration of the dataset possible in realtime. On the other hand, a fixed production monitoring application searching for disease outbreaks could also just load locations associated with disease to create a smaller dataset for anomaly detection. The API is scalable enough to support either scenario with ease.
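
To illustrate that first approach, a query against such a combined dataset might look roughly like the Python dict below, pinning the topic dimension to pneumonia while scanning the location dimension for anomalies. The dimension names used here (“Theme” alongside the “LocationName” name used later in this post) are assumptions about how such a combined dataset might be laid out, and the exact shape of “pinnedDimensions” should be verified against the API documentation:

import json

# Illustrative request body only: the "pinnedDimensions" field follows the API's
# documented support mentioned above, but the dimension names are hypothetical
# and depend on how the combined location+theme dataset is built.
query_body = {
    "dimensionNames": ["LocationName"],    # scan locations for anomalies...
    "pinnedDimensions": [                  # ...but only within events that also
        {"name": "Theme", "stringVal": "TAX_DISEASE_PNEUMONIA"}   # mention pneumonia
    ],
    "testedInterval": {"startTime": "2019-12-31T00:00:00Z", "length": "86400s"},
    "forecastParams": {
        "holdout": 10, "minDensity": 5, "forecastHistory": "1209600s",
        "maxPositiveRelativeChange": 1, "maxNegativeRelativeChange": 1,
        "forecastExtraWeight": 150, "seasonalityHint": "DAILY",
    },
    "returnNonAnomalies": False,
}

print(json.dumps(query_body, indent=2))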

For the purposes of this example, we’ll create a small targeted dataset containing just locations mentioned in articles within 150 characters of a mention of pneumonia using the following BigQuery query:

CREATE TEMP FUNCTION GKGThemeListByLocation(V2Locations STRING, V2Themes STRING)
RETURNS ARRAY<STRUCT<LocationType INT64, LocationName STRING, Latitude STRING, Longitude STRING, AssociatedThemes STRING>>
LANGUAGE js AS r"""
  var MAX_DISTANCE = 150;
  var results = [];
  if (V2Locations === null || V2Themes === null) {
    return;
  }
  var locations = String(V2Locations).split(';');
  var themes = String(V2Themes).split(';');

  for (var location of locations) {
    var locationFields = location.split('#');
    if (locationFields.length < 2) {
      continue;
    }
    var locationOffset = parseInt(locationFields[locationFields.length - 1], 10);
    var closeThemes = [];

    for(var theme of themes) {
      var themeFields = theme.split(',');

      if (themeFields.length < 2) { continue; }
      var themeName = themeFields[0];
      var themeOffset = parseInt(themeFields[1], 10);
      if ((locationOffset > themeOffset && locationOffset - themeOffset < MAX_DISTANCE)
          || (locationOffset < themeOffset && themeOffset - locationOffset < MAX_DISTANCE)) {
               //closeThemes.push(themeName + ',' + themeOffset); this emits the theme offsets if desired
               closeThemes.push(themeName);
      }
    } //end loop over each theme

    if (closeThemes.length > 0) {
       results.push( { LocationType: locationFields[0], LocationName: locationFields[1], Latitude: locationFields[5], Longitude: locationFields[6], AssociatedThemes: closeThemes.join(';') } ); 
    }

   } //end loop over each location
   return results;
""";

with data as (
  SELECT FARM_FINGERPRINT(GENERATE_UUID()) groupId, STRUCT('LocationName' AS name, LocationName AS stringVal) as dimensions, FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", PARSE_TIMESTAMP("%Y%m%d%H%M%S", CAST(DATE AS STRING), "UTC"), "UTC") eventTime FROM (
    SELECT ANY_VALUE(DATE) DATE, REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)</PAGE_TITLE>') PageTitle, GKGThemeListByLocation( ANY_VALUE(V2Locations), ANY_VALUE(V2Themes) ) geothemes FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE DATE(_PARTITIONTIME) >= "2019-11-01" and DATE(_PARTITIONTIME) <= "2020-01-31"
      and (Themes like '%TAX_DISEASE_PNEUMONIA%') group by PageTitle
  ), unnest(geothemes) geotheme where geotheme.AssociatedThemes like '%TAX_DISEASE_PNEUMONIA%' and geotheme.LocationType > 1
) select eventTime, groupId, ARRAY_AGG(dimensions) as dimensions from data group by eventTime, groupId

While there is a lot of code above, the majority of it is the JavaScript function that merges the location and thematic data and formats the output into the Timeseries API’s input format. We also collapse on article title as a simple filter to remove duplicate wire stories that go viral, since otherwise a single major story could skew our results.
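
If the inline JavaScript is hard to follow, the core test it performs, keeping only themes whose character offset falls within 150 characters of a location’s offset, boils down to a few lines; here is a simplified Python restatement (the GKG field parsing itself is omitted):

MAX_DISTANCE = 150  # characters, the same threshold as the UDF above

def themes_near_location(location_offset, themes):
    """Return the theme names whose character offset in the article falls
    within MAX_DISTANCE characters of the given location offset.

    `themes` is a list of (theme_name, char_offset) pairs as parsed from the
    GKG V2Themes field.
    """
    return [name for name, offset in themes
            if abs(location_offset - offset) < MAX_DISTANCE]

# Example: a location mentioned at character 1200 of an article.
print(themes_near_location(1200, [("TAX_DISEASE_PNEUMONIA", 1150),
                                  ("EPU_ECONOMY", 3400)]))
# -> ['TAX_DISEASE_PNEUMONIA']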

Loading the results into the Timeseries API is as simple as:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"ggg-pneumonia-20191101-20200131",
  "dataNames": [
     "LocationName"
  ],
  "dataSources": [
     { "uri":"gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/ggg-pneumonia-20191101-20200131.json" }
  ]
}'

Querying it is as simple as:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/ggg-pneumonia-20191101-20200131:query -d '{
  dimensionNames: ["LocationName"],
  testedInterval: {
     startTime: "2019-12-31T00:00:00Z",
     length: "86400s"
  },
  forecastParams: {
     holdout: 10.0,
     minDensity: 5,
     forecastHistory: "1209600s",
     maxPositiveRelativeChange: 1,
     maxNegativeRelativeChange: 1,
     forecastExtraWeight: 150,
     seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT

This yields the single result:

"anomalies": [
  {
  "dimensions": [
    {
       "name": "LocationName",
       "stringVal": "Wuhan, Hubei, China"
    }
  ],
  "result": {
     "holdoutErrors": {},
     "trainingErrors": {
        "mdape": 1,
        "rmd": 1.333333333333333
     },
     "forecastStats": {
        "density": "27",
        "numAnomalies": 1
     },
     "testedIntervalActual": 774.66666666666663
  },
  "status": {}
  }
]

In short, had the Timeseries API been analyzing GDELT in December 2019, then on December 31, 2019 in the UTC time zone (actually December 30th in US time zones) it would have flagged an anomalous location associated with a sharp surge in mentions of pneumonia: Wuhan, China!

Running this query for each day from November 1, 2019 to January 31, 2020 yields no results until the December 31st alert for Wuhan, demonstrating the API’s ability to effectively minimize false positives. No further results are returned until January 20, 2020, when the API flags Guangdong and Beijing as anomalies as the virus spread to those cities and the first human-to-human transmission was confirmed, while Guangdong, Shanghai and Wuhan were returned on January 21st as the virus continued to spread and the implications of its ability to spread via human-to-human contact ricocheted across the world’s media. No further results were seen through the end of the month, as “coronavirus” took over from “pneumonia” as the primary term used to describe the evolving pandemic.

In a real-life application, we’d want to monitor the world in realtime for outbreaks, rather than daily. In fact, the API allows for any time range of analysis, measured down to fractional seconds. Given that GDELT 2.0 currently updates every 15 minutes (the forthcoming GDELT 3.0 will update every minute), what would it have looked like the evening of December 30th, 2019 in the EST time zone if the Timeseries API had been examining GDELT every hour for location anomalies associated with pneumonia?

Since the Timeseries API represents timestamps internally at precise subsecond resolution, we don’t need to reload our data; we can simply reuse our daily query from above with a few adjustments for the shorter time horizons of this query:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/ggg-pneumonia-20191101-20200131:query -d '{
  dimensionNames: ["LocationName"],
  testedInterval: {
     startTime: "2019-12-31T00:00:00Z",
     length: "1800s"
  },
  forecastParams: {
     holdout: 3,
     minDensity: 5,
     forecastHistory: "86400s",
     maxPositiveRelativeChange: 1,
     maxNegativeRelativeChange: 1,
     forecastExtraWeight: 10,
     seasonalityHint: "DAILY",
  },
  returnNonAnomalies: false
}' > RESULTS.TXT

In this case we’ll use a short rolling window looking backwards from the current time, so we set “length” under “testedInterval” to “1800s” (30 minutes). Given our shorter time horizon, we’ll also change “forecastHistory” to “86400s” (one day) so we are comparing the most recent window against the preceding rolling 24 hours to maximize sensitivity. Since our dataset is relatively sparse, we’ll also reduce “holdout” to “3,” the minimum value recommended in the documentation; this parameter tells the API what percentage of the data to reserve for accuracy evaluation, and our dataset is extremely small. Finally, we’ll reduce “forecastExtraWeight” to a much smaller value of “10,” since we expect hourly surges to be relatively small and want only a small non-zero value here to minimize false positives.
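
Run live, the same request body would simply be regenerated on each pass with a window ending at the current moment; a small sketch of that, mirroring the parameters above, might be:

import json
from datetime import datetime, timedelta, timezone

TESTED_LENGTH = 1800   # seconds, matching the "length" in the query above
HISTORY = 86400        # one day of comparison history

def rolling_query_body(now=None):
    """Build the query body for a window ending at the current time."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(seconds=TESTED_LENGTH)
    return {
        "dimensionNames": ["LocationName"],
        "testedInterval": {
            "startTime": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "length": f"{TESTED_LENGTH}s",
        },
        "forecastParams": {
            "holdout": 3, "minDensity": 5, "forecastHistory": f"{HISTORY}s",
            "maxPositiveRelativeChange": 1, "maxNegativeRelativeChange": 1,
            "forecastExtraWeight": 10, "seasonalityHint": "DAILY",
        },
        "returnNonAnomalies": False,
    }

print(json.dumps(rolling_query_body(), indent=2))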

What would it have looked like if we had been running the Timeseries API every hour the morning of December 31 (UTC time zone) using the query above? From midnight to 2:45AM UTC, we get no results, but at 3AM UTC (10PM EST on December 30, 2019) the API returns Wuhan as an anomaly. We get a second anomaly for Wuhan an hour later, at 11PM EST.

Thus, had the API been running on GDELT the evening of December 30, 2019 in New York, at 10PM EST it would have flagged Wuhan, China as an anomalous location associated with a surge in pneumonia-related mentions, followed by a second anomaly at 11PM EST.

The international community was not carefully watching Wuhan the evening of December 30, 2019 looking for a coronavirus outbreak. Yet had the Timeseries API been monitoring GDELT’s GKG for pneumonia anomalies that evening, at 10PM EST it would have flagged the first public glimmers of Covid-19!

Putting this all together, we’ve shown how the new Google Cloud Timeseries Insights API can sift through GDELT’s massive datasets to identify the earliest glimmers of tomorrow’s major stories, and do so across a range of modalities: English online news annotated through Google’s Cloud Natural Language API, television news OCR’d through the Cloud Video AI API, and machine translated and geocoded online news in 65 languages. Along the way, we showed how it even flags the first glimmers of Covid-19 at 10PM EST on December 30, 2019!