Identifying Trending Television News Narratives In Realtime With The GCP Timeseries Insights API & Video AI API

Over nearly a decade, GDELT has been collaborating with the Internet Archive's Television News Archive to explore how advanced non-consumptive computational approaches can help scholars, journalists, fact checkers and even ordinary citizens understand the patterns of television news. In addition to analyzing the closed captioning of every broadcast, we use Google’s Cloud Video AI API to perform multilingual OCR over BBC News London, CNN, Fox News and MSNBC every day, compiling a second-by-second transcript of their onscreen text to create the Visual Global Entity Graph 2.0 (VGEG).

This onscreen text is a uniquely powerful signal in that it represents live editorialization and summarization of the events being depicted onscreen, much as an online news article's title offers an editorialized summary of its contents. To distinguish news content from advertising, we use a special hidden ground truth data field in the closed captioning stream that specifies whether each line of captioning is advertising or news content and then use the Video AI API's subsecond ASR speech transcription to align this information with the live video. The end result is a second-by-second feed of the onscreen news programming text of BBC, CNN, Fox News and MSNBC, classified as news or advertising.

Journalists and scholars can keyword search this OCR'd text or run regular expressions across it, but until today there has been no way to search the dataset for trending stories and narratives. This livestream of OCR'd text amounts to a realtime compilation of editorializations of global events and narratives, and its size and scale (345,600 seconds of fullscreen OCR'd text every day across the four channels, often with myriad languages appearing in coverage of international events) offer an unprecedented opportunity to explore how narratives and events enter and evolve through the media landscape.

Today we are tremendously excited to announce a new experimental dataset that does exactly that, coupling the Timeseries Insights API with the Video AI API to provide a first glimpse at what realtime AI-powered OCR trend analysis and anomaly detection looks like over television news.

Pipeline Overview

Below is a simplified version of the underlying pipeline:

  • Video OCR. Every 30 minutes we use the Video AI API to non-consumptively analyze the latest broadcasts that the Internet Archive's Television News Archive has monitored from BBC News London, CNN, Fox News and MSNBC. There is typically a many-hour delay between a broadcast airing and the Archive preparing it for processing, so the broadcasts that become available every 30 minutes aired many hours earlier. The Video AI API's OCR process takes just minutes, so in a production system you could easily accelerate this step to run with just minutes or even seconds of lag behind the live broadcast.
  • Advertising Identification. Using a special hidden field embedded in the closed captioning, we extract the station-provided ground truth identifying each line of captioning as advertising or news content.
  • Captioning To Video Time Conversion. The station-provided advertising information is in "captioning time" which typically lags the actual video stream by a variable length of 2-10 seconds that changes constantly throughout the broadcast. To correct for this, we use the Video AI API's frame-level ASR speech recognition timecodes to precisely align the station-provided captioning transcript to subsecond accuracy with the video file and copy the advertising/news indicators over to the video timecodes.
  • Timeseries Ingest. The Video AI API's frame-level OCR output for each broadcast is aggregated into per-second compilations of the onscreen text (via the Visual Global Entity Graph 2.0) and filtered using the advertising data to remove all advertising airtime, with the final news airtime OCR seconds being broken into unigrams and bigrams and appended to the Timeseries Insights API.
  • Timeseries Detection. Every 30 minutes, GDELT asks the Timeseries Insights API to identify all anomalous entities that appear to be sharply trending in the past 12 hours (using a rolling window) compared with the previous three days. For the moment we're computing this every 30 minutes given the many-hours delay in the Archive preparing broadcasts for processing.
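
To make this cadence concrete, here is a hedged sketch of how such a pipeline might be scheduled (the script names and paths are hypothetical placeholders, not our actual tooling):

# Hypothetical crontab entry: run the pipeline at :00 and :30 past each hour.
0,30 * * * * /opt/pipeline/tvocr_update.sh

# tvocr_update.sh (sketch):
#!/bin/bash
set -euo pipefail
for STATION in bbc cnn fox msnbc; do
  ./build_ngram_events.sh "$STATION"   # OCR seconds -> per-second ngram JSON events
  ./append_events.sh "$STATION"        # POST each event file to the Timeseries Insights API
  ./query_anomalies.sh "$STATION"      # run the 12h-vs-72h anomaly query and publish results
done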

Downloading The Dataset

If you want to just dive into the resulting dataset without worrying about the details, download the following URLs every 30 minutes (at :00 and :30 after each hour). Note that because the Internet Archive has a long delay between broadcasts airing and their becoming ready to process, some updates may not have sufficient broadcasts to yield strong anomaly results, so some 12-hour periods may be missing one or more files or their files may have no entries.

The date format is "YYYYMMDDHHMMSS" and since the first available date/time is February 15, 2022 at 17:30 UTC, the first files are:

  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/tv_ocr/20220215173000.bbc.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/tv_ocr/20220215173000.cnn.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/tv_ocr/20220215173000.fox.anomalies.json.gz
  • http://data.gdeltproject.org/gdeltv3/timeseriesinsights/tv_ocr/20220215173000.msnbc.anomalies.json.gz

To understand the format of the files above and what all of the various fields mean, see the Timeseries Insights API "ForecastResult" documentation.
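
If you simply want to poll for the most recent update, here is a minimal shell sketch (the timestamp-rounding logic is our own illustration):

#!/bin/bash
# Compute the most recent :00/:30 boundary in UTC and fetch all four channels.
if [ "$(date -u +%M)" -ge 30 ]; then HALF=30; else HALF=00; fi
STAMP="$(date -u +%Y%m%d%H)${HALF}00"
for STATION in bbc cnn fox msnbc; do
  URL="http://data.gdeltproject.org/gdeltv3/timeseriesinsights/tv_ocr/${STAMP}.${STATION}.anomalies.json.gz"
  # -f makes curl fail quietly on 404s, since some periods have no file.
  curl -sfO "$URL" || echo "no file for ${STATION} at ${STAMP}"
done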

Creating The Dataset

The first step in creating this dataset was to create an initial static "seed" dataset containing a week of OCR ngrams from the Visual Global Entity Graph 2.0. Since BBC News London does not carry commercial advertising, we use a simplified query for it that does not filter advertising from news airtime:

with data as (
SELECT FARM_FINGERPRINT(iaClipUrl) groupId, STRUCT('Ngram' as name, INITCAP(ngram, " -") as stringVal) as dimensions, FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(
    TRIM(REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( LOWER(OCRText), r'\d+:\d+\s*(Am|Pm)*',' ') , r'([^\PP\-\'])', r' \1 '), r'[\pZ\s]+', ' '))
  , ' '), [1,2], ' ')) as ngram where length(ngram) > 3 AND NOT REGEXP_CONTAINS(ngram, r'([,.?!|:\/\\"•<>{}\[\]=+*&]| [\-\']|[\'\-] )')
and station='BBCNEWS' and DATE(date) >= "2022-02-03" and DATE(date) <= "2022-02-11"
) select eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

At first glance, the query might seem a bit complex, but that is largely due to the regular expressions used to prefilter the ngrams. At its core, the query compiles a list of all of the OCR'd onscreen text recognized by the Video AI API for BBC News from February 3, 2022 to February 11, 2022, aggregated by second. For each second of OCR'd airtime, it lowercases the text (since television news typically uses a mixture of upper, lower and titlecase text that reflects visual styling rather than meaning), then applies a series of regular expressions to clean the output. The first pass removes timestamps, since many stations display the live time somewhere onscreen, which obviously varies through the course of the day and would yield meaningless anomalies (over a longer time horizon the API would actually learn to ignore this via seasonality, but since we are looking at only a 3-day window we filter it out). The second pass places spaces around all punctuation other than hyphens and apostrophes, using the Unicode-aware punctuation character class. All spans of one or more spaces are then collapsed to single spaces.

The resulting cleaned text is then passed to BigQuery's built-in ML.NGRAMS() function, which handles all of the ngramming work, moving a rolling window over the text to compile all one and two word shingles and outputting them as an array. Each unigram (single word) and bigram (two word phrase) is then checked for length, and ngrams of three or fewer characters are discarded. This also removes the majority of punctuation, since our earlier regular expression placed spaces around all punctuation, turning stray punctuation into short standalone tokens. On occasion, however, OCR errors or typographical or technical errors in the onscreen text will let punctuation survive this initial filtering, so the final regular expression removes those entries, as well as entries that start or end with a hyphen or apostrophe (rather than having it appear within the word).

Finally, each unigram and bigram that survives all of this filtering is titlecased and compiled into an array under a unique numeric "groupId" for its second of airtime. Since the Timeseries Insights API requires that events be organized under a numeric groupId, we use the FarmHash of the Internet Archive clip URL, which uniquely identifies the broadcast and the specific second of airtime, to generate the necessary 64-bit numeric ID. This static dataset will act as the initial historical comparison period for the live dataset.

For CNN, Fox News and MSNBC, the query is a bit more complicated, since we have to filter out advertising:

with data as (
SELECT FARM_FINGERPRINT(iaClipUrl) groupId, STRUCT('Ngram' as name, INITCAP(ngram, " -") as stringVal) as dimensions, FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime FROM `gdelt-bq.gdeltv2.vgegv2_iatv`,
  UNNEST(ML.NGRAMS(SPLIT(
    TRIM(REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( LOWER(OCRText), r'\d+:\d+\s*(Am|Pm)*',' ') , r'([^\PP\-\'])', r' \1 '), r'[\pZ\s]+', ' '))
  , ' '), [1,2], ' ')) as ngram where length(ngram) > 3 AND NOT REGEXP_CONTAINS(ngram, r'([,.?!|:\/\\"•<>{}\[\]=+*&]| [\-\']|[\'\-] )')
and station='CNN' and DATE(date) >= "2022-02-03" and DATE(date) <= "2022-02-11"
and date not in (SELECT date FROM `gdelt-bq.gdeltv2.iatv_aif_vidtime` WHERE DATE(date) >= "2022-02-03" and DATE(date) <= "2022-02-11" and station='CNN' and type!='NEWS')
) select eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId

Here the query is nearly identical, but with the addition of the "and date not in" clause that filters according to the ASR-aligned "video time" advertising information dataset. The query above is repeated individually for each of the three channels.
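
To land the query results in GCS as newline-delimited JSON, one approach (a sketch; the dataset, table and bucket names are placeholders) is to materialize each query into a table and export it:

# Materialize the seed query (saved in seed-bbc.sql) into a table, then
# export it to GCS as sharded newline-delimited JSON.
bq query --use_legacy_sql=false --destination_table='[DATASET].tvocr_seed_bbc' "$(cat seed-bbc.sql)"
bq extract --destination_format=NEWLINE_DELIMITED_JSON '[DATASET].tvocr_seed_bbc' 'gs://[BUCKET]/times-bbc-*.json'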

Since BigQuery shards exports into many small files, we reassemble them into a single JSON file:

gsutil -q cat gs://[BUCKET]/times-bbc-*.json | gsutil cp - gs://[BUCKET]/tvocr-20220203-20220211-bbc.json

We then create the new dataset in the Timeseries Insights API and seed it using this initial static historical dataset. The command below creates the new dataset, specifies the fields to expect, indicates that we expect to stream new data into it after it is created and gives the GCS path of the static dataset:

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets -d '{
  "name":"tvocr-bbc-production",
  "dataNames": [
    "Ngram"
  ],
  "dataSources": [
    { "uri":"gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/tvocr-20220203-20220211-bbc.json" }
  ]
}'

The API will now create the dataset. We can check its status to see when it has finished loading:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets
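
To block until loading completes, you can poll that endpoint in a loop. Here is a sketch using jq, assuming the dataset's documented "state" field reports LOADED once ingestion finishes:

# Poll every 60 seconds until the dataset reports state LOADED.
while true; do
  STATE=$(curl -s -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets \
    | jq -r '.datasets[] | select(.name | endswith("tvocr-bbc-production")) | .state')
  [ "$STATE" = "LOADED" ] && break
  sleep 60
done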

Once the dataset has been created, we can now append to it in realtime!

Every 30 minutes, we have a PERL script that essentially repeats the BigQuery query above, but narrowed to search just the previous 30 minutes for new broadcasts that have completed Internet Archive processing, Video AI API annotation, ASR-based captioning alignment and advertising information file compilation. Typically the Internet Archive completes processing videos a number of hours after airing, so the shows that become available every 30 minutes did not air in the last 30 minutes (they likely aired many hours ago), but have just completed the entire processing pipeline in the last 30 minutes and so are ready to be appended to the Timeseries Insights API.

While FarmHash is readily available in PERL, it is not easily accessible in every scripting language, so to make a more universal solution, we replace the "FARM_FINGERPRINT(iaClipUrl)" of the BigQuery query with a bit of creative PERL that converts the URL to UTF8, then takes its MD5. Since an MD5 is 128-bits and the API requires a 64-bit groupID, we truncate the MD5 to its first 8 bytes and use PERL's "unpack" function to interpret it as a signed quad, also known as a long long or a signed 64-bit integer ("int64"):

use Encode qw(encode_utf8);
use Digest::MD5 qw(md5);

# MD5 the UTF-8 encoded clip URL, keep the first 8 of its 16 bytes and
# reinterpret them as a signed 64-bit integer ('q'), then stringify for JSON.
my $md5 = md5(encode_utf8($iaClipUrl));
my $groupId = unpack('q', substr($md5, 0, 8)) . '';

This does mean that there is a slight mismatch between the groupIds used in the initial static dataset and the truncated MD5-based groupIds used by the live-appended dataset, but there should be minimal issues for this application.

Note carefully that we take the resulting 64-bit numeric datatype and convert it to a string. The reason is that int64 values cannot be safely exchanged as numeric datatypes in JSON (JavaScript numbers lose precision beyond 2^53), so Google recommends that they be cast into strings.

How do we then append these new broadcasts to our Timeseries Insights API dataset? It turns out that the API makes this trivial!

Every 30 minutes, we compile a list of all of the new broadcasts that have become available since the last update. We compute a groupId for each iaClipUrl using the code above and compile all of its OCR unigrams and bigrams into an array. We then create a separate JSON record for each second of airtime (each unique groupId has to be its own separate file) that is an array of "events" with the "groupId" and "eventTime" of that second's airing, plus an array of "dimensions" recording all of the OCR unigrams and bigrams from that second.

This means that for a one-hour broadcast, we will have 3,600 individual files on disk to be appended to the API. Each looks like this:

{ "events": [ { "groupId": "-9201979017926769511", "eventTime": "2022-02-09T20:07:11+00:00", "dimensions": [ {"name": "Station", "stringVal": "BBCNEWS"},{"name": "Ngram", "stringVal": "News"},{"name": "Ngram", "stringVal": "Rules"},{"name": "Ngram", "stringVal": "March Signals"},{"name": "Ngram", "stringVal": "Had Been"},{"name":"Ngram", "stringVal": "End Late"},{"name": "Ngram", "stringVal": "England Could"},{"name": "Ngram", "stringVal": "Rules Had"},{"name": "Ngram", "stringVal": "Cheshire"},{"name": "Ngram", "stringVal": "The End"},{"name": "Ngram", "stringVal": "Due To"},{"name": "Ngram", "stringVal": "In England"},{"name": "Ngram", "stringVal": "Expire At"},{"name": "Ngram", "stringVal": "Signals"},{"name": "Ngram", "stringVal": "Rules Current"},{"name": "Ngram", "stringVal": "Expire"},{"name": "Ngram", "stringVal": "Bbc News"},{"name": "Ngram", "stringVal": "Live Cheshire"},{"name": "Ngram", "stringVal": "Cheshire Bbc"},{"name": "Ngram", "stringVal": "England"},{"name": "Ngram", "stringVal": "Of March"},{"name": "Ngram", "stringVal": "Coronavirus"},{"name": "Ngram", "stringVal": "Late"},{"name": "Ngram", "stringVal": "Could"},{"name": "Ngram", "stringVal": "Late Feb"},{"name": "Ngram", "stringVal": "Feb "},{"name": "Ngram", "stringVal": "At The"},{"name": "Ngram", "stringVal": "Been"},{"name": "Ngram", "stringVal": "Covid Cases"},{"name": "Ngram", "stringVal": "Current"},{"name": "Ngram", "stringVal": "March"},{"name": "Ngram", "stringVal": "Current Rules"},{"name": "Ngram", "stringVal": "To Expire"},{"name": "Ngram", "stringVal": "Live"},{"name": "Ngram", "stringVal": "End Of"},{"name": "Ngram", "stringVal": "Cases"},{"name": "Ngram", "stringVal": "Covid"},{"name": "Ngram", "stringVal": "Could End"},{"name": "Ngram", "stringVal": "News Coronavirus"},{"name": "Ngram", "stringVal": "For Covid"},{"name": "Ngram", "stringVal": "Cases In"},{"name": "Ngram", "stringVal": "Been Due"} ] }] }

Next, we get our bearer token for authenticating to the API:

my $bearertoken = `gcloud auth application-default print-access-token`; $bearertoken=~s/\s+$//;

And then we use CURL to submit each to the API in parallel (this assumes the files are numbered sequentially from 0 to X):

seq 0 $NUMBERFILES | parallel -j 30 "curl -s -H 'Content-Type: application/json' -H 'Authorization: Bearer $bearertoken' -X POST -d @$CACHE_TMP/{}.json https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets/tvocr-bbc-production:appendEvents -o $CACHE_TMP/{}.json.ret"

Note that the command above does not check the return code or response from the API, so it will not automatically respond to rate limiting or API errors. A production-grade application would include appropriate handling for the various potential API error codes, but for the sake of simplicity, the pipeline presented in this demonstration omits that error handling logic to focus on the basic API interaction.
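
As one illustration of what such handling might look like, here is a hedged sketch of a retry wrapper with exponential backoff around the append call (the retry policy is our own choice, not an API requirement):

# Append one event file, retrying up to 5 times with exponential backoff
# on any non-200 response (e.g. rate limiting or transient server errors).
append_with_retry() {
  local FILE=$1 DELAY=1 HTTP
  for ATTEMPT in 1 2 3 4 5; do
    HTTP=$(curl -s -o "${FILE}.ret" -w '%{http_code}' \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $bearertoken" \
      -X POST -d @"$FILE" \
      https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets/tvocr-bbc-production:appendEvents)
    [ "$HTTP" = "200" ] && return 0
    sleep "$DELAY"; DELAY=$((DELAY * 2))
  done
  echo "append failed for $FILE (HTTP $HTTP)" >&2
  return 1
}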

That's it! The API ingests each appended record and automatically updates the underlying dataset!

Finally, it is time to query the dataset!

The Timeseries Insights API offers a wealth of parameters that can be adjusted to fit any application, and the best settings for a given application will vary with the characteristics of the underlying dataset. In this case, after extensive experimentation, the following query was found to yield the best overall results (searching the most recent rolling 12 hours against the previous rolling 72 hours):

time curl -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://timeseriesinsights.googleapis.com/v1/projects/[PROJECTID]/datasets/tvocr-cnn-production:query -d '{
  detectionTime: "2022-02-15T00:00:00Z",
  slicingParams: {
    dimensionNames: ["Ngram"]
  },
  timeseriesParams: {
    forecastHistory: "259200s",
    granularity: "43200s",
    minDensity: 0
  },
  forecastParams: {
    sensitivity: 0.25,
    noiseThreshold: 0,
    seasonalityHint: "DAILY"
  },
  returnNonAnomalies: false,
  returnTimeseries: true
}' > RESULTS.TXT; grep stringVal RESULTS.TXT | head -20;

You can see more technical detail about each of the parameters above in the API's documentation. The "detectionTime" and "granularity" parameters tell the API the period of time to search for anomalies. In this case, since the detectionTime is 00:00 UTC on February 15, 2022 and the granularity is 43200 seconds (12 hours), the API will search from 12:00 UTC on February 14, 2022 to 00:00 UTC on February 15, 2022 for anomalies. The "forecastHistory" parameter tells the API how far into the past (from the detectionTime) to look to determine a baseline against which to compare the search period when looking for anomalies. In this case we are telling the API to look 259,200 seconds (3 days) prior to the detectionTime. The API also uses the granularity to bin the historical comparison period, so in this case it will bin the previous three days into 12-hour-long periods.
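
Spelled out, the window arithmetic for the query above works out as follows:

detectionTime    = 2022-02-15T00:00:00Z
granularity      = 43200s  (12 hours)
forecastHistory  = 259200s (72 hours)

detection window = detectionTime - granularity .. detectionTime
                 = 2022-02-14T12:00Z .. 2022-02-15T00:00Z
history window   = detectionTime - forecastHistory .. detectionTime
                 = 2022-02-12T00:00Z .. 2022-02-15T00:00Z, binned into 12-hour slices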

The other parameters fine-tune what the API sees as an "anomaly": minDensity lets you filter out rare entities that aren't mentioned very often, sensitivity controls just how anomalous an entity must be to be flagged and noiseThreshold filters out jitter. We also set the seasonalityHint to suggest that there are likely macro-level daily trends. The returnTimeseries parameter tells the API that for each anomaly it should return the full 72-hour historical timeline that was used to evaluate it, making it easy for end applications to display its trending timeline. For debugging purposes, returnNonAnomalies can be set to true to tell the API to output every single entity it examined, even those it ultimately determined were not anomalies, in order to understand the total universe of entities it is considering for each query.

In the case of the dataset above, we run the query above every 30 minutes, once for each of the four channels and output the results to GCS for you to download!

Below is a sample entry that shows what the API output looks like. Each file is wrapped with an outer "anomalyDetectionResult" that, in turn, contains an array of "anomalies". Each anomaly is a JSON object that contains a wealth of information about it, and the "ForecastResult" API documentation provides more information about the meaning of each of these fields. In this case, the entity is the Olympics-related unigram "Skater" and the "history" array contains the timeline of how many times it was seen in each 12-hour period over the previous three days. You can immediately see that it was displayed onscreen for just 7 seconds on February 12th and just 4 seconds on February 13th, but from midnight to noon on February 14th it appeared for 2,763 seconds of airtime, then from noon to midnight on February 14th it appeared for another 2,758 seconds, marking a strong departure from the previous two days.

{
   "anomalyDetectionResult" : {
      "anomalies" : [
         {
            "dimensions" : [
               {
                  "stringVal" : "Skater",
                  "name" : "Ngram"
               }
            ],
            "status" : {},
            "result" : {
               "detectionPointForecastLowerBound" : 2758,
               "detectionPointForecastUpperBound" : 2768,
               "label" : "ANOMALY",
               "holdoutErrors" : {
                  "rmd" : 0.00180962721679334,
                  "mdape" : 0.00180962721679334
               },
               "trainingErrors" : {},
               "detectionPointForecast" : 2763,
               "forecastStats" : {
                  "density" : "80"
               },
               "forecast" : {
                  "point" : [
                     {
                        "value" : 2763,
                        "time" : "2022-02-15T00:00:00Z"
                     }
                  ]
               },
               "history" : {
                  "point" : [
                     {
                        "value" : 7,
                        "time" : "2022-02-12T12:00:00Z"
                     },
                     {
                        "time" : "2022-02-13T00:00:00Z",
                        "value" : 4
                     },
                     {
                        "value" : 2763,
                        "time" : "2022-02-14T00:00:00Z"
                     },
                     {
                        "time" : "2022-02-14T12:00:00Z",
                        "value" : 2758
                     }
                  ]
               }
            }
         },
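
To work with these files programmatically, a tool as simple as jq goes a long way. For example, this sketch lists each anomalous ngram alongside its most recent 12-hour count, sorted descending:

# Extract each anomaly's ngram and latest history value, highest counts first.
zcat 20220215173000.cnn.anomalies.json.gz \
  | jq -r '.anomalyDetectionResult.anomalies[] | [.dimensions[0].stringVal, .result.history.point[-1].value] | @tsv' \
  | sort -t$'\t' -k2 -rn | head -20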

We're tremendously excited to see what kinds of new applications you're able to create with this incredible new dataset! Look for announcements in the coming weeks as we integrate this data into some of our interfaces. Remember that this is an experimental dataset, so we may adjust some of the parameters over time and the API may add or remove fields in the JSON.