The GDELT Project

Peering Into The Visual Landscape Of Half A Billion News Images With Google’s Cloud Inference API

At its core, the GDELT Project is about helping us understand and make sense of the rapidly evolving world around us. From estimating global risk in realtime to letting us see the world through the eyes of others, GDELT is about bringing the world together, breaking down linguistic and geographic barriers to create a global open dataset that captures the realtime heartbeat of the planet we call home.

Today GDELT’s core datasets span more than 3.2 trillion datapoints covering more than 200 years of human history.

Powering this immense realtime catalog of global society’s events, narratives, dreams and fears is a vast array of algorithms, from simple pattern matching and bag-of-words tools to massive statistical models and the most advanced deep learning approaches currently available, among them Google’s Cloud Vision API.

Since GDELT began using the Cloud Vision API in December 2015 to peer into the visual landscape of the world’s news imagery, it has processed nearly half a billion images totaling almost a quarter trillion pixels from news outlets in almost every country. For each image, more than 10,000 objects and activities can be recognized, more than a million topics identified from its captions, its geographic location estimated, logos recognized, text in 56 languages extracted, and even the number of human faces counted along with an estimate of how happy or sad each face appears. In total, the Cloud Vision API has cataloged more than 321 billion datapoints totaling 8.9TB of JSON describing all of that imagery.

This extraordinarily rich dataset offers us unprecedented opportunities to understand the visual landscape of the global news media, to quantitatively explore and understand the visual lens through which we “see” the world.

What if we took a single month of this data, spanning 10 million images from April 2018, and asked a simple but powerful question: what are the defining visual characteristics of a “violent” image?

The Cloud Vision API allows images to be run through Google’s SafeSearch algorithm, which flags images it believes depict some level of violence, making it relatively simple for us to select those images and ask for the most common visual elements found in them.

Given that the entire dataset is loaded into Google BigQuery, we can ask this question with a single Legacy SQL query using some of the fields we’ve pre-extracted:

select Label, count(1) as cnt from (
 SELECT REGEXP_REPLACE(SPLIT(Labels, '<RECORD>'), r'<FIELD>.*', '') Label FROM [gdelt-bq:gdeltv2.cloudvision_partitioned] where Labels is not null
 and (SafeSearch like '1<FIELD>%' or SafeSearch like '2<FIELD>%') and _PARTITIONTIME >= "2018-04-01 00:00:00" AND _PARTITIONTIME < "2018-05-01 00:00:00"
 ) group by Label order by cnt desc
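
If you prefer the command line to the BigQuery web UI, the same query can be run with the bq tool; a quick sketch, assuming the bq command-line utility is installed and authenticated against a project with BigQuery enabled:

bq query --use_legacy_sql=true --format=csv \
'select Label, count(1) as cnt from (
 SELECT REGEXP_REPLACE(SPLIT(Labels, "<RECORD>"), r"<FIELD>.*", "") Label FROM [gdelt-bq:gdeltv2.cloudvision_partitioned] where Labels is not null
 and (SafeSearch like "1<FIELD>%" or SafeSearch like "2<FIELD>%") and _PARTITIONTIME >= "2018-04-01 00:00:00" AND _PARTITIONTIME < "2018-05-01 00:00:00"
 ) group by Label order by cnt desc limit 15'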

In just 14 seconds BigQuery processes 162GB of data and offers us a histogram of the 6,432 topics identified by the Cloud Vision API in images that Google’s SafeSearch algorithm flagged as either “Likely” or “Very Likely” to depict “violence” of some kind (only the first 15 are shown below).

Label Matches
geological phenomenon 69651
person 58050
vehicle 53773
soldier 50267
mouth 48253
head 40932
nose 39986
food 39459
troop 37028
soil 36223
military 31128
jaw 31112
people 30771
hand 30415
skin 30104

At first glance this list seems quite reasonable. The most common label, “Geological Phenomenon,” reflects images of large-scale destruction, while the military and vehicle categories are naturally associated with violence. The prevalence of specific body-part labels reflects their morbid presence in images depicting the aftermath of violent activity.

Looking more closely, however, the list is far less specific than one might expect, and categories like people and vehicles are also likely to appear heavily in images unrelated to violence. In short, while this list gives us the labels that appear most often in violent images, those labels are likely to be just as common in non-violent images.

To examine this further, let’s use this same approach, together with the Cloud Vision API’s WebEntities feature, to compare two categories: images found somewhere on the web with a caption mentioning Syria and images captioned Donald Trump (the API relies on the textual captions of images for this, since it does not perform facial recognition).

Again, through the magic of BigQuery, we are able to parse 381GB of JSON in just 13 seconds to get the top 15 labels most commonly found in images whose captions mention Donald Trump:

select Label, count(1) as cnt from (
 SELECT REGEXP_REPLACE(SPLIT(Labels, '<RECORD>'), r'<FIELD>.*', '') Label FROM [gdelt-bq:gdeltv2.cloudvision_partitioned] where Labels is not null
 and JSON_EXTRACT(RawJSON, "$.responses[0].webDetection") like '%Donald_Trump%' and _PARTITIONTIME >= "2018-04-01 00:00:00" AND _PARTITIONTIME < "2018-05-01 00:00:00"
 ) group by Label order by cnt desc limit 15

 

Label Matches
professional 201797
profession 187714
official 169055
speech 102720
entrepreneur 101606
businessperson 95872
diplomat 90430
spokesperson 86779
public speaking 79781
suit 69061
party leader 61339
business executive 60005
speaker 58762
public relations 54802
product 51995

And repeat the process for all of the images whose captions mentioned Syria:

select Label, count(1) as cnt from (
 SELECT REGEXP_REPLACE(SPLIT(Labels, '<RECORD>'), r'<FIELD>.*', '') Label FROM [gdelt-bq:gdeltv2.cloudvision_partitioned] where Labels is not null
 and JSON_EXTRACT(RawJSON, "$.responses[0].webDetection") like '%Syria%' and _PARTITIONTIME >= "2018-04-01 00:00:00" AND _PARTITIONTIME < "2018-05-01 00:00:00"
 ) group by Label order by cnt desc limit 15

 

Label Matches
profession 94075
professional 88161
official 75289
speech 47894
diplomat 47750
entrepreneur 46285
businessperson 44417
spokesperson 42742
public speaking 38308
vehicle 37151
product 34145
speaker 31967
sky 31272
party leader 28672
staff 25609

With the exception of a few categories appearing in a different order, the two lists are nearly indistinguishable.

How could this be? The answer is that, as in any cataloging system, certain topics are considerably more common than others. Those common labels show up at the top of every histogram, regardless of the query being investigated, meaning that no matter what we search for we get pretty much the same results.

What we need is a more powerful way of asking what is “unique” about each search. In other words, which topics are more strongly associated with our search term than they are with everything else, rather than simply which terms appear the most.

Google’s Cloud Inference API

Enter the new Google Cloud Inference API. The Cloud Inference API is an incredibly powerful analytics platform that allows you to ingest massive datasets and then run realtime queries to ask questions like “what values are most uniquely associated with my search term?” or use time series analysis to ask “for users that buy X, what do they typically buy a day later, a week later or a month later?” We will just scratch the surface of its capabilities here.

Cloud Inference API uses a simple JSON structure that groups datapoints together via GroupIDs that can represent everything from browsing sessions to documents. In our case, we will treat each appearance of an image in a document as a “group” and assign it a unique GroupID. Thus, a document with a single image will be treated as a single group, while a document containing 20 images will be treated as 20 separate groups. Combining documents and images like this offers us the ability to explore questions such as what kinds of images are found in news articles whose text emphasizes a particular topic.

To get started, we wrote some quick Perl scripts to parse the GDELT GKG and VGKG CSV file formats, merge them together and write the final JSON format needed by the Cloud Inference API. This is fairly straightforward and primarily requires you to think about how you wish to group your data and which fields to include for analysis.

We’ve made the full resulting JSON file, with its 675,815,955 rows (each value is recorded in a separate row), available at http://data.gdeltproject.org/blog/2018-google-inference-api/Data/MASTER-GOOGINFER-VGKG.201804.json, but be advised that it is 83GB in size; rather than downloading it, you can simply reference it as-is in GCS using the code we present in a moment when creating your Inference dataset. We’ve also made a sample version of the file containing just the first 25 million rows, only 3.1GB, for those who want to experiment with a smaller dataset first. The sample dataset is available at http://data.gdeltproject.org/blog/2018-google-inference-api/Data/MASTER-GOOGINFER-VGKG.201804.25MSAMPLE.json.
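
To see exactly what the per-row JSON layout looks like before loading anything, you can simply peek at the first few rows of the sample file:

# Stream the first three rows of the 25-million-row sample (each value is its own row)
curl -s http://data.gdeltproject.org/blog/2018-google-inference-api/Data/MASTER-GOOGINFER-VGKG.201804.25MSAMPLE.json | head -n 3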

Then sign up for the Cloud Inference API and follow the instructions to get it set up for your account. Note for those familiar with Google’s previous machine learning APIs: the Cloud Inference API uses Google’s newer authentication flow.
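
As a rough sketch, and assuming the service name matches the infer.googleapis.com endpoint used in the examples below, enabling the API and authenticating the gcloud command line looks something like this (follow the official signup instructions if they differ):

# Enable the Cloud Inference API on your project and authenticate gcloud,
# which supplies the OAuth access tokens used by the curl examples below
gcloud services enable infer.googleapis.com --project=[YOURPROJECTIDHERE]
gcloud auth login
gcloud config set project [YOURPROJECTIDHERE]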

Once the Cloud Inference API has been enabled, create the new dataset and load the data into it. You can use the code below to load the April 2018 GKG+VGKG dataset we’ve created. Note that the full dataset is 83GB, so you can also use our smaller sample dataset that is only 3.1GB by replacing “MASTER-GOOGINFER-VGKG.201804.json” with “MASTER-GOOGINFER-VGKG.201804.25MSAMPLE.json”.

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets -d '{
 "datasets": [
 {
  "name":"vgkg201804",
  "dataNames": [
    "PageURL",
    "PageDomain",
    "PageCountry",
    "PageLanguage",
    "PageTextTheme",
    "PageTextGeo",
    "ImageURL",
    "ImagePopularityRawScore",
    "ImagePopularity",
    "ImageSafeSearch",
    "ImageLabel",
    "ImageWebEntity",
    "ImageWebEntityBestGuessLabel",
    "ImageGeoLandmark",
    "ImageLogo",
    "ImageFaceCount",
    "ImageFaceToneAvg",
    "ImageFaceToneHas"
  ],
  "dataSources": [
    { "uri":"gs://data.gdeltproject.org/blog/2018-google-inference-api/Data/MASTER-GOOGINFER-VGKG.201804.json " }
  ]
 }
 ]
}'

At this point the Cloud Inference API will create the new dataset and load all of the data into it. This step could take an hour or more, so follow the instructions in the Cloud Inference API documentation for tracking its progress and determining when it has finished.
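
One simple way to check on the load, assuming the API exposes a list method on the same datasets collection used above (consult the documentation if your version differs), is to request the dataset list and look at each dataset’s reported state:

curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets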

Once the dataset has finished loading, we can start querying it. Behind the scenes, the Cloud Inference API has already created all of the necessary indexes to be able to offer query response times of just 1-2 seconds or less.

Returning to our original question of what image labels are most closely associated with violent images, the Cloud Inference API’s query syntax makes this trivial:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageSafeSearch",
      "value": "Violence"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageLabel",
      "maxResultEntries": 10,
    }
  }]
}'

This query asks the Cloud Inference API to find all of the images that were flagged by Google’s SafeSearch algorithm as depicting violence and then asks it to determine the most relevant ImageLabel values associated with those images. In other words, those values that are more associated with violent images than non-violent images. The API’s JSON response, seen below, shows the top labels and information about their distributions.

{
  "results": [
    {
      "distributions": [
        {
          "dataName": "ImageLabel",
          "matchedGroupCount": "1987",
          "totalGroupCount": "7616785",
          "entries": [
            {
              "value": "ImageLabel=blood",
              "score": 41.451675,
              "matchedGroupCount": "78",
              "totalGroupCount": "276"
            },
            {
              "value": "ImageLabel=flesh",
              "score": 38.00879,
              "matchedGroupCount": "340",
              "totalGroupCount": "2999"
            },
            {
              "value": "ImageLabel=disease",
              "score": 35.181164,
              "matchedGroupCount": "50",
              "totalGroupCount": "171"
            },
            {
              "value": "ImageLabel=scar",
              "score": 27.231394,
              "matchedGroupCount": "40",
              "totalGroupCount": "171"
            },
            {
              "value": "ImageLabel=slaughterhouse",
              "score": 20.778246,
              "matchedGroupCount": "72",
              "totalGroupCount": "652"
            },
            {
              "value": "ImageLabel=pruritus",
              "score": 17.188293,
              "matchedGroupCount": "15",
              "totalGroupCount": "60"
            },
            {
              "value": "ImageLabel=pig_roast",
              "score": 14.234229,
              "matchedGroupCount": "8",
              "totalGroupCount": "23"
            },
            {
              "value": "ImageLabel=zombie",
              "score": 11.156274,
              "matchedGroupCount": "35",
              "totalGroupCount": "490"
            },
            {
              "value": "ImageLabel=butcher",
              "score": 10.785749,
              "matchedGroupCount": "36",
              "totalGroupCount": "539"
            },
            {
              "value": "ImageLabel=lechon",
              "score": 9.225009,
              "matchedGroupCount": "5",
              "totalGroupCount": "15"
            }
          ]
        }
      ]
    }
  ]
}

In short, the top labels are blood, flesh, disease, scar, slaughterhouse, pruritus (a skin ailment), pig roast, zombie, butcher and lechon – labels that seem far more specific to violent images. Unlike our earlier tally of the most common labels found in violent images, the Cloud Inference API gives us the labels that are most strongly associated with violent images compared with non-violent images, even if they are relatively rare in the corpus as a whole. In other words, those labels that are characteristic of violent imagery.
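
As a rough sanity check on these scores, we can compare the share of violence-flagged groups containing a label against its share of all groups, using the matchedGroupCount and totalGroupCount values returned above. The Inference API’s score is not this simple ratio, but the comparison illustrates why “blood” ranks so highly:

# "blood" appears in 78 of 1,987 violence-flagged groups, but in only
# 276 of 7,616,785 groups overall -- an enrichment of roughly 1,083x
awk 'BEGIN { printf "%.0fx\n", (78/1987) / (276/7616785) }'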

What if we modify the query to search across the GKG and VGKG data in a single search? This time we will ask the Cloud Inference API to find the image labels most closely associated with images appearing in articles whose textual contents mention a location somewhere in Syria.

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "PageTextGeo",
      "value": "Syria"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageLabel",
      "maxResultEntries": 10,
    }
  }]
}'

In just two seconds the Cloud Inference API returns the results, listing earthquake, rubble, disaster, demolition, Toyota Land Cruiser, military, diplomat, military organization, ruins and army as the most closely associated labels.

The inclusion of Toyota Land Cruiser in that list, which historically was the vehicle of choice of the Islamic State’s ground forces, is remarkable and stands testament both to the power of the Cloud Vision API to identify even specific vehicle makes from global news imagery and to the Cloud Inference API’s ability to tease out its association with visual coverage of the Syrian conflict.

With just a single query and less than 2 seconds of response time we were able to use the Cloud Inference API to blend the textual GKG and the visual VGKG to search for articles that mentioned a Syrian location in their text and to find the underlying patterns in their images.

Returning to our earlier pair of queries for the labels most associated with images captioned Donald Trump and with those captioned Syria, we can use the following two queries. The first is for Donald Trump:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Donald_Trump"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageLabel",
      "maxResultEntries": 10,
    }
  }]
}'

And the second for Syria:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Syria"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageLabel",
      "maxResultEntries": 10,
    }
  }]
}'

For Donald Trump-captioned images, the Cloud Inference API tells us the most closely associated labels are speech, official, party leader, politician, diplomat, spokesperson, speaker, executive officer and public speaking.

In contrast to our earlier results where the Donald Trump and Syria images returned nearly identical sets of labels, using the Cloud Inference API the Syria-captioned images return a starkly different set of labels: earthquake, rubble, Toyota Land Cruiser, Toyota Sequoia, disaster, bomb, ruins, demolition and military organization.

Finally, we are getting results that really distinguish our searches!

By changing a single parameter in the query, we can ask for the top web entities (the topical labels computed from each image’s caption text) instead of the visual labels for Syria-captioned images:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Syria"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageWebEntity",
      "maxResultEntries": 10,
    }
  }]
}'

The results once again make a lot of sense: Donald Trump, President of the United States, White House, United States, Presidency of Donald Trump, Republican Party, Special Counsel Investigation, Melania Trump, Barack Obama and James Comey. Again, these topics are extracted from the textual captions used to describe the images, rather than from visual recognition of the image itself, hence their greater specificity.

We can also ask more unusual questions, such as which countries’ news outlets are most associated with publishing imagery depicting protest activity:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageLabel",
      "value": "Protest"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "PageCountry",
      "maxResultEntries": 10,
    }
  }]
}'

The resulting country list offers an intriguing starting point for those interested in exploring worldwide protest activity in April 2018 and those interested in global differences in the depictions of protest activity in the press: United States, France, Indonesia, Mexico, Spain, Israel, United Kingdom, Bangladesh, Turkey and Italy.

What about using the Cloud Inference API to explore what makes for “happy” images? The Cloud Vision API counts the number of clearly visible human faces in each image and estimates whether each face displays any of four emotions: joy, sorrow, anger and surprise. Using this information, we can ask the Cloud Inference API which visual elements are most closely associated with images containing a majority of happy faces:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageFaceToneAvg",
      "value": "MostlyHappy"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageLabel",
      "maxResultEntries": 10,
    }
  }]
}'

In an interesting commentary on how the world’s news outlets choose the imagery they publish, it seems that the most strongly associated labels are: smile, socialite, girl, fun, professional, facial expression, long hair, friendship, brown hair and beauty. Using web entities instead of labels, the list becomes socialite, girl, smile, hair, celebrity, facial expression, friendship, long hair, public relations and blond hair.

If you think about it, these labels make a lot of sense as the visual elements and textual captions of images that depict mostly smiling human faces. While routine general news images may contain people smiling, they tend to cover a vast array of topics that are also found in images where people aren’t smiling. After all, for every beaming politician at a podium announcing a legislative success, there is another somber-faced politician making a public apology. In contrast, images of celebrities and of beauty products tend to focus on smiling people more than any other topic does, a finding borne out by a manual review of a random selection of images flagged as containing happy faces and of those that were not.

To put it another way, images of celebrities in the mainstream press tend to uniquely focus on glamorous smiling photos of them at press events.

In contrast, images depicting angry faces tend to be almost exclusively sports related: rugby player, championship, international rules football, rugby league, team sport, player, soccer player, tournament, sports and football player. This is because the sports imagery published in news outlets tends to depict key moments of exertion on the field, such as a player kicking a ball, and the player’s facial expression at that instant offers a textbook example of anger.

In short, sports imagery seems to prefer images of exertion rather than elation.

Finally, we can use the time series capabilities of the Inference API to ask some fascinating questions about the temporal patterns of visual depiction in the news media.

For example, do images of Donald Trump tend to be followed by more images of him published shortly thereafter, or do news outlets run a single image of him and then switch topics? One might imagine that, given his position as a head of state, any burst of news about the president is likely to spark a flurry of additional coverage over the following hours, rather than a single story that is never followed up.

To test this, the query below searches for all images captioned Donald Trump and then asks what topics are most closely associated with subsequent images that appear within the next 30 minutes:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Donald_Trump"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageWebEntity",
      "maxResultEntries": 10,
      "maxAfterTimespan": "1800s"
    }
  }]
}'

Notice that all it took to add a time series dimension to the query was the addition of the “maxAfterTimespan” parameter, making it trivial to convert any static query into a time series one!

The query above shows that images captioned as Donald Trump are typically followed in the next 30 minutes by images captioned Donald Trump, President of the United States, Barack Obama, White House, Melania Trump, United States, Trump’s The Art of the Deal, Robert Mueller, Stephen K. Bannon and Ivanka Trump.

In short, any coverage of the president tends to be followed by at least a 30 minute burst of additional coverage of him. No surprise there.
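
To look further ahead than 30 minutes, only the timespan needs to change. For example, the following variant (identical to the query above except that 86400 seconds, i.e. 24 hours, replaces 1800 seconds) asks what imagery typically follows over the next full day:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Donald_Trump"
      }
    },
    "distributionConfigs": {
      "bgprobExp": 0.7,
      "dataName": "ImageWebEntity",
      "maxResultEntries": 10,
      "maxAfterTimespan": "86400s"
    }
  }]
}'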

What if we want to know which days had particularly notable bursts of imagery of Donald Trump? This query does precisely that, using the special value of “ted” for the “dataName” field, which returns days since the epoch:

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://infer.googleapis.com/v1/projects/[YOURPROJECTIDHERE]/datasets/vgkg201804:query \
  -d'{
  "name": "vgkg201804v2",
  "queries": [{
    "query": {
      "type": 1,
      "term": {
      "name": "ImageWebEntity",
      "value": "Donald_Trump"
      }
    },
    "distributionConfigs": {
      “dataName”:"ted",
      "maxResultEntries": 10
    }
  }]
}'

The results give us the days that saw unusually significant bursts of imagery of Donald Trump, including April 25th (French President Macron’s visit and discussions on Iran and North Korea) and April 14th (US precision strikes on Syria).
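
Since “ted” values are just day counts, converting one back into a calendar date is straightforward; a small sketch, assuming the count is measured from the Unix epoch (GNU date shown):

# Day 17646 since 1970-01-01 corresponds to April 25, 2018
date -u -d "1970-01-01 + 17646 days" +%Y-%m-%d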

Putting this all together, we’ve only scratched the surface of what’s possible with Google’s new Cloud Inference API, but hopefully we’ve demonstrated its powerful applications to understanding news content and other kinds of documents, and how it can be paired with the Cloud Vision API to tease out key patterns in the visual news narratives through which we see the world around us. We can’t wait to see what you come up with!

For those who want to take a really deep dive into this dataset and explore longitudinal change, we've created monthly datasets spanning January 2017 through August 2018, giving you more than a year and a half of data. All of the files combined total more than 2.2TB, so you're unlikely to want to load all of them into your Inference API dataset, but those interested in more advanced explorations might wish to load a few of them to understand how the visual focal points of the news change over larger time horizons.

I’d like to thank Google Cloud and especially the Cloud Inference API, Cloud Vision API and BigQuery teams and Emanuel Taropa, Zhen Ma and Vincent Hoon and the rest of the Cloud Inference API team for their help in creating these examples.