We're excited to announce that as of late yesterday afternoon, Version 1.0 of the GDELT Visual Knowledge Graph (VGKG) is now available, bringing with it many new features and enhancements from the Alpha release! In particular, the Google Cloud Vision API has added a number of new features and enhancements, including a new set of descriptors that characterize the color distribution of the image.
The raw JSON output of the Cloud Vision API offers many additional fields, including the pixel locations of all facial landmarks for each recognized human face in each image, and it will expand over time to incorporate forthcoming capabilities. To ensure access to the latest capabilities, we now include the raw JSON output of the Cloud Vision API with each VGKG record, giving you access to all of these additional fields. NOTE that the specific fields and format of the JSON field may change over time since the Cloud Vision API is still in preview release.
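As a minimal sketch of working with the raw JSON field, the Python function below pulls facial landmark pixel positions out of one record's JSON string. It assumes the field follows the public Cloud Vision response layout ("faceAnnotations" containing "landmarks", each with a "type" and a "position"), and it accepts the object either bare or wrapped in a "responses" list, since the exact wrapping used in the VGKG is not stated here. The function name face_landmarks is hypothetical. Because the format may change, anything missing or malformed is skipped rather than raising an error.

```python
import json


def face_landmarks(raw_json):
    """Return one {landmark_type: (x, y)} dict per face found in raw_json.

    Assumes the Cloud Vision "faceAnnotations" layout; returns an empty
    list when the field is empty, malformed, or contains no faces.
    """
    try:
        doc = json.loads(raw_json)
    except (TypeError, ValueError):
        return []  # missing field or invalid JSON
    if not isinstance(doc, dict):
        return []

    # Assumption: the object may or may not be wrapped in "responses".
    responses = doc.get("responses", [doc])
    if not isinstance(responses, list):
        return []

    faces = []
    for response in responses:
        if not isinstance(response, dict):
            continue
        for face in response.get("faceAnnotations") or []:
            if not isinstance(face, dict):
                continue
            points = {}
            for mark in face.get("landmarks") or []:
                if not isinstance(mark, dict):
                    continue
                pos = mark.get("position") or {}
                if "type" in mark and "x" in pos and "y" in pos:
                    points[mark["type"]] = (pos["x"], pos["y"])
            faces.append(points)
    return faces
```

Each returned dict maps a landmark type to its (x, y) pixel position, so a record with no recognized faces simply yields an empty list.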
Version 1.0 of the VGKG offers vastly enhanced OCR output for non-English languages and non-Latin character sets. In the globalized world of journalism, an article in a French-language newspaper reporting on events in Syria will summarize those events in French, but the accompanying image might show a local sympathy protest in Paris with French signs in the background, or the latest ISIS atrocities in Syria with Arabic signs and material. Images are complex, since they reflect the actual linguistic reality of the region being reported on, even while the textual content of a news article translates events into the local language and perspectives of a news outlet's audience. The Cloud Vision API performs best when it has at least some domain knowledge of the language(s) to expect in an image. Towards that end, we make the simple assumption that the image featured in a given news article will most likely contain text in one of three sets of languages: English, the language the article was written in, or the language(s) most commonly used in the location the article focuses on. We therefore use the V2Locations GKG field to compute the country discussed most often in the article, compile a list of the language(s) most commonly used in that area, combine them with the language of the article itself, along with English, and hand these as language hints to the Vision API. You will note a dramatic increase in recognition of Arabic, Farsi, Thai and other languages in the VGKG 1.0 with this new pipeline.
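The hint-building logic described above might look roughly like the following Python sketch. It is an illustration, not the production pipeline: the COUNTRY_LANGUAGES table is a hypothetical, deliberately tiny lookup, the function names are invented for this example, the country codes are assumed to have already been extracted from the V2Locations field, and the ordering of the hints is an assumption.

```python
from collections import Counter

# Hypothetical lookup of the language(s) most commonly used per country.
# A real table would cover every country; these entries are illustrative.
COUNTRY_LANGUAGES = {
    "SY": ["ar"],
    "FR": ["fr"],
    "IR": ["fa"],
    "TH": ["th"],
}


def primary_country(country_codes):
    """Return the most frequently mentioned country code, or None."""
    counts = Counter(code for code in country_codes if code)
    return counts.most_common(1)[0][0] if counts else None


def build_language_hints(article_language, country_codes):
    """Combine English, the article language, and the location languages.

    Duplicates are dropped while preserving order. The order itself
    (English first) is an assumption made for this sketch.
    """
    country = primary_country(country_codes)
    candidates = ["en", article_language] + COUNTRY_LANGUAGES.get(country, [])
    hints = []
    for lang in candidates:
        if lang and lang not in hints:
            hints.append(lang)
    return hints
```

For the French article about Syria used as the example above, build_language_hints("fr", ["SY", "SY", "FR"]) combines English, French and Arabic into a single hint list.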
The Vision API performs best with images within certain resolution ranges. On the web you can have very small thumbnail images that will yield poor recognition quality, along with massive images that are much larger than the Vision API needs for effective recognition. We now automatically filter out images too small to yield quality results and automatically resize images that are too large using Lanczos resampling.
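A rough Python sketch of this kind of size filtering and downscaling follows. The thresholds (min_side=200, max_side=1600) and the function names are hypothetical, since the actual cutoffs used by the VGKG pipeline are not given here. The Lanczos step uses the Pillow library, which is an assumption about tooling rather than a statement about what GDELT runs.

```python
def plan_resize(width, height, min_side=200, max_side=1600):
    """Return None to skip the image, else the (width, height) to submit.

    min_side and max_side are hypothetical thresholds for illustration.
    """
    if min(width, height) < min_side:
        return None  # too small to yield quality recognition
    longest = max(width, height)
    if longest <= max_side:
        return (width, height)  # already within range, leave untouched
    scale = max_side / longest  # shrink so the longest side equals max_side
    return (max(1, round(width * scale)), max(1, round(height * scale)))


def prepare_image(path, min_side=200, max_side=1600):
    """Open an image and return a suitably sized copy, or None to skip it."""
    # Pillow is imported lazily so plan_resize can be used without it.
    from PIL import Image

    with Image.open(path) as img:
        target = plan_resize(img.width, img.height, min_side, max_side)
        if target is None:
            return None
        if target == img.size:
            return img.copy()
        return img.resize(target, Image.LANCZOS)  # Lanczos resampling
```

Keeping the size decision in a pure function makes it easy to tune the thresholds and test them without touching any image data.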
CAVEATS & DISCLAIMER: EARLY RELEASE
Unlike the other text-based GDELT feeds, please note that this feed is extremely experimental and is being released as an early release. This means that the behavior and supported features of the feed may change at any moment, or it may go away entirely. If ingesting these feeds into an automated workflow, your scripts should perform extensive error checking to ensure that they are able to cope with any changes to the underlying data format. Please check back here on the GDELT Blog on a regular basis for any updates or documentation changes to the format and its behavior.
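As one example of the kind of error checking meant here, the Python sketch below reads a gzip-compressed, tab-delimited VGKG file defensively: a file that fails to decompress and rows with too few fields are reported rather than crashing the script. The function name read_vgkg_rows and the min_fields parameter are hypothetical; set min_fields to the number of columns your own code actually depends on.

```python
import gzip
import zlib


def read_vgkg_rows(gz_bytes, min_fields=2):
    """Split a gzipped tab-delimited file into rows, collecting problems.

    Returns (rows, problems): rows is a list of field lists, problems is a
    list of human-readable messages describing anything that was skipped.
    """
    rows, problems = [], []
    try:
        text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    except (OSError, EOFError, zlib.error) as exc:
        # Covers files that are not gzip, truncated, or otherwise corrupt.
        return rows, [f"could not decompress file: {exc}"]

    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, including the trailing one
        fields = line.rstrip("\r").split("\t")
        if len(fields) < min_fields:
            problems.append(
                f"line {line_no}: expected at least {min_fields} fields, "
                f"got {len(fields)}"
            )
            continue
        rows.append(fields)
    return rows, problems
```

Logging the returned problems, instead of discarding them silently, gives an early warning if the format changes.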
The use of deep learning algorithms for image recognition is still a highly experimental area of active research and the application use case presented by GDELT (attempting to recognize and catalog arbitrary images from almost every corner of the planet on almost every topic imaginable at realtime speed) represents one of the most difficult and wide-ranging applications of such technology today.
What this means is that you will almost certainly encounter a certain level of error in the categorizations and other information computed about each image. Remember that all tags are applied 100% automatically with NO HUMAN INTERVENTION and mistaken categorizations or other tags represent computer algorithm errors, NOT editorial statements. Remember that computer image recognition at these scales is still in its relative infancy and the underlying algorithms are encountering large amounts of imagery utterly unlike anything they've ever seen before, so this data stream is really pushing the boundaries of current deep learning recognition and will make mistakes. Please email email@example.com with the image URL and mistaken metadata fields if you find any particularly significant errors so that the system can be constantly improved and refined.
ACCESSING THE DATA
There are two primary mechanisms for accessing the VGKG data stream: CSV files and Google BigQuery, both of which are updated every 15 minutes:
- Google BigQuery. Similar to the main GKG table, we also populate a publicly accessible table housed in Google BigQuery to make it possible to interactively query and analyze the computed metadata and join it against the main GKG and EVENT tables. NOTE that the "LangHints" and "RawJSON" fields are only available beginning the evening of February 12, 2016. The table is gdelt-bq:gdeltv2.cloudvision.
- Raw CSV Files. Similar to the GKG files, you can access the VGKG data stream via simple CSV files updated every 15 minutes. These files are tab-delimited and gzip compressed. The latest CSV files are released somewhere between 0-5 minutes, 15-20 minutes, 30-35 minutes, and 45-50 minutes after the hour (about 5 minutes after the GKG file is released for each 15 minute increment). To determine when the latest VGKG files are available, check the contents of http://data.gdeltproject.org/gdeltv2_cloudvision/lastupdate.txt, which is updated when the latest update files become available. You could poll this file every 5 minutes to ensure you always download the latest files. NOTE that files are named "imagetags.csv.gz" and "translation.imagetags.csv.gz" from "20151228033000" until the evening of February 12, 2016. Starting with "20160212213000.imagetagsv1.csv.gz" and "20160212211500.translation.imagetagsv1.csv.gz", "v1" is added to the filename.
- Documentation. You can read the documentation for the Alpha release, which details the delimited format used for the parsed fields, or, for records beginning the evening of February 12, 2016, you can simply parse the RawJSON field.
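The polling approach described for the raw CSV files could be sketched in Python as follows. The layout of lastupdate.txt is not described here, so this sketch makes no assumption about its columns and simply collects every whitespace-separated token that looks like a .csv.gz URL. The function names and the idea of tracking already-seen URLs in a set are illustrative choices.

```python
import urllib.request

LASTUPDATE_URL = (
    "http://data.gdeltproject.org/gdeltv2_cloudvision/lastupdate.txt"
)


def extract_file_urls(lastupdate_text):
    """Return every token in the text that looks like a .csv.gz URL."""
    urls = []
    for token in lastupdate_text.split():
        if token.startswith("http") and token.endswith(".csv.gz"):
            urls.append(token)
    return urls


def fetch_new_urls(seen, url=LASTUPDATE_URL, timeout=30):
    """Fetch lastupdate.txt and return file URLs not already in `seen`.

    `seen` is a set that the caller keeps between polls; it is updated
    in place so each file is reported only once.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    fresh = [u for u in extract_file_urls(text) if u not in seen]
    seen.update(fresh)
    return fresh
```

A script could call fetch_new_urls(seen) in a loop with time.sleep(300) between calls to match the suggested 5 minute polling interval, wrapping the call in a try/except for urllib.error.URLError so a transient network failure does not stop the loop.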