VGKG 2.0 Adds Histogram-Based Image Entropy Filtering And WEBP Support

The GDELT Visual Global Knowledge Graph (VGKG) applies a number of filtering steps to candidate images that the document extraction system estimates to be a meaningful part of the news article text. The goal is to reduce the total number of images passed to Google's Cloud Vision API to just those most likely to yield high quality recognition results. As of today we've added three additional filtering passes and included the output of those algorithms in the JSON record for each image. We've also added support for WEBP images and now allow BMP images as well, given their popularity in certain parts of the world.

Each incoming image is resized to a maximum of 1500 x 1500 pixels if needed (smaller images are left as-is) and converted to JPEG format. A color histogram is computed for the resulting image, inventorying every distinct color found in the image and the number of pixels exhibiting that color. Images with too few distinct colors are discarded. We then calculate the percentage of the image's pixels that are one of the top five most common colors and eliminate images that are dominated by just a few colors. In practice, genuine high detail images will not exhibit color dominance: even nighttime photographs will have substantial color range in the darkness and a focal object that occupies much of the frame. Images that do not meet these criteria do not typically possess sufficient detail to yield much in the way of visual recognition results. Finally, the standard normalized histogram-based image entropy score is calculated as -SUM((FREQ/TOT) * LOG2(FREQ/TOT)), where FREQ is the number of pixels exhibiting a given color, TOT is the total number of pixels in the image, and the sum runs over every distinct color. As a purely histogram-based entropy score, it does not take into consideration any of the spatial characteristics of the image, but offers a useful filtering mechanism as a rough estimate of the visual complexity of the image. The three values are now recorded in the "ImageProperties" block of the image JSON as "NumberColors", "Top5ColorsPerc" and "HistogramEntropy" respectively.
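As a rough illustration of the three measures described above, the following Python sketch computes them from a flat list of pixel colors. It is not the VGKG's actual code: the function names, the choice to express "Top5ColorsPerc" on a 0-100 scale, and every threshold value are assumptions made for the example, since the real cutoffs are not stated here.

```python
import math
from collections import Counter


def histogram_stats(pixels):
    """Compute color-histogram measures for an iterable of pixel colors.

    `pixels` is assumed to be a flat sequence of hashable color values,
    for example (r, g, b) tuples taken from an already-decoded image.
    """
    counts = Counter(pixels)          # distinct color -> number of pixels
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image has no pixels")

    # Share of all pixels covered by the five most common colors.
    top5 = sum(freq for _, freq in counts.most_common(5))

    # -SUM((FREQ/TOT) * LOG2(FREQ/TOT)) over every distinct color.
    entropy = -sum((f / total) * math.log2(f / total) for f in counts.values())

    return {
        "NumberColors": len(counts),
        "Top5ColorsPerc": 100.0 * top5 / total,   # 0-100 scale is an assumption
        "HistogramEntropy": max(0.0, entropy),    # avoids -0.0 for one-color images
    }


def passes_filters(stats, min_colors=64, max_top5_perc=75.0, min_entropy=2.0):
    """Apply the three filters. All default thresholds are hypothetical."""
    return (
        stats["NumberColors"] >= min_colors
        and stats["Top5ColorsPerc"] <= max_top5_perc
        and stats["HistogramEntropy"] >= min_entropy
    )
```

For example, an image made up of four colors in equal proportion has an entropy of exactly 2 bits and 100 percent of its pixels in the top five colors, so it would be rejected under almost any choice of thresholds, while a photograph with thousands of distinct colors would score far higher on entropy and far lower on color dominance.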

We've also added WEBP support given the growing number of servers that no longer support automatic JPEG fallback. Historically, when our VGKG crawlers encountered a WEBP image they would negotiate with the hosting server or CDN to request an alternate JPEG or PNG version. However, we've observed a slowly growing number of servers that do not offer alternate formats for WEBP images in select circumstances, so we've added full WEBP decoding. Animated WEBP images are converted to still images using the first frame of the animation. When we first launched the VGKG we assumed that the small number of sites relying on BMP images would largely switch over to JPEG or PNG formats as they modernized. Instead, we've seen that the BMP format has persisted in certain areas of the world and shows no signs of being replaced. Our own research has demonstrated the impact that the lack of BMP support has on our ability to process the visual landscape of those regions, so we've gone ahead and enabled BMP support for the VGKG as well.
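When a server returns WEBP or BMP regardless of what was requested, a crawler has to recognize the format from the bytes it actually received. The sketch below shows one common way to do that, by checking the well-known signature at the start of each file. It is only an illustration: the function names and the idea that the VGKG identifies formats this way are assumptions, and the accepted set simply mirrors the four formats named above.

```python
# Formats named in the text as accepted after this update.
ACCEPTED_FORMATS = {"jpeg", "png", "webp", "bmp"}


def sniff_image_format(data: bytes):
    """Guess an image format from its leading bytes, or return None."""
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data[:2] == b"BM":
        return "bmp"
    return None


def is_accepted(data: bytes) -> bool:
    """True if the payload looks like one of the accepted formats."""
    return sniff_image_format(data) in ACCEPTED_FORMATS
```

Checking the payload rather than the URL extension or the Content-Type header matters in exactly the situation described above, where a server may deliver WEBP even though a JPEG or PNG was requested.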

Finally, the VGKG crawlers have been upgraded to our new GDELT 3.0 Global Fleet Architecture, which should substantially reduce the time it takes us to crawl and process global news imagery, allowing us to offer near-realtime visual indicators about global events.