VGKG Crawler Upgrade Complete Including New High Performance URL Cache

We're excited to announce that the complete reimagination of GDELT's server-crawler architecture has been put into production for the VGKG and has run for the past week with phenomenal results, fixing every single one of the limitations we faced with our previous architectures. As part of this upgrade we have also introduced a new extremely high performance URL caching architecture for our VGKG crawlers. Given that the majority of images we encounter each day in worldwide online news are novel, it doesn't make sense to build a true worldwide distributed caching fabric that maintains a strict record of every URL ever encountered to ensure we never crawl it again. Most importantly, given that even a small resizing of a few pixels, a recompression, or a watermark can alter an image's pixels enough that fast comparisons fail, it doesn't make sense to implement such robust caching.

Instead, historically, we used a rolling 24-hour cache with reordering to prevent us from crawling an image twice if we encountered it multiple times in different articles, under the assumption that most duplicates would repeat images seen within the last 24 hours. In essence, for each article's images, we would compare them against all images we had seen in the last 24 hours, and duplicate images were always moved to the front of the queue so they would not expire out. This worked well in most cases, but around 10-15% of images are cyclical duplicates with more than 24 hours between appearances. Typical examples include stock imagery associated with particular columns that might appear three times a week, twice a week, once a week, or every other week. For example, a column that appears every Friday or every Tuesday and Thursday. Since there are more than 24 hours between these columns, the associated stock imagery would expire out of our cache.
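
For readers curious about the mechanics, the sketch below shows one way a rolling 24-hour cache with move-to-front reordering like this can be implemented. It is a minimal illustration of the general technique, not our production code, and the class and method names are purely hypothetical.

```python
import time
from collections import OrderedDict


class RollingUrlCache:
    """Illustrative rolling cache: URLs expire after `ttl` seconds unless
    they are seen again, in which case they move back to the front of the
    queue so frequently repeated images never age out."""

    def __init__(self, ttl=24 * 3600):
        self.ttl = ttl
        self._entries = OrderedDict()  # url -> last-seen timestamp, oldest first

    def seen(self, url):
        """Return True if the URL is a duplicate within the window,
        refreshing its position so it does not expire."""
        now = time.time()
        self._expire(now)
        if url in self._entries:
            # Duplicate: move it to the most-recent end of the queue.
            self._entries.move_to_end(url)
            self._entries[url] = now
            return True
        self._entries[url] = now
        return False

    def _expire(self, now):
        # Entries are ordered oldest-first, so stop at the first live one.
        while self._entries:
            url, ts = next(iter(self._entries.items()))
            if now - ts > self.ttl:
                self._entries.popitem(last=False)
            else:
                break
```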

To address this, we are now using a rolling 14-day cache, which we may extend to as much as 30 days, built on an extremely high performance caching system that allows it to compare against two weeks of URLs in just a small fraction of the time it formerly took to compare against a single 24-hour period of URLs.
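
We have not detailed the internals of the new cache system here, but one common way to make a multi-week rolling URL cache cheap to maintain is to hash URLs into compact digests and bucket them by day, so that expiring an entire day of URLs is a single bucket drop and each lookup is a handful of constant-time set tests. The sketch below illustrates that general approach under those assumptions; the names are hypothetical and the actual production system may work quite differently.

```python
import hashlib
import time
from collections import deque


class BucketedUrlCache:
    """Illustrative 14-day rolling cache: URLs are hashed to fixed-size
    digests and stored in one set per day, so a lookup is at most `days`
    O(1) membership tests and expiry is simply dropping the oldest bucket."""

    def __init__(self, days=14):
        self.days = days
        self._buckets = deque()  # (day_number, set_of_digests), oldest first

    @staticmethod
    def _digest(url):
        # A fixed-size digest keeps memory bounded regardless of URL length.
        return hashlib.sha1(url.encode("utf-8")).digest()

    def seen(self, url, now=None):
        """Return True if the URL was seen within the window, refreshing it
        so cyclical duplicates (e.g. weekly columns) stay cached."""
        now = time.time() if now is None else now
        day = int(now // 86400)
        self._rotate(day)
        d = self._digest(url)
        duplicate = any(d in bucket for _, bucket in self._buckets)
        # Record (or refresh) the URL in today's bucket either way.
        self._buckets[-1][1].add(d)
        return duplicate

    def _rotate(self, day):
        # Open today's bucket if needed, then drop buckets older than the window.
        if not self._buckets or self._buckets[-1][0] != day:
            self._buckets.append((day, set()))
        while day - self._buckets[0][0] >= self.days:
            self._buckets.popleft()
```

Because every duplicate is re-recorded in the current day's bucket, a stock image that reappears every week stays in the window indefinitely, which is exactly the behavior the shorter 24-hour cache could not provide.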