One of the most incredible aspects of working in Google's cloud environment is that no matter what you need, chances are Google has already built a tool to do it. As we roll out GDELT 3.0 across our backend infrastructure, we've been particularly interested in increasing not only our realtime visibility into GDELT's myriad moving parts, but also our ability to combine realtime insights with a 30,000-foot view to understand macro-level trends, trace rare and complex issues, and gain a global perspective on things like dynamic server content optimization and intelligent user agent targeting.
GDELT's Visual Global Knowledge Graph (VGKG) is the first system to have transitioned over to GDELT 3.0, and over the past week we have been busy adding a global logging system that allows our crawler infrastructure to report back the realtime trends it sees as it monitors global news imagery. Our crawlers have always generated runtime logs, but now these logs are streamed in realtime to a central logging server that writes them into an internal BigQuery table every 15 minutes. This allows us to rapidly diagnose emerging issues in realtime, but also to look back historically to see when an issue first appeared, gauge its scope, and assess whether it is isolated or indicative of a broader shift.
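As a rough illustration of this kind of micro-batched log ingest, the sketch below buffers crawler log records and periodically loads them into a BigQuery table using the google-cloud-bigquery client. The project, table, and field names ("gdelt_logs.crawler_runtime", "host", "message", and so on) are hypothetical placeholders, not our actual internal schema.

```python
# Minimal sketch: buffer crawler log records and flush them into BigQuery
# in periodic batches (e.g. every 15 minutes). Table and field names are
# illustrative assumptions only.
import time
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.gdelt_logs.crawler_runtime"  # hypothetical table

buffered_rows = []  # log records accumulated since the last flush

def buffer_log(host, level, message):
    """Queue a single crawler log record for the next batch write."""
    buffered_rows.append({
        "ts": time.time(),
        "host": host,
        "level": level,
        "message": message,
    })

def flush_to_bigquery():
    """Write all buffered records into BigQuery and clear the buffer."""
    global buffered_rows
    if not buffered_rows:
        return
    errors = client.insert_rows_json(TABLE_ID, buffered_rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    buffered_rows = []
```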
In fact, just this past weekend we used this new infrastructure to examine how global news media CDNs handle the WEBP format and identified specific CDN networks that use intelligent user agent targeting to replace JPEG images with WEBP images on-the-fly for WEBP-compatible user agents. This means that if you fetch "image.jpg" using a user agent suggestive of WEBP support, these CDNs will replace the JPEG data with a WEBP image, so "image.jpg" will actually contain WEBP rather than JPEG data, while if you fetch it from a user agent that does not suggest WEBP support, you will get ordinary JPEG data back. Our crawlers actually ignore file extensions and instead perform multiple stages of content inspection, including unpacking the image into memory and fully validating it. Thus, having WEBP data behind a ".jpg" file extension isn't a problem for us, but it was something we wanted to understand better, especially as we work with collaborators like the Internet Archive to model the behavior of the global online news ecosystem. With all of our crawler log data now being ingested into BigQuery, we were able, with just a few SQL queries, to identify the specific hosts exhibiting this behavior and the circumstances under which the WEBP-as-JPEG substitution occurred, and to quickly trace it to intelligent user agent targeting.
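The query below is a hedged sketch of the kind of BigQuery analysis this enables, surfacing hosts where a fetched ".jpg" URL actually contained WEBP data, broken out by user agent. The table and column names (crawler_runtime, url, detected_format, user_agent_class) are illustrative assumptions, not our real log schema.

```python
# Sketch: find hosts where ".jpg" URLs returned WEBP data, grouped by the
# class of user agent used for the fetch. Schema names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  NET.HOST(url) AS host,
  user_agent_class,
  COUNT(*) AS hits
FROM `my-project.gdelt_logs.crawler_runtime`
WHERE url LIKE '%.jpg'
  AND detected_format = 'WEBP'
GROUP BY host, user_agent_class
ORDER BY hits DESC
"""

for row in client.query(QUERY).result():
    print(row.host, row.user_agent_class, row.hits)
```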
While most applications would typically use Stackdriver for this kind of centralized logging, our existing heavy integration with BigQuery made it the natural foundation. We are tremendously excited to see how this realtime-meets-historical analytics capability in BigQuery allows us to tackle challenges we've never before been able to even contemplate.
In similar fashion, we've been exploring various ways that our crawlers can stream data back to a central analysis system for short-term ingest, diagnosis, and trend analysis. For example, when we encounter a JPEG image that causes libjpeg to error, an image that passes preliminary content validation but fails more extensive verification tests, or, most interestingly, an image that one library considers valid but another treats as invalid, we want to be able to run automated diagnoses on these images to learn more about them. Given that the images we see in the wild can weigh in at tens of megabytes of pure binary data, and that at times we can generate 50MB/s+ of such diagnostic data per crawler, we needed a system that could reliably ingest streaming binary data effectively without limit. Our crawlers are designed to be extremely lightweight, typically featuring just a single core and a 10GB SSD boot disk. The small disk on each crawler effectively rules out caching files locally, both because of the limited space and because of the limited IO bandwidth of a 10GB disk (300 IOPS / 4.8MB/s). Thus, we needed an effectively infinite remote streaming solution that could scale linearly with the number of crawlers.
Once again, it turns out that Google already has a tool in place for this exact use case: Google Cloud Storage (GCS) Streaming Transfers. Essentially, you just open a piped handle to "gsutil cp - gs://yourbucket/yourfile" and live stream directly to a GCS file; everything else just happens magically. In our case we actually pipe through "mbuffer" to smooth our intensely bursty IO profile and generate larger write blocks for GCS, so our final pipeline looks like "| mbuffer -q -m 150M | gsutil -q cp - gs://bucket/file". Thus far this new approach is working fantastically and for all intents and purposes gives us infinite realtime streaming capability direct to storage, with full linear scaling and no IO management. Each of our crawlers simply opens a piped handle at startup and live streams to it as it proceeds, with the data flowing out to GCS as it goes. A larger central cluster can then read this data from GCS at its leisure and perform advanced diagnoses and analytics, without worrying about keeping up with streaming buffers or filling up a local disk; essentially, GCS becomes a giant infinite streaming buffer in the cloud that connects our frontier crawlers with our backend analytics. The same crawlers also run these images through Cloud Vision, meaning each crawler accesses an image, verifies it and screens it for size and visual complexity, dispatches it to Cloud Vision, and potentially streams it out via a GCS Streaming Transfer.
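A minimal sketch of that crawler-side pattern is shown below: open one long-lived pipe through mbuffer into "gsutil cp -" at startup, then write binary diagnostic data to it as crawling proceeds. The bucket and object names are placeholders; the buffer size follows the 150M figure above.

```python
# Sketch: live stream binary diagnostic data from a crawler to GCS through
# "mbuffer | gsutil cp -". Assumes gsutil and mbuffer are installed and the
# bucket/object name (a placeholder here) is writable.
import subprocess

GCS_TARGET = "gs://my-diagnostics-bucket/crawler-042.stream"  # hypothetical

# shell=True so the "|" pipeline is interpreted exactly as written above.
stream = subprocess.Popen(
    f"mbuffer -q -m 150M | gsutil -q cp - {GCS_TARGET}",
    shell=True,
    stdin=subprocess.PIPE,
)

def stream_blob(blob: bytes):
    """Append one binary diagnostic blob to the live GCS stream."""
    stream.stdin.write(blob)

def close_stream():
    """At crawler shutdown, close stdin so mbuffer/gsutil flush and
    finalize the GCS object."""
    stream.stdin.close()
    stream.wait()
```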
In our case we wrap each object inside an XML container and live stream a separate TOC file with the binary offsets of each object's metadata and its data blob in the main file. Each object's metadata also includes the MD5 of the binary blob since we already calculate that as part of our VGKG record. In the ordinary case the XML is skipped over and the TOC is used to directly navigate the file, but by using an XML container in the core storage file we are able to recover files in case of error by simply traversing the XML structure to reconstruct the TOC, which gives us critical additional format safety. By using this hybrid model we increase the file size only slightly, gain full format recoverability and maintain full live streaming ability with minimal overhead.
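The sketch below illustrates one way such a hybrid container could be written: each binary blob is wrapped in a small XML envelope in the main stream, while a separate TOC stream records the byte offsets (and MD5) needed to seek directly to each object. The element and field names here are illustrative only, not the actual VGKG format.

```python
# Sketch: append objects to a main stream inside XML envelopes while writing
# a parallel TOC of byte offsets and MD5s. Envelope/field layout is a
# hypothetical illustration of the hybrid model described above.
import hashlib
from xml.sax.saxutils import escape

main_offset = 0  # running byte offset within the main stream

def write_object(main_stream, toc_stream, object_id: str, blob: bytes):
    """Append one object to the main stream and record it in the TOC."""
    global main_offset
    md5 = hashlib.md5(blob).hexdigest()
    # The length attribute lets a recovery pass skip over the raw blob bytes
    # when reconstructing the TOC from the XML structure alone.
    header = (f'<object id="{escape(object_id)}" md5="{md5}" '
              f'length="{len(blob)}">\n').encode("utf-8")
    footer = b"\n</object>\n"

    record_start = main_offset          # offset of the object's metadata
    data_start = record_start + len(header)  # offset of the binary blob

    main_stream.write(header + blob + footer)
    main_offset += len(header) + len(blob) + len(footer)

    # TOC line: object id, metadata offset, blob offset, blob length, MD5.
    toc_stream.write(f"{object_id}\t{record_start}\t{data_start}\t"
                     f"{len(blob)}\t{md5}\n".encode("utf-8"))
```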
Using this exact same model, you could essentially build yourself a global web archiving system like the Internet Archive's simply by running a cluster of tiny one-core VMs performing frontier crawling and live streaming directly to GCS.
Once again, the power of the cloud means having an incredible library of building blocks at your fingertips that you can simply plug together to build immensely powerful and complex systems with just a few lines of code!