More Lessons From GDELT 3.0's Crawler Architectural Reimagination

Last week we shared updates on some of the changes we're making under the hood to GDELT 3.0's server and crawler architectures. As our extensive benchmarking under real-world loads continues, we are evolving these new architectures further, and along the way we've learned a number of interesting lessons.

Our GDELT 2.0 core server architecture revolved around single-threaded asynchronous poll-based servers that worked extremely well within a simple and easily maintained design. However, as we discussed last week, the intense demands of GDELT 3.0 outpaced that architecture and we migrated to a multiprocess architecture for better scaling. We initially intended to use a classic shared-socket design in which a single socket is bound to the incoming port and all threads compete to accept connections from it, but under certain intense load conditions we still observed scaling issues. The final iteration, which has worked extremely well to date under even our heaviest loads, is a truly multiprocess rather than threaded server architecture that uses SO_REUSEPORT to allow each independent process to bind its own socket to the same port. A shared memory pool with zone-specific locks provides the minimal coordination needed for critical tasks. This architecture has the additional benefit of allowing us to dynamically scale the number of server processes up and down in realtime in response to significant load changes, without the fragmentation and scaling issues of threaded architectures. Most interestingly, since each server process is a true Unix process, we can perform rolling upgrades of our servers by slowly swapping out each process with an upgraded codebase and observing its behavior side-by-side with the previous codebase in a live environment, or insert instrumented processes into a live server array to diagnose a given behavior or issue.
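
For readers unfamiliar with SO_REUSEPORT, the minimal sketch below illustrates the general pattern in C on Linux. It is not drawn from our production code: the port number, worker count and trivial response handler are purely illustrative. Each worker is forked as a full Unix process, creates its own listening socket, sets SO_REUSEPORT and binds to the shared port, and the kernel then distributes incoming connections across the workers' independent sockets.

```c
/* Minimal sketch of an SO_REUSEPORT process array (Linux).  Port, worker
 * count and the response are illustrative placeholders only. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define PORT    8080
#define WORKERS 4

static void serve(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* SO_REUSEPORT lets every worker bind its own socket to the same port;
     * the kernel load-balances new connections across all such sockets. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        exit(1);
    }
    listen(fd, 128);

    for (;;) {
        int conn = accept(fd, NULL, NULL);
        if (conn < 0)
            continue;
        char msg[64];
        int n = snprintf(msg, sizeof(msg), "served by pid %d\n", getpid());
        write(conn, msg, (size_t)n);
        close(conn);
    }
}

int main(void) {
    /* Each worker is an ordinary Unix process, so any one of them can be
     * stopped and replaced by a worker built from a newer codebase, or an
     * instrumented worker can be added, without disturbing the rest. */
    for (int i = 0; i < WORKERS; i++) {
        if (fork() == 0) {
            serve();
            exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}
```

Because each worker owns its own socket and process, the rolling-upgrade and live-instrumentation workflow described above falls out naturally: retire one worker, start its replacement, and the kernel simply stops routing new connections to the old socket.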

On the crawler front, each single-core VM now houses an array of independent processes, each of which acts as a crawler "thread" using blocking IO and a minimal shared memory pool for coordination. A pool of infrastructure processes interfaces with the various fleetwide coordination servers and GCS, keeping the crawlers fed from frontier caches while streaming their results to the relevant ingest fabrics. The use of independent processes rather than threads allows us to dynamically tune the number of active crawling and interface processes live in ways that would be difficult under a threaded model. The most powerful benefit of this architecture, however, is that since crawling threads are now independent processes coordinated through shared memory, those processes do not need to be related. Thus, an image crawler, a difference crawler, an article crawler and multiple other types of crawlers can all run concurrently on the same VM, coordinating their outbound traffic to enforce strict rate limiting by domain and server IP set while sharing the same infrastructure processes. It also means that any diagnostic, sidecar, monitoring or tuning application can simply hook directly into these shared memory segments to seamlessly observe and interact with the live traffic of a crawler VM, while upgraded codebases and instrumentation processes can be run alongside production processes.
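
To make the shared-memory coordination idea concrete, the sketch below shows one possible way unrelated processes on the same VM could enforce a per-domain rate limit through a shared pool with zone-specific locks, using POSIX shared memory and process-shared mutexes. The segment name, zone count, hashing scheme and timing policy are illustrative assumptions, not our actual implementation, and error handling and the startup race between the creating process and later attachers are omitted for brevity.

```c
/* Sketch of cross-process coordination through a shared memory pool with
 * zone-specific locks, here used for a simple per-domain fetch gap.
 * Compile on Linux with -lpthread (and -lrt on older glibc). */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define ZONES    64                 /* hash buckets, each with its own lock */
#define SHM_NAME "/crawler_pool"    /* illustrative segment name            */

struct zone {
    pthread_mutex_t lock;   /* process-shared lock guarding this zone only */
    time_t last_fetch;      /* last fetch time for domains hashed here     */
};

struct pool {
    struct zone zones[ZONES];
};

/* Create (first caller) or attach to (later callers) the shared pool. */
static struct pool *pool_attach(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_EXCL | O_RDWR, 0600);
    int creator = (fd >= 0);
    if (!creator)
        fd = shm_open(SHM_NAME, O_RDWR, 0600);
    ftruncate(fd, sizeof(struct pool));

    struct pool *p = mmap(NULL, sizeof(struct pool),
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (creator) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* PTHREAD_PROCESS_SHARED lets unrelated processes use the lock. */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        for (int i = 0; i < ZONES; i++) {
            pthread_mutex_init(&p->zones[i].lock, &attr);
            p->zones[i].last_fetch = 0;
        }
    }
    return p;
}

/* Any crawler process (image, article, difference, ...) calls this before a
 * fetch; only the single zone for that domain is locked, so contention stays
 * minimal even with many processes sharing the pool. */
static int may_fetch(struct pool *p, const char *domain, int min_gap_secs) {
    unsigned h = 0;
    for (const char *c = domain; *c; c++)
        h = h * 31u + (unsigned char)*c;
    struct zone *z = &p->zones[h % ZONES];

    pthread_mutex_lock(&z->lock);
    time_t now = time(NULL);
    int ok = (now - z->last_fetch) >= min_gap_secs;
    if (ok)
        z->last_fetch = now;
    pthread_mutex_unlock(&z->lock);
    return ok;
}

int main(void) {
    struct pool *p = pool_attach();
    if (may_fetch(p, "example.com", 5))
        printf("pid %d: fetching example.com\n", getpid());
    else
        printf("pid %d: rate limited, requeueing\n", getpid());
    return 0;
}
```

Because the pool lives in a named shared memory segment rather than inside any one process, a monitoring or tuning sidecar can attach to the same segment read-only and observe the live state without touching the crawlers themselves, which is the property that makes the diagnostic and rolling-upgrade workflows described above possible.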