The Challenges Of Single-Threaded Socket Polling For Data-Intensive Crawlers

GDELT's globally distributed crawler fleets have historically used single-core VMs to absolutely maximize the aggregate number of distinct IP addresses and resources associated with our crawlers as well as to provide enhanced failover capability. In particular, public clouds tend to prioritize physical distribution of an app, enforcing per-VM IO limits that mean multiple VMs can often achieve higher performance than a single VM with the same specifications. In this way, 64 single-core VMs can achieve dramatically higher networking performance than a single 64-core system when performing certain kinds of activity. Moreover, a large fleet of single-core VMs distributed across many data centers can be designed to be highly fault tolerant and self-healing.

Since our VMs are single-core systems (technically they are actually a hyperthread rather than a physical core) and since the resulting tasks we must perform on each article/image/etc requires substantial CPU that overlaps well with network latency, our crawling architecture has historically revolved around single-threaded crawlers using IO polling. For example, while a crawler is waiting on a number of in-flight image or article requests, at any given moment at least one of those requests has completed and it can process it. The CPU required to process that request typically overlaps well with the completion rate of outstanding requests, meaning the crawlers are never stalling. Each VM runs one of each of several different crawler types (images, articles, difference monitoring, etc) depending on its capabilities, geographic location and current frontier queue pressure, further overlapping the computational load on the VM.

In practice, this leads crawler VMs to be fully utilized with sufficient headroom to handle resource peaks. Yet, as we have continued to move more and more of our computation to the edge (for example adding GEMG metadata extraction to article crawlers and enhanced format, entropy and Unicode verification and repair checks to our image crawlers), the per-object CPU requirements have steadily increased. We've recently reached a point where the CPU time required to handle a single object can lead to data loss in certain circumstances. Specifically, the actual CPU time required to process a single article/image/etc is non-deterministic and can vary wildly based on the internal complexity of the object. For example, an HTML document can have deep internal structure that requires considerable time to render into a visual layout for article extraction, while an image can have characteristics that take certain calculations dramatically longer to complete. We typically can't detect these edge cases a-priori or even if we can prefilter to estimate the probability of their occurrence, it may be too late to throttle down incoming traffic due to the amount of data already in flight. Even after extensive tuning of the kernel network stack we still encounter edge cases where a problematic piece of content leads us to lose incoming data due to buffer overruns. No matter how large you make your buffers or set your backlog queues you will still encounter problems if your application can't read the data fast enough and extremely large buffers create other challenges.

Single-threaded polling architectures are wonderfully simplistic and highly efficient when properly built, but the growing edge processing we are performing means we have eclipsed the limits of such architectures. It is important to recognize too that by placing a lot of computational needs at the edge we differ from many other crawling applications in which crawlers simply ingest content with minimal processing and thus are lightweight highly threaded applications designed for minimal network latency and maximum parallelism. In our case because we push so much work to the edge (to take advantage of this large pool of processors and most importantly to minimize global data movement), our needs are quite different.

Thus, as part of our GDELT 3.0 architectural reimagination, we are moving to a new core crawler design that uses blocking IO and separate execution threads, separating network and computation into dedicated lifecycle portions of each request, as well as allowing the kernel scheduler more freedom. Preliminary tests have shown absolutely dramatic performance increases and a total elimination of data loss even at vastly greater ingest scales, not to mention considerably easier development and debug workflows with blocking IO and well-defined execution flows. We'll be slowing rolling out this new architecture so stay tuned!