Building An ElasticSearch Cluster To Search The Planet

When it comes to exploring patterns in GDELT’s immense data archives stretching back two centuries, nothing on this earth can come close to the power of Google BigQuery for its ability to bring to bear raw brute force capable of table scanning a petabyte in just under 4 minutes.

What happens after you have that analysis in your hands that shows you that protest activity is on the upswing in a certain region or that hate speech against a particular religious group is sharply rising or that over the last 30 minutes rebel fighters have been deploying across the capital, taking up firing positions aimed at the presidential palace? In each of these cases you likely want to search the full text of the vast global monitoring streams that feed into GDELT to connect you back to the original sources.

We are excited to announce that we’ve been hard at work making this dream a reality. Last month marked the debut of our new ElasticSearch cluster powering our new Television Explorer, offering fulltext search of more than 2 million hours of American television news stretching back 8 years. The incredible success of this deployment means we’ve been working hard behind the scenes to launch the second generation GDELT’s fulltext search infrastructure, designed around ElasticSearch. This allows us to marry the immense petascale analytic capabilities of BigQuery with ElasticSearch's purpose-built fulltext search infrastructure. We’ll even be rolling out new integration points with BigQuery, allowing you to conduct fulltext searches using ElasticSearch and then take the list of matching documents and use them with GDELT BigQuery to conduct complex advanced analyses!


What does it take to deploy ElasticSearch on Google Cloud Platform?

Surprisingly, for something with as much power under the hood as ElasticSearch, deploying an ES cluster on the Google Compute Engine (GCE) platform is incredibly easy.

The first step began at the end of last summer when we began extensive benchmarking and schema design, looking at the kinds of queries we currently support through all of GDELT’s various search APIs, the kinds of queries that we’ve heard loud and clear from all of you that you want GDELT to support and some of our own ideas about where the future of understanding global society is heading. This led to more than half a year of testing an incredible number of different configurations and ideas.

The end result of all of this testing is a set of data schemas that are incredibly optimized, but even with all of this optimization, the sheer magnitude of the amount of data that GDELT monitors each day means that our queries still end up being IO limited rather than CPU bound. So, we knew from this that we needed to design a production cluster that absolutely maximized disk performance.

On the GCE platform, SSD Persistent Disks reach their maximum sustained throughput of 240MB/s reads, 240MB/s writes and 15,000 IOPS at a provisioned size of around 500GB for VMs of 15 or fewer cores. However, GCE instances are also subject to egress caps of 2Gbit/s per core (approximately 256MB/s). Add to this the fact that Persistent Disks on GCE achieve their durability by writing each piece of data 3.3 times – this means that for each byte written, you are actually generating 3.3 bytes of  output. On a per-core basis, this means that for each core, your instance is allocated 256MB/s divided by 3.3 bytes of sustained disk bandwidth, or approximately 78MB/s per core.

A single core instance using SSD Persistent Disks thus will reach a peak write speed of 78MB/s with just a 163GB disk, while a dual core instance will peak at 156MB/s write performance with a 326GB disk. Jumping to a four-core instance increases these maximums to 240MB/s write performance with a 500GB disk and any increase in cores beyond this results in no additional bandwidth gain (though IOPS will increase and additional cores offers an increased maximum RAM cap that can be used for memory caching).

Putting this together, the jump from 1 to 2 cores results in a linear doubling of bandwidth, while jumping from 2 to 4 cores yields just a 1.5x increase and any increase beyond 4 cores does not increase bandwidth at all.

Thus, for IO-bound workloads like ours, dual core “highmem” (13GB RAM) instances with a 326GB SSD Persistent Disk offers the best performance/cost ratio and has the added benefit of increasing resiliency by spreading our data across multiple instances to allow for replica sharding.

Moving forward we are also exploring creating a special set of read-only replica nodes that utilize NVMe Local SSD disks, which can achieve up to 2.65GB/s sustained read bandwidth and 680,000 IOPs. We’ve made extensive use of such machines for some of our other massively IO-intensive workloads and found them to provide truly astonishing sustained IO capacity under real world workloads. Of course, since Local SSD is at present not persistent and data is lost if the instance crashes, such nodes would have to be used strictly as read-only replica nodes.

In terms of schema design, after six months of benchmarking we settled on using date sharding, where we split our data into separate indices by date. Some types of data, like television, are sharded at an annual level, with each year of data being its own index, while some ultra-high intensity tables are created at the daily level.

For example, our forthcoming fulltext geographic search, will allow you to search for any keyword or phrase and get back a GeoJSON file, suitable for visualization and analysis in any GIS or mapping system like Carto. The output is a list of the top 4,000 locations most closely associated with the search term and for each location the top 5 most relevant articles relating to the search term and that location are returned. This means, however, that a single search must compile the full details on 4,000 locations x 5 articles per location = 20,000 records. That’s equivalent to running a search and requesting the top 20,000 records. It even exceeds the default limit of 10,000 results imposed by the main ES search pathway (“max_result_window”). This places extraordinary strain on the ES infrastructure and results in extremely high disk bandwidth to answer large queries. By storing the geographic index as a series of daily ES indices that have been tightly compacted and highly optimized, we are able to speed these queries up to near-realtime responsiveness in the typical case and place a fixed upper bound on pathological performance. Still other tables are segmented by week or other time intervals based on the specific workloads they endure.

Due to the nature of the kinds of aggregations we perform to service these queries, we’ve found that sharding the indices any further in most cases results in reduced performance, similar performance with a significant increase in resource consumption or modest performance increases that are far exceeded by the additional resources consumed. Instead, we’ve chosen a specific date interval for each class of search index that offers the best ratio of performance vs resource consumption and used the date sharding to spread the query load across all available cores as part of historical searches.

Putting this all together, over the coming weeks you will be seeing a flurry of incredibly exciting new announcements on these pages as we launch a whole new era of search, analytic and visualization capabilities for GDELT, with new BigQuery-powered analyses to let you look across the globe holistically and our new ElasticSearch cluster to let you explore the global coverage that underlie those analyses. Welcome to the future!