We've run Elasticsearch clusters on GCP for almost a decade across many different iterations of hardware and cluster configurations. Earlier this week we examined the hardware Elastic itself deploys on GCP for its Elastic Cloud and compared it with our own deployments over the past decade. We've also looked at the potential of Extreme PDs and Hyperdisks. Putting this all together, what does the future hold for the storage architecture of our forthcoming next-generation Elasticsearch fleet?
While Elasticsearch itself only embarked upon its stateless journey in 2022, internally we have run stateless mass-scale search infrastructure from the very beginning, using GCS as a globally distributed index store with dynamically sized cluster fleets appending, updating and searching it, bringing to bear every imaginable machine shape, accelerator and external GCP API to assist. In fact, if we think of transformational APIs like Cloud Vision, Cloud Video and Cloud Speech-to-Text as indexing APIs, and Vertex AI and its predecessors as search APIs, we've been building advanced indexing and search pipelines since the dawn of GCP's AI journey, with BigQuery playing a nexus role since we first began using GCP.
At the same time, Elasticsearch has long played a critical role in powering our externally-facing interactive realtime keyword search. In contrast to Elastic Cloud's reliance on larger hot data nodes with 10, 16 or 32 N2 vCPUs, 68GB of RAM and 6 or 12 Local SSDs, all of our fleets in recent years have standardized on a base unit of 2-vCPU N1 VMs with 32GB of RAM, using 500GB PD SSD disks for the core fleets and 2TB PD SSDs for auxiliary storage-dense fleets with less stringent latency requirements. Our core fleets consistently achieve their theoretical maximums of 15,000 IOPS and 240MB/s read/write throughput per VM.
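In fact, those 500GB disks are sized precisely to saturate their 2-vCPU hosts. The sketch below is a back-of-the-envelope aid, assuming GCP's published pd-ssd scaling of roughly 30 IOPS and 0.48MB/s per provisioned GB capped by the per-VM limits quoted above; it is an illustration rather than anything we run in production.

```python
# Back-of-the-envelope sketch: why a 500GB pd-ssd saturates a 2-vCPU N1/N2 VM.
# Assumes GCP's published pd-ssd scaling of ~30 IOPS and ~0.48 MB/s per
# provisioned GB, capped by the small-VM limits quoted in this post.

PD_SSD_IOPS_PER_GB = 30
PD_SSD_MBPS_PER_GB = 0.48
VM_IOPS_CAP = 15_000   # per-VM cap for a 2-vCPU N1/N2 shape
VM_MBPS_CAP = 240

def pd_ssd_limits(disk_gb: int) -> tuple[float, float]:
    """Effective (IOPS, MB/s) of a single pd-ssd of disk_gb GB on a 2-vCPU VM."""
    iops = min(disk_gb * PD_SSD_IOPS_PER_GB, VM_IOPS_CAP)
    mbps = min(disk_gb * PD_SSD_MBPS_PER_GB, VM_MBPS_CAP)
    return iops, mbps

for size_gb in (250, 500, 2000):
    iops, mbps = pd_ssd_limits(size_gb)
    print(f"{size_gb:>5} GB pd-ssd -> {iops:>6,.0f} IOPS, {mbps:>3.0f} MB/s")
# 500GB is the smallest disk that hits both per-VM caps, which is why our
# 2-vCPU core nodes consistently reach their theoretical maximums.
```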
As we look to the future, what architectures might be available to us?
- Locally Stateless GCS Storage. As discussed above, we use GCS as our primary index storage for a vast array of indexing and search requirements, but even with local PD SSD and Local SSD caching, the latency of truly locally stateless VM search is too high to power our frontline user-facing keyword search services at this time (a minimal sketch of this read-through caching pattern appears after this list). These are nonetheless our most scalable and flexible search environments, capable of petascale advanced AI analytics, such as deploying LSMs (Large Speech Models) to transcribe millions of hours of video across dozens of languages or computer vision to search petabytes of imagery, all while scaling the underlying hardware in realtime.
- RAM Disk. For our most demanding and latency-intolerant search applications (typically realtime, infrastructure-facing and supporting specific analytic tasks), we deploy fleets of VMs with very large amounts of RAM in support of RAM disk-based search indexes. Some support highly advanced and even exotic search capabilities that exploit the uniquely high random access performance of RAM disks. In GCP, the largest single-system-image RAM disk currently possible is 12TB under the M2 series, though we achieve far larger effective sizes by grouping VMs together and either performing virtualized RAID striping across them or sharding our data where possible. However, the cost of RAM restricts RAM disks to only the most latency-sensitive or random access-intensive applications.
- Local SSD. Local SSD offers the next-best performance after RAM disk. The number of Local SSD disks that can be attached to a given VM varies by machine type: N1, N2 and N2D VMs support up to 24 Local SSD disks for a total of 9TB of storage at 2.4M read IOPS, 1.2M write IOPS, 9.36GB/s read and 4.68GB/s write throughput; C3 supports 32 Local SSD disks for 12TB at 3.2M read and 1.6M write IOPS and 12.48GB/s read and 6.24GB/s write throughput; and the unique Z3 family supports 12 3TB drives for 36TB of disk, totaling 9.6M read and 4.8M write IOPS and 37.44GB/s read and 18.72GB/s write throughput.
- Extreme Persistent Disks & Hyperdisks. The two highest-performance persistent block storage options available on GCE are Extreme Persistent Disks and Hyperdisk Extreme. The former max out on N2 Ice Lake systems with 64+ vCPUs at 120K IOPS and 4GB/s read or 3GB/s write throughput, but are supported only for M2 (208 or 416 vCPUs), M3 and N2 (64 or 80 vCPUs on Cascade Lake, 64+ on Ice Lake) machine types – on all other VMs they fall back to standard SSD Persistent Disk (pd-ssd) performance (or the provisioned IOPS if lower). The latter tops out at 500,000 IOPS and 10GB/s read/write throughput on a C3 with 176 vCPUs, though since a single Hyperdisk Extreme volume maxes out at 350,000 IOPS and 5GB/s, reaching that figure requires attaching two Hyperdisk Extreme volumes to the same VM and sharding across both. Both disk types are available only for limited machine series and only achieve their top performance at very large vCPU counts. Spending the same 64 vCPUs on 32 2-vCPU N2 VMs with ordinary PD SSD disks would instead total 480,000 IOPS and 7.68GB/s read or write throughput – vastly higher than the PD Extreme disk, while the Hyperdisk Extreme equivalent of an 88-node 2-vCPU N2 cluster offers an astonishing 1.32 million IOPS and 21.12GB/s read/write throughput (this fleet-equivalence arithmetic is sketched after this list). For Elasticsearch deployments, the very large node sizes required to achieve the full performance of these disk types negate their practical use, since centralizing so much hardware in a single VM vastly reduces fault tolerance.
- Persistent SSD Disks. The cheapest durable and most widely supported storage option for Elasticsearch on GCP is Persistent SSD Disk (pd-ssd), which offers 15,000 read/write IOPS and 240MB/s read/write throughput on a 2-vCPU N1 or N2 VM. An 8-vCPU VM achieves the same IOPS but increases throughput to 800MB/s, though 4 2-vCPU VMs would collectively achieve 60,000 IOPS and 960MB/s. PD SSDs max out at 100,000 IOPS and 1.2GB/s throughput on a 64-vCPU VM, but the equivalent collection of 32 2-vCPU VMs would achieve 480,000 IOPS and 7.68GB/s. Matching the Z3's Local SSD performance would require 640 2-vCPU N2 VMs for IOPS and 156 VMs for throughput. Given that our current storage-dense Elasticsearch nodes use 2TB PD SSD disks, the equivalent Local SSD storage would require 6 Local SSDs, offering 680K read / 360K write IOPS and 2.65GB/s read and 1.4GB/s write throughput; achieving that same performance with PD SSD would require 46 2-vCPU VMs.
- Hybrid Local + Persistent SSD. In the past we've explored combining the best of both worlds, using bcache to position a cluster of writethrough Local SSDs fronting an equal-sized Persistent SSD on each node. In this scenario, upon boot, reads gradually fill the Local SSDs from the PD SSD until all read activity is served from the Local SSDs, while writes flow through to both. Initial tests were promising, but concerns over scalability and edge-case stability under node duress and crash scenarios gave us pause about deploying it in production, though we are now reevaluating this option (the bcache wiring is sketched after this list).
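For the locally stateless GCS option above, the underlying pattern is essentially a read-through cache: a query node first checks its local PD SSD or Local SSD cache for an index segment and falls back to GCS only on a miss. The sketch below is a minimal illustration of that pattern using the standard google-cloud-storage client; the bucket name, object layout and cache directory are hypothetical stand-ins rather than our production code, which layers on eviction, prefetching and sharding.

```python
# Minimal read-through cache sketch for GCS-backed index segments.
# Bucket name, object layout and cache directory are hypothetical.
from pathlib import Path
from google.cloud import storage

GCS_BUCKET = "example-index-store"       # hypothetical bucket holding index segments
CACHE_DIR = Path("/mnt/localssd/cache")  # hypothetical Local SSD / PD SSD cache mount

_client = storage.Client()

def fetch_segment(segment_name: str) -> Path:
    """Return a local path for an index segment, downloading from GCS on a miss."""
    local_path = CACHE_DIR / segment_name
    if local_path.exists():
        return local_path                       # cache hit: serve from local disk
    local_path.parent.mkdir(parents=True, exist_ok=True)
    blob = _client.bucket(GCS_BUCKET).blob(f"segments/{segment_name}")
    blob.download_to_filename(str(local_path))  # cache miss: pull from GCS
    return local_path
```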
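The fleet-equivalence comparisons in the Extreme PD, Hyperdisk and Persistent SSD items above all reduce to the same arithmetic: multiply the per-node limits of a 2-vCPU pd-ssd node by the number of such nodes that consume the same vCPU budget as one large VM. The snippet below simply works through the figures quoted above.

```python
# Fleet-equivalence arithmetic used above: many small pd-ssd nodes vs. one
# large VM with Extreme PD or Hyperdisk Extreme. Per-node figures are the
# 2-vCPU pd-ssd limits discussed earlier in this post.

NODE_IOPS = 15_000   # pd-ssd IOPS on a 2-vCPU N1/N2 VM
NODE_MBPS = 240      # pd-ssd throughput (MB/s) on a 2-vCPU N1/N2 VM

def fleet(nodes: int) -> tuple[int, float]:
    """Aggregate (IOPS, GB/s) across a fleet of 2-vCPU pd-ssd nodes."""
    return nodes * NODE_IOPS, nodes * NODE_MBPS / 1000

# 64 vCPUs as 32 two-vCPU nodes vs. one 64-vCPU N2 with Extreme PD (120K IOPS, 4GB/s).
print(fleet(32))   # -> (480000, 7.68)

# 176 vCPUs as 88 two-vCPU nodes vs. one 176-vCPU C3 with Hyperdisk Extreme (500K IOPS, 10GB/s).
print(fleet(88))   # -> (1320000, 21.12)

# Nodes needed to match the Z3's quoted Local SSD figures (9.6M read IOPS, 37.44GB/s).
print(9_600_000 // NODE_IOPS, round(37_440 / NODE_MBPS))   # -> 640 156
```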
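Finally, the hybrid option is conceptually a simple bcache arrangement: a Local SSD acts as a writethrough cache in front of the durable PD SSD, so reads warm the Local SSD over time while every write lands on the persistent disk. The sketch below shows roughly how a node bootstrap script might wire this up; the device paths and the bcache0 device name are illustrative assumptions, not our production tooling.

```python
# Illustrative bcache wiring for a Local SSD writethrough cache in front of a
# PD SSD. Device paths and the bcache0 name are assumptions; a real bootstrap
# script would discover devices dynamically and must run as root.
import subprocess

BACKING_DEV = "/dev/sdb"    # assumed: the durable PD SSD holding the Elasticsearch data
CACHE_DEV = "/dev/nvme0n1"  # assumed: a Local SSD used purely as a read cache

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Format the cache and backing devices and bind them in a single step; udev
# then registers the combined device as /dev/bcache0.
run(["make-bcache", "-C", CACHE_DEV, "-B", BACKING_DEV])

# bcache defaults to writethrough, but we set it explicitly so every write is
# persisted to the PD SSD while reads gradually fill the Local SSD.
with open("/sys/block/bcache0/bcache/cache_mode", "w") as f:
    f.write("writethrough\n")

# /dev/bcache0 would then be formatted and mounted as the Elasticsearch data
# path (mkfs/mount steps omitted here).
```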
As we rapidly accelerate towards the deployment of our next-generation Elasticsearch fleet architecture, we are focusing primarily on the last two options. Our existing N1 nodes perform so strongly that the incremental improvements of N2 over N1, combined with creative use of the latest Elasticsearch improvements, may be sufficient to let us simply promote our current storage-dense 2TB node unit to our primary fleet template. At the same time, the rise of more CPU-intensive search modes like vector search may shift the balance further towards compute over IO, changing our focus from incorporating Local SSD to deploying higher core count architectures such as GCP's Arm families.