We've run Elasticsearch clusters on GCP for almost a decade across many different iterations of hardware and cluster configurations. What lessons have we learned, and what does Elastic itself recommend for running on GCP? Let's start with what Elastic deploys on GCP for its own Elastic Cloud.
The company publishes a guide to its various node types and their underlying hardware configurations. Their standard hot data node consists of 68GB RAM (64GB RAM + 4GB utility overhead), 10, 16 or 32 N2 vCPUs, Local SSD attached via NVMe, and a disk:memory ratio of either 45:1 or 95:1, meaning that for each 1GB of RAM the node houses 45GB or 95GB of data. Their CPU-dense hot data configuration offers 68GB RAM, 32 N2 vCPUs and 68×45 = 3.06TB of disk; their storage-dense hot data configuration offers 68GB RAM, 10 N2 vCPUs and 68×95 = 6.46TB of disk; and their warm data configuration offers 68GB RAM, 10 N2 vCPUs and 68×190 = 12.92TB of spinning disk (pd-standard). Note that hot nodes use ephemeral Local SSD for maximal performance, which means their data does not survive a node restart, so warm nodes use persistent spinning disk (pd-standard) instead.
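To make the ratio arithmetic concrete, here's a minimal sketch (the RAM figure and ratios are the ones quoted above; the function name is just illustrative) deriving each node's disk capacity from its RAM and disk:memory ratio:

```python
# Sketch: derive node disk capacity from RAM and the disk:memory ratio.
# RAM figure and ratios come from the configurations above.

def node_disk_gb(ram_gb: float, disk_to_memory_ratio: int) -> float:
    """Each 1GB of RAM backs `disk_to_memory_ratio` GB of data on disk."""
    return ram_gb * disk_to_memory_ratio

RAM_GB = 68  # 64GB usable + 4GB utility overhead

print(node_disk_gb(RAM_GB, 45))   # CPU-dense hot:      3060 GB ~= 3.06TB
print(node_disk_gb(RAM_GB, 95))   # storage-dense hot:  6460 GB ~= 6.46TB
print(node_disk_gb(RAM_GB, 190))  # warm (pd-standard): 12920 GB ~= 12.92TB
```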
Note that while the nomenclature can be a bit confusing, each of the nodes above is a single SSI VM (ie, the actual base machine type equivalent of the 16-core 68GB node is an n2-standard-16 – though here they actually run custom machine shapes) – in other words, an 8 vCPU node is not a cluster of 4 separate 2-vCPU VMs, it is a single 8-core VM. These larger nodes are then used as shared-tenant environments, with multiple customer deployments layered onto each VM and billing based on support level and an hourly per-GB-of-RAM rate (see calculator and pricing table). In other words, Elastic Cloud's hot data fleet consists of a set of fixed-size VMs onto which customer projects are layered, with each VM hosting multiple customer deployments charged by GB of RAM.
The base GCP price of a CPU-dense hot data node (68GB RAM, 32 N2 vCPUs, 8 Local SSD disks = 3TB disk) is $1,024/month (assuming a 20GB boot disk), while the base GCP price of a storage-dense hot data node (68GB RAM, 10 N2 vCPUs, 16 Local SSD disks = 6TB disk) is $841/month. Note that NVMe-attached Local SSD achieves 680K read / 360K write IOPS and 2,650MB/s read / 1,400MB/s write throughput at 4-8 disks, then jumps to 1.6M read / 800K write IOPS and 6,240MB/s read / 3,120MB/s write throughput at 16 disks. An N2 VM caps out at 24 attached Local SSD disks, achieving 2.4M read / 1.2M write IOPS and 9.36GB/s read / 4.68GB/s write throughput. (see performance table)
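Because those Local SSD figures step up with disk count rather than scaling smoothly, a small sketch of the quoted tiers (values copied from the numbers above; the dictionary layout is ours) makes the jumps easier to compare:

```python
# Sketch: NVMe Local SSD performance tiers quoted above, keyed by disk count.
# Tuples are (read IOPS, write IOPS, read MB/s, write MB/s) per VM, not per disk.
LOCAL_SSD_TIERS = {
    8:  (680_000,   360_000,   2_650, 1_400),
    16: (1_600_000, 800_000,   6_240, 3_120),
    24: (2_400_000, 1_200_000, 9_360, 4_680),  # N2 maximum
}

for disks, (r_iops, w_iops, r_mbps, w_mbps) in LOCAL_SSD_TIERS.items():
    print(f"{disks:>2} disks: {r_iops/1e6:.2f}M/{w_iops/1e6:.2f}M IOPS, "
          f"{r_mbps/1000:.2f}/{w_mbps/1000:.2f} GB/s (read/write)")
```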
At 24 Local SSD disks, to achieve the same results with Persistent SSD (pd-ssd) disks, which max out at 15K IOPS and 240MB/s throughput per 2-core VM, we have a few options (the arithmetic is sketched after this list):
- 2.4M Read IOPS: 160 2-core N2 VMs, which would yield 2.4M read and write IOPS and 38.4GB/s read and write throughput.
- 1.2M Write IOPS: 80 2-core N2 VMs, yielding 1.2M read and write IOPS and 19.2GB/s throughput.
- 9.36GB/s Read: 39 2-core N2 VMs
- 4.68GB/s Write: 20 2-core N2 VMs
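The VM counts above come from dividing each Local SSD figure by the quoted per-2-core pd-ssd limits; a quick sketch of that arithmetic, using only the numbers already given:

```python
import math

# Per-VM pd-ssd limits quoted above for a 2-core N2 VM.
PD_SSD_IOPS_PER_VM = 15_000
PD_SSD_MBPS_PER_VM = 240

# Targets: what 24 NVMe Local SSD disks on a single N2 VM deliver.
targets = {
    "2.4M read IOPS":  (2_400_000, PD_SSD_IOPS_PER_VM),
    "1.2M write IOPS": (1_200_000, PD_SSD_IOPS_PER_VM),
    "9.36GB/s read":   (9_360,     PD_SSD_MBPS_PER_VM),   # in MB/s
    "4.68GB/s write":  (4_680,     PD_SSD_MBPS_PER_VM),   # in MB/s
}

for name, (target, per_vm) in targets.items():
    print(f"{name}: {math.ceil(target / per_vm)} x 2-core N2 VMs with pd-ssd")
# -> 160, 80, 39 and 20 VMs respectively
```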
Intel's own benchmarks suggest that N2 VMs on 3rd Gen (Ice Lake) processors achieve a 1.29x throughput and 1.36x indexing speedup over 2nd Gen (Cascade Lake) N2 VMs when running on n2-highmem-8 shapes. GCP itself publishes generalized CoreMark benchmark scores for all of its standard machine types. Elastic has also offered Arm-based deployments for Elastic Cloud on AWS since 2021.
Elasticsearch recommends for typical search use cases that shards be relatively evenly sized at around 10-50GB and roughly 200M "docs" each, with an upper bound of around 2B docs per shard, and around 1GB of heap per 3,000 indices on the master node. RAM per node typically maxes out at 64GB because the JVM heap shouldn't exceed 32GB in order to keep using compressed object pointers and to minimize GC issues. Some use cases may benefit from more RAM for additional filesystem caching, but the JVM heap itself remains bounded at 32GB.
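As a rough illustration of that sizing guidance, here's a hypothetical helper (the function name, defaults and example figures are ours, not Elastic's) that picks a shard count from total data volume and document count:

```python
import math

def estimated_shard_count(total_data_gb: float, total_docs: int,
                          target_shard_gb: float = 50,
                          max_docs_per_shard: int = 200_000_000) -> int:
    """Rough sketch of the sizing guidance above: keep shards in the
    10-50GB range and at around 200M docs each, whichever binds first."""
    by_size = math.ceil(total_data_gb / target_shard_gb)
    by_docs = math.ceil(total_docs / max_docs_per_shard)
    return max(by_size, by_docs)

# e.g. 2TB of logs containing 5B small documents:
print(estimated_shard_count(2_000, 5_000_000_000))  # -> max(40, 25) = 40 shards
```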
In terms of indexing performance, Elasticsearch suggests that bulk requests of around a few tens of MB are typically the most performant so long as memory pressure is not excessive, while increasing the refresh interval to 60 seconds where possible typically yields a good mix of fresh data and optimized indexing. Keeping a strict, consistent field ordering across documents can also help, which may require special flags in the JSON generator library, given that many libraries randomize field ordering due to the use of hashes or other unordered structures under the hood.
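A minimal sketch of those suggestions, assuming the elasticsearch-py 8.x client and a hypothetical `logs-demo` index: bulk requests are capped at roughly 20MB, the refresh interval is relaxed to 60 seconds during loading, and every document is serialized with the same field order:

```python
from elasticsearch import Elasticsearch, helpers

# Hypothetical client and index names; adjust to your own deployment.
es = Elasticsearch("http://localhost:9200")
INDEX = "logs-demo"

# Relax the refresh interval to 60s while bulk loading (the default is 1s);
# restore it afterwards if near-real-time search is needed.
es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "60s"}})

def actions(docs):
    """Yield bulk actions with a stable field order for every document."""
    for doc in docs:
        # Python dicts preserve insertion order, so sorting keys once gives
        # the serializer a consistent field ordering across all documents.
        yield {"_index": INDEX, "_source": dict(sorted(doc.items()))}

docs = ({"ts": i, "level": "info", "msg": f"event {i}"} for i in range(100_000))

# Cap each bulk request at ~20MB, in line with the "a few tens of MB" guidance.
helpers.bulk(es, actions(docs), chunk_size=5_000, max_chunk_bytes=20 * 1024 * 1024)
```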