GCP Tips & Tricks: Observations On A Decade Of Running Elasticsearch On GCP: Part 3 – Future Storage Options

We've run Elasticsearch clusters on GCP for almost a decade across many different iterations of hardware and cluster configurations. Earlier this week we examined the hardware Elastic itself deploys on GCP for its Elastic Cloud and compared it with our own deployments over the past decade. We've also looked at the potential of Extreme PDs and Hyperdisks. Putting this all together, what does the future hold for the storage architecture of our forthcoming next-generation Elasticsearch fleet?

While Elasticsearch embarked upon its stateless journey in 2022, internally we have run stateless mass-scale search infrastructure since the very beginning: GCS serves as a globally-distributed index store, while dynamically-sized cluster fleets append, update and search it, bringing to bear every imaginable machine shape, accelerator and external GCP API. In fact, if we think of transformational APIs like Cloud Vision, Cloud Video and Cloud Speech-to-Text as indexing APIs and Vertex AI and its predecessors as search APIs, we've been building advanced indexing and search pipelines since the dawn of GCP's AI journey, while BigQuery has played a nexus role since we first began using GCP.
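As a minimal sketch of that pattern (with a placeholder bucket and object layout rather than our actual one, and assuming the google-cloud-storage Python client), indexing workers publish finished segments to a shared bucket and stateless search workers simply enumerate and fetch whatever is there:

```python
# Minimal sketch of the GCS-as-shared-index-store pattern described above.
# Bucket and object names are hypothetical placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-index-store")  # hypothetical bucket name


def publish_segment(local_path: str, segment_name: str) -> None:
    """Upload a newly built index segment so any worker in the fleet can fetch it."""
    bucket.blob(f"segments/{segment_name}").upload_from_filename(local_path)


def list_segments() -> list[str]:
    """Enumerate the shared segments a stateless search worker should load."""
    return [blob.name for blob in client.list_blobs(bucket, prefix="segments/")]
```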

At the same time, Elasticsearch has long played a critical role in powering our externally-facing interactive realtime keyword search. In contrast to Elastic Cloud's reliance on larger hot data nodes of 10, 16 or 32 N2 vCPUs with 68GB of RAM and 6 or 12 Local SSDs, all of our fleets in recent years have used a base unit of 2-vCPU N1 VMs with 32GB of RAM, backed by 500GB PD SSDs for the core fleets and 2TB PD SSDs for auxiliary storage-dense fleets with less stringent latency requirements. Our core fleets consistently achieve their theoretical maximums of 15,000 IOPS and 240 MB/s of read/write throughput per VM.
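Those per-VM figures follow directly from GCP's published per-GB rates for SSD persistent disks (roughly 30 IOPS and 0.48 MB/s per provisioned GB). The back-of-the-envelope sketch below is illustrative only; actual ceilings also depend on per-instance caps that vary by machine family and vCPU count.

```python
# Back-of-the-envelope check of the per-VM figures quoted above, using GCP's
# published per-GB rates for SSD persistent disks. Real ceilings also depend
# on per-instance caps that vary by machine family and vCPU count.
PD_SSD_IOPS_PER_GB = 30     # read or write IOPS per provisioned GB
PD_SSD_MBPS_PER_GB = 0.48   # MB/s of throughput per provisioned GB


def pd_ssd_disk_limits(size_gb: int) -> tuple[int, float]:
    """Return the (IOPS, MB/s) the disk itself can sustain at this size."""
    return size_gb * PD_SSD_IOPS_PER_GB, size_gb * PD_SSD_MBPS_PER_GB


print(pd_ssd_disk_limits(500))   # (15000, 240.0) -- the core-fleet numbers
print(pd_ssd_disk_limits(2000))  # (60000, 960.0) -- disk-side ceiling for the
                                 # 2TB nodes, reached only if the VM's own
                                 # instance limits allow it
```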

As we look to the future, what architectures might be available to us?

As we rapidly accelerate towards the deployment of our next-generation Elasticsearch fleet architecture, we are focusing primarily on the last two options. Our existing N1 nodes perform so strongly that incremental N2-over-N1 gains, combined with creative use of the latest Elasticsearch improvements, may be sufficient to simply promote our current storage-dense 2TB node unit to our primary fleet template. At the same time, the rise of more CPU-intensive search modes like vector search may further tilt the compute-versus-IO balance toward compute, shifting our focus from incorporating Local SSD to deploying larger-core architectures like Arm families.
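As one illustration of why vector search tilts that balance toward compute, the hypothetical query below (placeholder endpoint, index and field names, assuming the official elasticsearch Python client against an 8.x cluster) asks each shard to score many candidate embeddings, work that is CPU- and memory-bound rather than disk-bound; raising num_candidates improves recall at the cost of more compute per query.

```python
# Hypothetical kNN (vector) query; endpoint, index and field names are
# placeholders, and the 3-dimension vector is a toy value for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

resp = es.search(
    index="articles",                          # hypothetical index
    knn={
        "field": "embedding",                  # hypothetical dense_vector field
        "query_vector": [0.12, -0.03, 0.88],   # toy query embedding
        "k": 10,
        "num_candidates": 100,                 # more candidates = more CPU, better recall
    },
)
print(resp["hits"]["hits"])
```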