Our Journey Towards User-Facing Vector Search: Evaluating Elasticsearch's ANN Vector Search RAM Costs

As we continue our journey towards offering realtime user-facing semantic search over our growing collection of embedding datasets, we are evaluating various ANN (Approximate Nearest Neighbor) solutions designed for this kind of workload. In the past we've demonstrated clustering and search via BigQuery at mass scale and via Python scripts at smaller scale or further out of band, but for realtime search we need a scalable online ANN solution. Given Elasticsearch's near-ubiquitous presence as a ready-made, scalable enterprise search platform and its growing support for embedding-based vector search, let's look at the theoretical numbers for what it would take to support realtime search over some of our embedding datasets.

Elasticsearch supports true ANN vector search via Lucene's HNSW indexes (it also supports exact brute-force matching, but that is simply intractable at scale). Most of our current embedding datasets are either USEv4's 512-dimension text embeddings, Gecko's 768-dimension text embeddings, or Vertex Multimodal's 1408-dimension vectors. According to Elastic's documentation, to maintain optimal performance, Lucene's HNSW algorithm requires the entire vector index to fit within the OS page cache, meaning there must be sufficient RAM for the index to remain memory resident. Thus, unlike disk-based inverted indexes, there is a much higher cost to offering vector search.
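
For reference, here is a minimal sketch (in Python, assuming Elasticsearch 8.x, a hypothetical index name, and Lucene's default HNSW parameters) of the dense_vector mapping that enables HNSW indexing for our Gecko 768-dimension embeddings:

    from elasticsearch import Elasticsearch

    # Hypothetical local cluster and index name, for illustration only.
    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="gecko-embeddings",
        mappings={
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": 768,              # Gecko text embeddings
                    "element_type": "float",  # 4-byte floats; "byte" stores 1-byte quantized vectors
                    "index": True,            # build the Lucene HNSW graph for ANN search
                    "similarity": "cosine",
                    "index_options": {"type": "hnsw", "m": 16, "ef_construction": 100},
                }
            }
        },
    )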

Elasticsearch supports both float32 (4 bytes per dimension) and quantized byte (1 byte per dimension) vectors via the dense_vector field's element_type, and can even perform the quantization itself at index time. This yields two formulas for the RAM required to hold the index in the page cache:

  • Native: num_vectors * 4 * (num_dimensions + 12)
  • Quantized: num_vectors * (num_dimensions + 12)
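
As a quick sanity check, these two formulas reduce to a one-line helper (a sketch in Python; the function name is our own):

    def hnsw_page_cache_bytes(num_vectors: int, num_dimensions: int, quantized: bool = False) -> int:
        """Estimated page-cache footprint of a Lucene HNSW index, per Elastic's sizing guidance."""
        bytes_per_dimension = 1 if quantized else 4
        return num_vectors * bytes_per_dimension * (num_dimensions + 12)

    print(hnsw_page_cache_bytes(2_000_000_000, 768) / 1e12)                  # 6.24 TB float32
    print(hnsw_page_cache_bytes(2_000_000_000, 768, quantized=True) / 1e12)  # 1.56 TB quantized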

How do these line up with our own datasets? One of our textual datasets, encoded as USEv4 512-dimension vectors, currently holds 2 billion embeddings and is a candidate for upgrading to Gecko 768-dimension vectors, while a separate dataset under evaluation will eventually yield 3 billion multimodal 1,408-dimension vectors. Thus, using the above math we get:

  • 2 Billion @ 512-Dim Float32: 2,000,000,000 * 4 * (512 + 12) = 4.192TB RAM
  • 2 Billion @ 512-Dim Quantized: 2,000,000,000 * (512 + 12) = 1.048TB RAM
  • 2 Billion @ 768-Dim Float32: 2,000,000,000 * 4 * (768 + 12) = 6.24TB RAM
  • 2 Billion @ 768-Dim Quantized: 2,000,000,000 * (768 + 12) = 1.56TB RAM
  • 3 Billion @ 1408-Dim Float32: 3,000,000,000 * 4 * (1408 + 12) = 17.04TB RAM
  • 3 Billion @ 1408-Dim Quantized: 3,000,000,000 * (1408 + 12) = 4.26TB RAM
  • Combining both Gecko and Multimodal Datasets @ Float32: 23.28TB RAM
  • Combining both Gecko and Multimodal Datasets @ Quantized: 5.82TB RAM
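
Feeding the same dataset sizes into the helper sketched above reproduces the combined totals:

    gecko        = hnsw_page_cache_bytes(2_000_000_000, 768)
    multimodal   = hnsw_page_cache_bytes(3_000_000_000, 1408)
    print((gecko + multimodal) / 1e12)          # 23.28 TB float32

    gecko_q      = hnsw_page_cache_bytes(2_000_000_000, 768, quantized=True)
    multimodal_q = hnsw_page_cache_bytes(3_000_000_000, 1408, quantized=True)
    print((gecko_q + multimodal_q) / 1e12)      # 5.82 TB quantized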

Thus, even our smaller dataset requires more than 1TB of RAM, while hosting both datasets combined will require nearly 24TB of RAM at native float32 resolution or almost 6TB quantized.

Remember that these numbers refer to the page cache, which comes ON TOP of the JVM heap. For data-intensive Elasticsearch clusters, Elastic typically recommends 64GB of RAM per node, with 50% devoted to the JVM heap (though no more than 32GB due to compressed object pointers and GC issues). Thus, if we were to fit the entire dataset into a single VM, the M2 machine family could hold the quantized dataset, but no single system could hold the native float32 dataset. In such a single-node SSI configuration, the 32GB Elasticsearch JVM heap would be inconsequential compared to the page cache. If instead we chose the more likely route of a cluster of smaller VMs collectively totaling 6TB of RAM of page cache for the quantized datasets, each node would still require its own 32GB JVM heap, potentially vastly increasing our total memory footprint. For example, if we used 64GB RAM nodes as our base, half of each VM would be devoted to the JVM heap, leaving just 32GB for page cache, meaning we would need 6TB / 32GB = 188 such nodes just to fit a single copy of the entire dataset into RAM with zero redundancy. Comparing monthly GCP pricing for these configurations:

  • M2-hypermem-416 (8.8TB RAM) with a 6TB PD-SSD: $50,118
  • 188 N2 2-vCPU 64GB RAM nodes with 70GB PD-SSDs each: $268 * 188 = $50,384
  • 6 N2 2-vCPU 1024GB RAM nodes with 1100GB PD-SSDs each: $3,559 * 6 = $21,353
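
The node-count and cost arithmetic above reduces to a few lines (a sketch only; the dollar figures are simply the monthly list prices quoted above):

    import math

    heap_gb = 32            # per-node JVM heap, per Elastic's guidance
    page_cache_tb = 6.0     # quantized Gecko + Multimodal page-cache target

    # 64GB nodes: half of each node goes to heap, leaving 32GB apiece for page cache.
    nodes_64gb = math.ceil(page_cache_tb * 1000 / (64 - heap_gb))
    print(nodes_64gb)              # 188 nodes
    print(nodes_64gb * 268)        # ~$50,384/month for the 188-node cluster
    print(6 * 3559)                # roughly the ~$21,353/month quoted for six 1TB-RAM nodes
    print(50118)                   # $50,118/month for the single m2-hypermem-416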

Another question is whether Local SSD is sufficiently performant for HNSW's random IO patterns to bridge the gap between traditional PD-SSD and RAM. Would Elasticsearch on a standard-memory VM with Local SSD achieve sufficient IOPS to yield reasonable performance without holding the entire dataset in RAM? We'll be exploring such questions in forthcoming benchmarks.

While quantization is a widely deployed optimization whose accuracy reductions go unnoticed in most use cases, it is an open question whether the high-precision applications of our specific user communities would be sufficiently negatively impacted that quantization is not an option for us. This will require further investigation as well.