AI In Production: A Deep Dive Into The Costs Of Multimodal Embedding Search Over 3 Billion Images

As we continue our behind-the-scenes series looking at AI technologies in real-world production use cases, we've been estimating what it would cost to provide multimodal search across the entire Visual Explorer: 3 billion still images sampling the underlying broadcasts every 4 seconds. What would it take to make interactive realtime textual and image similarity search possible across this archive?

In the end, creating the embeddings themselves costs anywhere from $180,000 to $300,000 using one of the major commercially hosted APIs, plus several thousand dollars a month to keep them updated, while offering realtime vector search over the full dataset costs anywhere from $80,000 a month to more than $9M a month depending on the service and its capabilities, offering a reminder of the significant tractability limitations of realtime embedding search over large datasets. In fact, even the largest managed production-grade ANN vector search services enforce maximum vector counts, vector sizes, QPS caps or other limits that restrict datasets to just a small fraction of the sizes supported for traditional search, while very large embedding search remains more of an active research question than a solved COTS offering. Companies wishing to deploy ANN vector search with embeddings for semantic search or RAG generative AI retrieval will find that the current state of the art in realtime ANN either rules out large-scale embedding datasets entirely or imposes hardware requirements so significant that deploying at very large scale is impractical today. Indeed, large-scale ANN vector search is very much an active area of research, with datasets in the mere billions of records considered cutting edge for such search services. The nature of ANN solutions also makes it extremely difficult to offer the kind of timeseries analyses many GDELT applications require, with ANN algorithms devolving towards brute-force performance the more they are tuned to provide representative time series results (due to the need to return a large number of results above a given cutoff threshold for proper temporal binning).

Let's look at the cost breakdown of embedding 3 billion images and offering realtime ANN search over the full dataset in more detail.

First there is the cost of generating the actual embeddings themselves. GCP's Multimodal Embeddings cost $0.0001 per image, meaning that computing the embeddings for all 3 billion images would cost $300,000. AWS' Titan Multimodal Embeddings cost $0.00006/image, yielding a 3-billion-image total of $180,000, while Azure charges the same as GCP, for a total bill of $300,000. Thus, just creating the embeddings costs between $180,000 and $300,000 using a commercially hosted multimodal model. We currently live-process 45 channels, producing (45 channels * 86,400 seconds per day / 4 seconds per image) = 972,000 new still images per day. Under Azure and GCP's pricing, this would cost $97.20/day or $2,916/month, while AWS would cost $58.32/day or $1,749.60 a month. And if the underlying embedding model is ever deprecated, the entire dataset must be re-embedded from scratch.
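To make the arithmetic easy to reproduce, here's a minimal Python sketch of these calculations (the per-image prices are the list prices quoted above; the 30-day billing month is an assumption):

```python
# Back-of-the-envelope embedding-generation costs for the Visual Explorer archive.
TOTAL_IMAGES = 3_000_000_000
CHANNELS = 45
SECONDS_PER_DAY = 86_400
SAMPLE_INTERVAL_SECS = 4  # one still image every 4 seconds of airtime

# Per-image list prices quoted above (USD).
PRICE_PER_IMAGE = {
    "GCP Multimodal Embeddings": 0.0001,
    "AWS Titan Multimodal": 0.00006,
    "Azure": 0.0001,
}

new_images_per_day = CHANNELS * SECONDS_PER_DAY // SAMPLE_INTERVAL_SECS  # 972,000

for provider, price in PRICE_PER_IMAGE.items():
    backfill = TOTAL_IMAGES * price
    daily = new_images_per_day * price
    print(f"{provider}: ${backfill:,.0f} backfill; "
          f"${daily:,.2f}/day (${daily * 30:,.2f}/30-day month) ongoing")
```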

Now we have 3 billion 1,408-dimension embeddings. What does it cost to actually serve these up via a realtime ANN search service?

Elasticsearch's built-in vector search uses Lucene's HNSW algorithm, which requires that the entire index be maintained in RAM via the page cache for optimal performance. Using the embeddings' native float32 values, this would require 17.04TB of RAM, or 4.26TB of RAM using quantized values. It is unclear what impact quantization would have on the high-precision matching that scholars and journalists typically request of the Visual Explorer, so let's price both options. The quantized dataset costs between $21K and $50K a month with no replication, meaning the loss of a single node causes the search service to fail and QPS is limited. The native float32 dataset would cost between $64K and $155K a month – again with no replication, so a single node loss causes service failure and QPS is limited. Just providing a single layer of replication and boosting QPS would cost $100-300K a month. In essence, the monthly hosting cost just to make the embeddings searchable can rival the cost of creating those embeddings in the first place. A more reasonable, more resilient production-grade deployment could easily run from half a million to three quarters of a million dollars per month, not counting the human administration cost of the underlying hardware. It is unclear to what degree Local SSD could offer reasonable out-of-RAM performance; we will be benchmarking this moving forward.
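These figures line up with Elasticsearch's published sizing rule of thumb for HNSW, which budgets num_vectors * 4 * (num_dimensions + 12) bytes of page-cache RAM for float32 vectors and num_vectors * (num_dimensions + 12) bytes for byte-quantized ones:

```python
# RAM required to keep the Lucene HNSW index fully in the page cache,
# per Elasticsearch's sizing rule of thumb.
NUM_VECTORS = 3_000_000_000
DIMS = 1_408

float32_bytes = NUM_VECTORS * 4 * (DIMS + 12)  # native float32 vectors
quantized_bytes = NUM_VECTORS * (DIMS + 12)    # byte-quantized vectors

print(f"float32:   {float32_bytes / 1e12:.2f} TB RAM")    # 17.04 TB
print(f"quantized: {quantized_bytes / 1e12:.2f} TB RAM")  # 4.26 TB
```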

What about a commercial vector search service? Azure AI Search supports 8.4 billion floats per partition on both its high-density, lower-performance L2 nodes and its high-performance S3 nodes, so 3,000,000,000 vectors * 1,408 dimensions = 4,224,000,000,000 total floats, and 4,224,000,000,000 / 8,400,000,000 floats per partition = 503 partitions required to store the entire dataset. Each L2 system costs $5,604.21 per month and supports a maximum of 12 partitions at significantly reduced QPS and increased latency, while higher-performance S3 systems cost $1,962.24 per month. (L2 systems have higher overall storage capacity than S3 systems, but their vector capacity is identical at 432GB total per SU.) Each SU can handle 12 partitions, meaning either an L2 or S3 deployment will require splitting the data into 42 separate SUs. (You can also approximate this by taking the 17TB total index size / 432GB per SU ≈ 40.) Running 42 L2 SUs costs $235,376.82/month, while 42 S3s cost $82,414.08/month.
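The same partition math as a sketch (the capacities and monthly prices are the figures quoted above, and the 12-partitions-per-SU grouping follows the article's simplification):

```python
import math

# Azure AI Search: partitions and service units (SUs) needed for the full dataset.
NUM_VECTORS = 3_000_000_000
DIMS = 1_408
FLOATS_PER_PARTITION = 8_400_000_000  # quoted vector capacity per partition
PARTITIONS_PER_SU = 12

# Monthly per-SU prices quoted above.
MONTHLY_PRICE = {"L2": 5_604.21, "S3": 1_962.24}

total_floats = NUM_VECTORS * DIMS                            # 4.224 trillion
partitions = math.ceil(total_floats / FLOATS_PER_PARTITION)  # 503
sus = math.ceil(partitions / PARTITIONS_PER_SU)              # 42

for tier, price in MONTHLY_PRICE.items():
    print(f"{tier}: {sus} SUs -> ${sus * price:,.2f}/month")
```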

What about GCP's Vertex AI Vector Search? Vector Search breaks search into two separately billed tasks: the monthly index building cost and serving the index via a cluster of VMs. You pay separately to construct the index and to make it searchable. If the entire dataset were precomputed and batch-loaded at the start, the monthly batch indexing cost is "number of examples * number of dimensions * 4 bytes per float * $3.00 per GB", totaling $50,688 per month for index construction and maintenance. Of course, this is not a static index – it continues to grow in realtime, updating every few minutes. To minimize indexing costs, let's use Streaming Updates to add new data. These cost $0.45/GB, but "your index is rebuilt when streamed data reaches 1GB, or after 3 days, whichever comes first. The full index rebuilding job for Streaming Updates is charged at the batch index building price of $3/GB." This means that a full index rebuild is triggered once we have appended 1GB of data, or 3 days after a smaller insert. Assuming we slow our updates to run only daily, rather than in realtime, we would be charged for an effective 517TB of indexing per month, for a total of $1.5M for indexing alone, according to the GCP Pricing Calculator. Updating the index hourly would be charged at 12.4PB or $37.2M per month. Given that we generate 972,000 new images per day, that works out to 972,000 * 1,408 dimensions * 4 bytes = 5.5GB/day of new embeddings, which means that with Streaming Updates triggering a full index rebuild every 1GB, we would trigger up to 6 full index rebuilds a day, for an actual cost of 3.1PB of charged index updates per month, totaling $9.3M.
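Here's a sketch of that indexing cost model under the different update cadences discussed above (the $3/GB rebuild price and 1GB streaming trigger are from GCP's published pricing as quoted; small differences from the in-text figures are rounding):

```python
import math

# Vertex AI Vector Search index-building costs under different update cadences.
# Every full rebuild is billed at $3.00/GB of total index size.
NUM_VECTORS = 3_000_000_000
DIMS = 1_408
NEW_IMAGES_PER_DAY = 972_000
PRICE_PER_GB = 3.00

index_gb = NUM_VECTORS * DIMS * 4 / 1e9               # ~16,896 GB total index
new_gb_per_day = NEW_IMAGES_PER_DAY * DIMS * 4 / 1e9  # ~5.5 GB/day of new vectors

rebuilds_per_month = {
    "static one-time build": 1,
    "daily rebuilds": 30,
    "streaming (1 rebuild per 1GB streamed)": math.ceil(new_gb_per_day) * 30,
    "hourly rebuilds": 24 * 30,
}

for cadence, rebuilds in rebuilds_per_month.items():
    print(f"{cadence}: ${rebuilds * index_gb * PRICE_PER_GB:,.0f}/month")
```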

While GCP does not publish an official formula for estimating serving costs, from the recommended serving configuration table we find that an effective formula is to take "number of embeddings * number of dimensions * 4 bytes per float * QPS", divide by 12.5 billion to arrive at the number of serving nodes, then multiply by $1.012/hour * 24 hours * 30 days. In this case, we get $48,090 a month in serving costs. Thus, index construction is the dominant cost.
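And a sketch of that reverse-engineered serving estimate (to be clear, this is our empirical fit from GCP's recommended-configuration table, not an official pricing formula; the $1.012/hour per-node price is the figure implied by the arithmetic above):

```python
import math

# Reverse-engineered Vertex AI Vector Search serving estimate (empirical fit,
# not an official GCP formula): nodes = vectors * dims * 4 bytes * QPS / 12.5e9.
NUM_VECTORS = 3_000_000_000
DIMS = 1_408
NODE_PRICE_PER_HOUR = 1.012  # implied per-node hourly price
HOURS_PER_MONTH = 24 * 30

def serving_nodes(qps: float) -> int:
    return math.ceil(NUM_VECTORS * DIMS * 4 * qps / 12.5e9)

def monthly_cost(nodes: int) -> float:
    return nodes * NODE_PRICE_PER_HOUR * HOURS_PER_MONTH

# The $48,090/month figure above corresponds to roughly 66 serving nodes:
print(f"${monthly_cost(66):,.2f}/month")  # $48,090.24
```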

Thus, for GCP, serving this dataset as a static, unchanging dataset with no additions would cost $50,688 index building + $48,090 serving = $98,778 per month. If the dataset were updated in realtime, the cost could exceed $9M a month.

GCP does not publish an absolute limit on the number of floats a single index can contain (its documentation references only "billions"). Azure AI Search has an upper bound of 21 billion documents per index, though large vector datasets will max out well below that level because vector storage is limited to just a fraction of total cluster storage.

Of course, this comparison does not evaluate the respective services' maximal and median QPS, latency, stability or other factors, only raw service provision cost. But it offers a stark reminder of the exponentially higher costs of realtime embedding search and why, for many enterprises, it may not yet be suitable for very large datasets.