Semantic search, Retrieval Augmented Generation (RAG), multimodal search – what do all of these technologies have in common? Under the hood, they all rely on so-called "embedding" models that transform a textual, visual, audio or multimodal input into a vector of numbers representing it in a high-dimensional semantic space. Yet these models, like everything else in the AI landscape, are regularly updated. The problem is that an updated embedding model yields completely different results from its predecessor, so each time a model is updated and the old model is deprecated, an enterprise has no choice but to reprocess its entire archive through the new model to generate new embeddings. Most major commercial embedding providers have so far exempted their embedding models from their deprecation schedules, but a growing number are previewing or signaling that such schedules are coming. This means an enterprise that spends $10M computing embeddings over a massive multimodal archive, using a commercial embedding model whose SLA states that each model version will be available for only 6 months before being shut down and replaced, will have to spend that $10M all over again every 6 months like clockwork. Few companies are thinking about the ongoing costs of embedding-based architectures – what are some of the major things they should be considering?
Most companies today with large content archives are at least exploring embedding systems, typically for semantic or multimodal search or for generative search. Generative search engines rely upon RAG, which can be instantiated in two ways: keyword-augmented search and embedding search. The former merely takes the user's query and uses the LLM to transform it into a series of ordinary keywords that are then used for traditional keyword search, with the results provided back to the LLM for summarization, Q&A, etc. The latter is the more commonly understood RAG architecture, in which the search index is first run through an embedding model to generate an embedding for each document; at search time, the user's query is transformed into an embedding, an ANN vector search finds the top K documents whose embeddings are closest to the query embedding, and the underlying documents are provided back to the LLM for summarization.
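As a rough sketch of that second flow (the embed() helper below is a toy stand-in for whatever hosted or self-hosted embedding model is in use, and the brute-force cosine scan stands in for a production ANN index such as FAISS or ScaNN):

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashes text to a unit vector.
    In production this would be a call to the hosted or self-hosted model,
    and the vectors would actually capture semantics."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

# 1. Offline (the expensive step): embed every document in the archive.
documents = ["quarterly earnings report", "satellite imagery analysis notes",
             "press release archive, 2019-2024"]
doc_vectors = np.stack([embed(d) for d in documents])

# 2. Online: embed the query with the SAME model, then find the top-K nearest
#    documents by cosine similarity (vectors are unit length).
def search(query: str, k: int = 2):
    q = embed(query)             # must be the same model used for the archive
    scores = doc_vectors @ q
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]

# 3. The retrieved documents are then handed to the LLM for summarization / Q&A.
print(search("earnings"))
```

The operationally important detail is in step 2: the query-time embedding call must hit the exact same model that produced the archive's vectors, which is where model deprecation bites.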
No matter how embeddings are used, at the end of the day they are the output of an AI model run over an input. Unlike almost any other kind of AI model, whose outputs can continue to be used even if the model goes away, embedding outputs are useless for the majority of use cases if the model ever becomes unavailable. For example, if a vast archive of documents is translated through a given machine translation model and that model is replaced with a better one, there may be a discontinuity in error and quality between the older and newer model-generated content, but the original translations can still be used as-is.
In contrast, imagine a semantic or generative search solution over an archive of 100 billion multimodal documents that have been run through an embedding model at a cost of $10M. When a search is performed, the user's query must be run through that exact same model to convert it into an embedding that can be compared against the archive. If the embedding model is deprecated and goes away, those 100 billion embeddings are effectively useless, since user queries can no longer be transformed into the same embedding space in which to search. The existing corpus can still be clustered, but new documents can't be added, existing documents can't be updated, and the archive cannot be searched. Companies today tend to focus on generating embeddings rather than asking what happens if the model is no longer available, or they fixate on small test corpora that can easily be rerun rather than evaluating what happens when a multi-million-dollar embedding corpus is suddenly rendered useless by an embedding model turndown. Worse, these turndown dates are set by the hosting companies. While larger big-spending enterprise customers may have some influence in extending them, most companies will be forced to upgrade on a timeframe established by the hosting company – which might fall during a high-growth period and undercut their ability to fully capitalize on that growth as resources are regularly diverted to reembedding and rebuilding ANN infrastructure.
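To make the failure mode concrete, here is a minimal numpy sketch in which random projections stand in for two generations of an embedding model; it shows why an index built with one model cannot be queried with its successor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two generations of an embedding model: each projects the
# same raw features into its own, unrelated vector space.
model_v1 = rng.standard_normal((128, 64))   # old model -> 64-dim embeddings
model_v2 = rng.standard_normal((128, 96))   # replacement model -> 96-dim embeddings

def embed(features, model):
    vecs = features @ model
    return vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)

corpus = rng.standard_normal((1000, 128))
index_v1 = embed(corpus, model_v1)           # archive embedded (expensively) with v1

# While v1 exists, embedding the query with v1 finds the right document.
query = corpus[42]                           # the query is literally document #42
print(int(np.argmax(index_v1 @ embed(query, model_v1))))   # -> 42

# After v1 is turned down, only v2 is available. Its embeddings live in a
# different space (here, even a different dimensionality), so the old index
# simply cannot be queried any more:
try:
    index_v1 @ embed(query, model_v2)
except ValueError as err:
    print("incompatible with the v1 index:", err)
# The only remedy is to reembed every document in the archive with v2.
```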
Spending millions of dollars to repeatedly reembed a vast content archive may not even amount to a rounding error for some companies, but for many enterprises it is far from an inconsequential sum – and that is before counting the cost of reindexing the new embeddings into their ANN vector search solutions and of running old and new stacks in parallel during each upgrade period.
What are possible solutions to this?
- Self Hosting. For our own embedding work, we self-host all of the models we use on our own GCE VMs. This ensures we have continual access to them – there is still a risk that a model depends on a framework or framework features that are eventually deprecated, cutting off access anyway – but controlling the model hosting gives a company far more control over its upgrade schedule (see the pinning sketch after this list).
- Contractual Guarantees. Companies using hosted models should require legally binding contractual guarantees that provide for a fixed turndown date, in much the same way they typically negotiate support and availability contracts for other enterprise software. Companies should estimate the total size of the archive they plan to embed, the cost of embedding the full archive, the cost of updating all of their ANN infrastructure, the cost of downtime/upgrade time, etc., compare that against the benefits of regularly refreshing to the latest models, and decide how often they want to upgrade (a back-of-envelope sketch follows below). A company with a relatively modest content archive that might cost only a few hundred dollars to reembed might not care at all. A company paying $1M to embed its corpus, but which can afford the absolute latest model updates, might choose a contract that assumes it will reprocess its entire archive every quarter, while a company with a massive archive that costs tens of millions of dollars to embed might ask for a contract that guarantees access to a given model version for multiple years.
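As one illustrative way to implement the self-hosting approach – the model ID, revision, and paths below are placeholders rather than a description of our actual setup – an exact revision of an open embedding model can be pinned to storage you control and served from that copy:

```python
# Pin an exact revision of an open embedding model to durable local storage so
# it can be served from our own VMs regardless of upstream changes.
# The model ID, revision, and path below are illustrative placeholders.
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

local_path = snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",  # example open model
    revision="main",                    # pin a specific commit hash in practice
    local_dir="/models/embedding-v1",   # storage we control
)

model = SentenceTransformer(local_path)
vectors = model.encode(["example document"], normalize_embeddings=True)
print(vectors.shape)
```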
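And as a back-of-envelope sketch of the contractual math – every figure below is a hypothetical assumption to be replaced with an enterprise's own numbers:

```python
# Back-of-envelope reembedding cost model; all figures are hypothetical.
archive_tokens      = 100e12     # e.g. 100B documents * ~1,000 tokens each
price_per_1m_tokens = 0.10       # $ per 1M tokens from the embedding provider
ann_rebuild_cost    = 250_000    # $ to rebuild and validate the ANN/vector index
parallel_infra_cost = 100_000    # $ to run old + new stacks during cutover
refreshes_per_year  = 2          # forced upgrades if model versions live ~6 months

embed_cost_per_refresh = archive_tokens / 1e6 * price_per_1m_tokens
total_per_refresh = embed_cost_per_refresh + ann_rebuild_cost + parallel_infra_cost
annual_cost = total_per_refresh * refreshes_per_year

print(f"per refresh: ${total_per_refresh:,.0f}   per year: ${annual_cost:,.0f}")
```

Comparing that annual figure against the measured quality gain from newer models is what should drive how long a contractual model-availability window to ask for.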
Few enterprises have had to confront these issues to date, but as embedding models are deployed at ever-larger scales and commercial hosting providers begin enforcing deprecation schedules for their embedding models, this will become a growing consideration.