
Append-Based Keyword Search Versus Rebuild-From-Scratch Embedding Databases: Technical & Methodological Challenges

One of the most underappreciated challenges of implementing large-scale embedding databases in the enterprise is their fundamental difference from keyword-based search indexes: with each model update they must be rebuilt from scratch, rather than merely appended to in perpetuity like keyword systems. This creates immense technical and methodological challenges that enterprises have yet to fully appreciate.

Traditional enterprise keyword search systems are based on the concept of an inverted index, typically sharded over clusters of hundreds, thousands or even tens of thousands of machines. Inverted indexes are functionally append-only systems (though most support updates and deletions, and internally perform more complex operations such as compaction). As new documents are indexed, their constituent words are simply appended to the index. Critically, this means that as new words emerge into the lexicon, no special handling or index rebuilding is necessary: they are simply added like any other word. Thus, as "Covid-19" or "Mpox" sprang into being, no special action was required from keyword search engines: the words were simply indexed like any other and were immediately available for search. This is a critically important property of keyword search engines: emergent or rare terms (such as a new startup with an invented word as its name) are searchable the instant they are indexed. Searchers therefore do not have to consider whether the terms they wish to search are novel words or preexisting words – they just search.
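
To make the append-only property concrete, here is a minimal sketch in Python, with a simple dictionary standing in for a production sharded index such as Lucene's:

```python
from collections import defaultdict

# Minimal sketch of an inverted index's append-only update path. A real
# system adds sharding, compaction and positional data, but the core
# behavior is the same: a novel term simply gains a new posting list,
# with no rebuild of any existing entry.
index = defaultdict(list)  # term -> list of document IDs

def index_document(doc_id, text):
    for term in text.lower().split():
        index[term].append(doc_id)

index_document(1, "novel coronavirus outbreak reported")
index_document(2, "covid-19 cases surge worldwide")  # emergent term, no special handling

print(index["covid-19"])  # [2] -- searchable the instant it is indexed
```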

In contrast, embedding models represent a snapshot of language at the time they were created. New words that emerge are not represented by the model and are either ignored or are tokenized into substrings of letters that do not capture the actual meaning of the word. For example, even "mpox" is not captured by the latest generation of LLM-based embedding models given that most have knowledge cutoffs in mid-2021, while "Covid-19" does not exist in many of the most commonly-used previous-generation embedders. Searches that incorporate these terms may still succeed to some degree if those terms dominate disease-related content indexed into the underlying ANN database, but for short queries the results may be largely irrelevant and even for longer queries, the percentage of unrelated results will often be substantial.
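
The tokenization behavior is easy to observe. The sketch below uses the Hugging Face transformers library with an arbitrary fixed-vocabulary model purely as an illustration; the exact subword pieces depend on the vocabulary in question:

```python
# Illustrative only: any fixed-vocabulary tokenizer exhibits the same
# behavior; the model chosen here is an arbitrary example.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# An established word typically maps to a single known token...
print(tok.tokenize("outbreak"))

# ...while a post-cutoff word is shattered into subword fragments
# (e.g. something like ['mp', '##ox']) that carry none of its meaning.
print(tok.tokenize("mpox"))
```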

Existing terms that change connotation or use may also confound embedding search. Globally dominant topics like Covid can overwhelm the model's captured relationships, and unexpected relationships can emerge: for selected queries, Covid-related terms return Ebola coverage rather than Covid coverage, as the global-pandemic contextualization of Ebola's later coverage overwhelms the model's ability to topically constrain the search.

In other words, an embedding-based search database is equivalent to freezing the word lookup of an inverted index at a moment in time and not indexing any words that arrive in the future that weren't in the index at that moment: the results may still be somewhat related, but performance will be substantially degraded.

Embedding models don't have the concept of isolated appending: any update that adds a new word also affects the model's representation of other words, and vectors produced by the updated model occupy a different vector space that is not comparable with the old one. This means that in practice, any major model update requires recomputing the embeddings for every document and completely reindexing the entire database from scratch.
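
A hedged sketch of what that rebuild looks like structurally (the embedder and the brute-force "index" here are hypothetical stand-ins for whatever model and ANN library are in use):

```python
import numpy as np

def rebuild_index(doc_texts, embed_new):
    """Re-embed every document with the new model and build a fresh index.

    Old vectors occupy a different, incomparable vector space, so there
    is no "append only the new words" shortcut: every document passes
    through the new model and the old index is discarded wholesale.
    """
    ids, vectors = [], []
    for doc_id, text in doc_texts.items():  # every document, no exceptions
        ids.append(doc_id)
        vectors.append(embed_new(text))     # the compute-bound step
    return ids, np.vstack(vectors)          # swap in atomically for the old index

# Toy stand-in embedder so the sketch runs end to end.
rng = np.random.default_rng(0)
ids, matrix = rebuild_index({1: "doc one", 2: "doc two"},
                            lambda text: rng.standard_normal(4))
print(matrix.shape)  # (2, 4)
```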

It is true that keyword-based inverted index databases must typically be reindexed periodically during major version upgrades. Many databases offer binary compatibility for minor version upgrades, meaning existing indexes can often be reused or minimally changed. For major upgrades, however, the entire database may need to be reindexed from scratch. Reindexing, though, is almost exclusively IO-bound rather than compute-bound. This means the reindexing process can be dramatically accelerated through the use of high-IOPS temporary direct-attached disk or RAM disk that is then block-copied to standard persistent SSD after index construction. Disk is also relatively cheap in the commercial cloud. Critically, however, companies can schedule these upgrades according to their own needs. In the midst of a vertical surge in demand for Covid-19 searches, a company can delay its planned upgrade and reindexing for a few months until load subsides. In contrast, if an embedding model does not properly represent a critical new term and relevance suffers, the company might be forced to deploy a new model unexpectedly while under load.

Importantly, keyword-based inverted indexes involve only reindexing: a database of hundreds of billions of documents need only be split into its constituent words and positional information.

Embedding databases, on the other hand, involve both reindexing and recomputation. A database of hundreds of billions of documents must first be reembedded, with every single document processed through the new embedding model. Larger embedding models typically have substantial compute requirements and can take a large fraction of a second or even multiple seconds to generate the embedding for a given passage. Some require multi-GPU or even GPU-cluster environments, vastly increasing their costs.
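
A back-of-envelope calculation makes the scale vivid (every figure below is an illustrative assumption, not a benchmark):

```python
# Illustrative reembedding-time estimate; all numbers are assumptions.
docs        = 1_000_000_000  # one billion documents
sec_per_doc = 0.5            # "a large fraction of a second" per passage
gpus        = 1_000          # GPUs dedicated to the rebuild

gpu_seconds = docs * sec_per_doc
days = gpu_seconds / gpus / 86_400
print(f"{days:,.1f} days on {gpus:,} GPUs")  # ~5.8 days under these assumptions
```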

Hosted models that charge a per-token rate will incur substantial costs for reembedding. For example, OpenAI's Davinci V1 model costs $0.20 per 1,000 tokens. For an archive of English documents (where OpenAI's tokenization rate is typically close to one token per word), an archive of one billion 1,000-word articles would cost $200M to reembed. Even its cheapest Ada v2 model, at just $0.0001 per 1,000 tokens, would cost more than $100,000. These numbers exclude the rate and capacity limits most commercial providers enforce, which would likely require the reprocessing to be spread over time rather than completed in a single rapid burst. Even self-hosted models can strain available resources given the limited availability of both GPUs and high-performance CPUs in both in-house and cloud environments: even within the commercial cloud it can be difficult to rapidly acquire tens of thousands of late-generation GPUs on a moment's notice.
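
The arithmetic behind those figures, made explicit:

```python
# Cost arithmetic from the paragraph above.
docs           = 1_000_000_000  # one billion articles
tokens_per_doc = 1_000          # ~one token per word for English text

def reembed_cost(price_per_1k_tokens):
    return docs * tokens_per_doc / 1_000 * price_per_1k_tokens

print(f"Davinci v1 at $0.20/1K tokens: ${reembed_cost(0.20):,.0f}")    # $200,000,000
print(f"Ada v2 at $0.0001/1K tokens:   ${reembed_cost(0.0001):,.0f}")  # $100,000
```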

Embedding models are unpredictable when changed. The addition of a single word can have unforeseen cascading impacts across the model, with previously strong relationships among other terms suddenly weakened or broken. This, in turn, can have downstream impacts on an enterprise if, for example, categorization models depend on the embedding model clustering terms in a certain way. While reindexing a keyword-based index has no downstream impacts, adjusting an embedding model can have existential, breaking impacts on downstream applications.

Once the documents have been fully reembedded, scalability can be a significant challenge in embedding search applications. Most ANN (approximate nearest neighbor) implementations, including hosted versions, tend to have significantly weaker scaling characteristics than their inverted index equivalents, especially with higher-dimensionality embeddings. For example, a one billion document repository with 2,048-dimension embeddings represents around 2,048 dimensions x 1B documents x 4 bytes = 8.2TB (using 4 bytes per dimension as a rule of thumb). For one large commercial cloud ANN offering, that would represent 8.2TB / 20GB per shard = 410 shards at a maximum of less than 1,000 QPS, with 820 deployed index nodes required for 1-2K QPS. In contrast, the default maximum number of available nodes for that product is limited to 50 and can typically be increased only to 100. While ANN solutions continue to scale rapidly, they impose vastly more limited options for index size, maximum QPS and latency compared with inverted index solutions.
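
The storage and shard arithmetic, made explicit (the 20GB-per-shard and node-count figures are those of the commercial offering cited above):

```python
# Scaling arithmetic from the paragraph above.
docs          = 1_000_000_000
dims          = 2_048
bytes_per_dim = 4           # float32 rule of thumb
shard_bytes   = 20 * 10**9  # 20GB per shard for the cited offering

index_bytes = docs * dims * bytes_per_dim
print(f"index size: {index_bytes / 1e12:.1f} TB")      # 8.2 TB
print(f"shards:     {index_bytes / shard_bytes:.0f}")  # ~410, vs. a 50-node default cap
```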

Methodologically, not only do embedding models suffer from their inability to accommodate emerging terms and changing connotations, but their inability to perform exact matches means they present unique challenges when it is a specific term, such as the name of a company or a precise statement or meme, that is of interest, rather than a topic. Embedding models, for example, are not typically suitable candidates for searching for excerpted leader statements or the name of a new startup. Instead, they are typically paired with keyword search: embedding search selects a set of top-K candidates that are then reranked via exact keyword match, but this is in turn constrained by how well the embedding model surfaced those results.
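
A hedged sketch of that pairing, with brute-force cosine similarity standing in for a real ANN index, a hypothetical embedder, and a made-up startup name as the query. Note that the exact-match stage can only recover documents the embedding stage surfaced in the first place:

```python
import numpy as np

def hybrid_search(query, embed, doc_vectors, doc_texts, k=100):
    # Embedding stage: select top-K candidates by cosine similarity
    # (a production system would use an ANN index here).
    qv = embed(query)
    sims = doc_vectors @ qv / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(qv))
    candidates = np.argsort(-sims)[:k]
    # Keyword stage: exact-match rerank over only those K candidates --
    # a document the embedding model missed can never be recovered here.
    exact = [i for i in candidates if query.lower() in doc_texts[i].lower()]
    return exact or list(candidates)  # fall back to the semantic ordering

texts = {0: "Acme Robotics raises funding", 1: "startup news roundup"}
vecs = np.vstack([np.ones(4), np.full(4, 0.5)])
print(hybrid_search("Acme Robotics", lambda q: np.ones(4), vecs, texts, k=2))  # [0]
```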

In the end, embedding models offer powerful new semantic search capabilities, but carry with them very real and unique technical and methodological challenges that many enterprises are less familiar with addressing.