The GDELT Project

Search At Spanner Scale: Retrieval, Not Search, Is Now The Limiting Factor At Scale

One of the most methodologically interesting aspects of the AI revolution is the way in which it has fundamentally altered the scale at which we can ask the most complex reasoning-intensive questions of our data. Historically the goal of any analytics pipeline was to rely as heavily as possible on the search pipeline to winnow down the amount of data that had to be analyzed and to focus more on the kinds of codified metadata like entities, tags, labels, knowledge graphs and other codified entries that could be readily represented and filtered using traditional database indexes. Fulltext search at scale often relied on approximations and precomputation and the end goal was always to return to the user or analytic pipeline  just the bare minimum number of the most relevant results. Indeed, whether using metadata indexes, graph extraction or fulltext search, the end goal of search was always the same: return the fewest most relevant results possible, such as a page of a few dozen search results rather than all 100 billion matching web pages.

The AI era has upended all of this, with one of the most powerful use cases for AI lying in its ability to discern the subtle patterns and connections that are only visible when looking at data at scale. In the past, one of the greatest limitations was the scalability of the search infrastructure to tractably and performantly search immense datasets. Today's hyperscale databases like Spanner are capable of such scale and parallelism that it is possible to search even the largest datasets under the heaviest loads in effectively realtime given a sufficient number of nodes. The limitation now lies in the need for AI-driven analytic pipelines to examine an increasing fraction of those results. Instead of returning a few dozen to a few hundred results, the most advanced workflows may need to examine billions of results, shifting the computational burden from the search stage to the retrieval stage.

For example, many of our new Gemini-powered data journalism experiments require precise day-level timelines counting the exact number of times a given phrase was spoken or appeared onscreen on global television news over the past quarter-century in order for Gemini to reason about the temporal patterns in its coverage. While Gemini has access only to the final CSV timeline that contains only those daily counts and cannot see the content itself, generating many of these timelines requires Spanner to scan, unpack and aggregate hundreds of millions to billions of date records to form the final timeline. A fulltext search index can be heavily optimized to return the top 100 results for a keyword in milliseconds. Unpacking billions of INT64 dates from disk to make an exact timeline of the appearances of the word "trump" or "live" on television news simply approaches the limits of what can be done computationally in realtime. In other cases, we are exploring AI-assisted filtering workflows in which a human user describes in rich detail the precise search results they are interested in and a model performs post-filtering of the keyword results to identify the most relevant matches to return to the human user, with no data being used to train or tune any model.

As we relaunch the massively expanded Television News Explorer, we are reimagining many of the design principles of the Explorer as we think about how to accommodate this new retrieval-centric model of search. Across the board, we're leveraging Spanner's immense speed and scalability to make even the largest of these new retrieval-centric workflows tractable.