Search At Spanner Scale: Optimizing The TV News Explorer’s Timelines By Using A Fixed Count Limit

Spanner's raw hyperscale performance has been on full display as we've been benchmarking and tuning the schema and architectural design of the forthcoming Spanner-powered TV News Explorer. Using a 4-node Spanner Enterprise Plus SSD deployment and date-sorted and date-ordered index layouts, retrieval queries return in realtime with almost imperceptible cluster load, despite searching an index of more than 1.5 billion seconds of OCR'd airtime with a fulltext index of more than 1.4TB. Improving the performance of our timeline aggregations has remained a major challenge, however, with query times rising from sub-second returns for ordinary keywords to tens of seconds for the most pervasive terms like "trump" or "live".

For example, creating a day-level timeline aggregation across the entire 1fps OCR'd text of three channels spanning 15 years takes just 300ms and 1.3s of CPU (summed across all 4 nodes) when searching for an uncommon word like "Estonia" that matched 220K rows. In contrast, a term like "live" (the typical onscreen label for American television news designating that the footage is being broadcast live on air rather than prerecorded) matches 1.14 billion rows and takes 47s and 731s combined CPU time to generate the timeline aggregation, placing significant load on the cluster. While we could use a stopword list to block searches for ubiquitous words like "live", we obviously have to allow searching for terms like "Trump" that take 13s and 255s CPU time and match 382M records, so we had to find a way to speed things up.

When we first began our optimization journey, the extended runtime of queries like "live" or "trump" were huge focal areas as we worked to prevent queries for common terms from overwhelming the cluster and slowing searches down for all users. We extensively optimized the table and index schemas, precomputing date aggregations, explored sorted and ordered sharding, changing the number of nodes, etc, as we tried to bring down these long-tail query times. As we started deeply analyzing the query plans using Gemini, however, we realized that Spanner's fulltext engine was actually extremely performant and efficient – the long latency was due simply to the sheer number of records that need to be processed for long-tail queries.

When Spanner is asked to make a timeline of the number of seconds in which the word "live" appears over the past 15 years across all three core channels, it consults the index and finds that there are 1.14 billion matching rows. Even though the fulltext index is a covering index, Spanner still must scan 1.14 billion datestamps (these have already been pre-aggregated from second-resolution timestamps to day-level datestamps) totaling more than 7.6GB. Just reading 1.14 billion datestamps from disk, uncompressing and unpacking them and then aggregating them into day-level bins is simply extremely computationally intensive and will never be something that can finish in a few hundred milliseconds – we are simply approaching the limits of what can be done computationally in realtime.

Thus, rather than the long-tail latencies being a limitation of Spanner, we find that Spanner is actually being exceptionally performant, but there is simply no way that any search platform can scan and aggregate hundreds of millions to billions of records to generate a precise exact timeline of ubiquitous search terms. The latency lies not in the search stage, but in the immense volume of data that must be processed and aggregated at query time.

For the majority of search terms, latency is a few tens to hundreds of milliseconds and it is only a relatively small universe of terms that yields so many results as to bog the system down. Initially, we tried addressing this by creating a "recency" table that contained a rolling window of just the last 3 months of data with the recognition that the majority of searches are for topics and events that actively in the news right now and where the primary interest of the journalist or scholar is reporting on the most recent coverage, rather than a retrospective look at a decade and a half of coverage. A journalist reporting on the reflecting pool, the Iran MOU or developments in Ukraine are all likely to care most about the last few months. Unfortunately, while this reduced the number of records Spanner had to process for high frequency terms, many words like "Iran" still yielded tens of millions of results.

Ultimately, we came up with what we believe is the perfect compromise for the TV News Explorer: we will initially limit the timeline to the most recent 100K results, with the option for the user to click a button to expand the timeline to the full set of results with the understanding that it might take up to a minute to generate. For most searches, they will likely yield less than 27.7 hours of onscreen mentions and thus yield an exact timeline that accounts for all mentions 2001-present. For all other searches, the timeline will extend back over the most recent 100K results, likely covering the most recent several years for most queries. Only for the highest-volume queries like "Trump" or "LIVE" will the timeline cover just the past few days with a warning message to the user that the timeline has been truncated. Most importantly, this allows us to enforce a maximum fixed cost for all queries, regardless of the number of actual search results, placing an upper bound on the resource consumption and latency of even the largest queries.

The GDELT Project

Search At Spanner Scale: Optimizing The TV News Explorer's Timelines By Using A Fixed Count Limit

Archives