As we explore how to help journalists and scholars move beyond rote keyword searches and think of television news in terms of "stories" and "coverage," we continue to scale up our story segmentation experiments and the indexes they produce, which let journalists and scholars jump directly to stories of interest. One of the most important questions underpinning this work is where the limits of story indexing lie and which kinds of news coverage it handles better or worse. To study such questions at ever-larger scale, we have completed story cataloging of the entire 1.4M-hour closed-captioned Television News Archive, spanning portions of the past quarter century. We used Gemini 2.5 Flash Thinking to "watch" all 1.4M hours by reading their closed captioning and semantically clustering it into an annotated catalog index of all the stories in that news coverage.
We used the standard off-the-shelf public Gemini 2.5 Flash Thinking model without any modifications. No data was used to train, tune or otherwise contribute to any model: we used Gemini only to create an index of each broadcast.
In all, more than 1.37 million broadcasts totaling 5 billion seconds (1.4M hours) and 10B words (104B characters) were cataloged, yielding an index of 12.7M stories across 50.7M clips (a single story is often reported in several separate segments of a broadcast). Gemini processed 38 billion input tokens (around 2.75 characters per token) and produced 13.8B cataloging output tokens and 15.9B thinking tokens. Those 12.7M stories covered 249.5M topics (a topic every 41 words), 121M emotions, 117M locations (a location mention every 86 words) and 102M entities (an entity every 100 words).
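The per-word and per-token rates above follow from simple division of the quoted totals. The sketch below recomputes them as a sanity check; it is only an illustration, the constant and function names are hypothetical, and the inputs are the rounded figures from the text rather than exact counts.

```python
# Rounded totals as quoted in the text; the exact underlying counts are not given.
SECONDS = 5e9
WORDS = 10e9
CHARACTERS = 104e9
INPUT_TOKENS = 38e9
STORIES = 12.7e6
CLIPS = 50.7e6
ANNOTATIONS = {"topic": 249.5e6, "location": 117e6, "entity": 102e6}


def derived_ratios():
    """Recompute the rates quoted in the text from the rounded totals."""
    ratios = {
        "hours": SECONDS / 3600,
        "chars_per_token": CHARACTERS / INPUT_TOKENS,
        "clips_per_story": CLIPS / STORIES,
    }
    # Average number of spoken words between successive annotations of each kind.
    for kind, count in ANNOTATIONS.items():
        ratios[f"words_per_{kind}"] = WORDS / count
    return ratios


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value:,.2f}")
```

Because the inputs are rounded, the results land close to, but not exactly on, the quoted rates: roughly 40 rather than 41 words per topic, for example. The clip and story totals also imply that a story appears in about four clips on average.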
Incredibly, cataloging 5 billion seconds (1.4M hours) of airtime spanning 10 billion spoken words to index 12.7M stories covered across more than 50.7M clips cost just $42,900. A cost at this level is well within what large libraries could afford in order to catalog their own large video archives and help journalists and scholars identify the stories and coverage relevant to their research.
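To put the total in terms a library could budget against, the quoted figures can be reduced to unit costs. This is a rough sketch using only the rounded numbers from the text; the names are hypothetical, and the results are blended averages over this archive, not a price quote for any other collection.

```python
# Rounded figures as quoted in the text.
COST_USD = 42_900
HOURS = 1.4e6
BROADCASTS = 1.37e6
STORIES = 12.7e6


def unit_costs():
    """Average cataloging cost per hour, per broadcast and per story."""
    return {
        "usd_per_hour": COST_USD / HOURS,
        "usd_per_broadcast": COST_USD / BROADCASTS,
        "usd_per_story": COST_USD / STORIES,
    }


if __name__ == "__main__":
    for name, value in unit_costs().items():
        print(f"{name}: ${value:.4f}")
```

On these figures the cost works out to about three cents per hour of airtime and about a third of a cent per indexed story.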