The Internet Archive's TV News Archive Turns 10: A Look Back At A Decade Of Pioneering Collaboration Reimagining Media Research

The Internet Archive's Television News Archive was launched to the public ten years ago today. GDELT and its founder Kalev Leetaru have been collaborating with the Archive for most of that decade, inaugurating new ideas, workflows, infrastructure, interfaces and tools that have made it ever more accessible and analyzable by scholars and journalists and pioneering fundamentally new approaches to analyzing, exploring and understanding television news at scale.

As the TV News Archive enters its second decade, we look back on a decade of collaboration in forging the future of media research.

The TV News Archive's launch marked the debut of an extraordinary new public resource for journalism and scholarship and the unveiling of an unprecedented new tool for the citizenry of a democratically elected nation to hold their government accountable. Former FCC Chairman Newton Minow noted at the time that it "builds upon broadcasters’ public interest obligations. This new service offers citizens exceptional opportunities to assess political campaigns and issues, and to hold powerful public institutions accountable." Sunlight Foundation Co-founder and Executive Director Ellen Miller touted at its launch that "it expands transparency by making words and deeds captured on a historically hard-to-rewind medium easy to find and review. This searchable video library puts research tools once available to only a small media elite into the hands of average citizens, enabling us all make our leaders more accountable." PBS NewsHour Executive Producer Linda Winslow offered in 2012 that it "represents an unprecedented opportunity for the American public to take ownership over the news, to relive and reconstruct the past, to deconstruct the language and images of modern communication, and to gain a better understanding of one of the most powerful mediums in the world: Television." In a nod to just how revolutionary the idea of a publicly searchable TV news archive was at the time, CBS News president Andrew Heyward proclaimed at its debut that "you have to see this service to believe it – and even then, you may not. The Internet Archive has harnessed today’s extraordinary advances in computing power and storage capacity to capture virtually every national U.S. television news program and allow users to find and view short streamed clips on any subject."

Today the Archive preserves 9.5 million broadcasts spanning 6.6 million hours of television news from more than 50 countries dating back 20 years, with its continuous holdings of US national television news spanning more than a decade. Journalists and scholars have relied on its vast holdings for more than 2,000 articles and research papers over the past decade.

Earlier this year, the New York Times relied on the Archive for its massive four-part series examining all 1,150 episodes of the Tucker Carlson show and how his data-driven approach to promoting conspiracy theories and falsehoods has helped drive topics like “replacement theory” into the national discourse. The New York Times’ research was cited on the Senate floor, in a Senate hearing and in a letter from the Senate Majority Leader to Fox News, covered by media outlets all across the world and named a finalist for the 2022 Online Journalism Awards (OJAs).

The research, infrastructure, tools and interfaces GDELT has created in collaboration with the Archive over the years have helped enable many of these projects.

In fact, GDELT has been working with the Archive almost since the beginning to pioneer the future of media research.

Less than a year after the Archive's public debut, GDELT's Kalev Leetaru unveiled the first at-scale interactive maps of the geography of television news, visualizing the “where” in the stories Americans were hearing each day. This was followed a year later by the first at-scale visualizations of the emotion of television news, applying sentiment mining tools covering thousands of dimensions across half a decade of television news.

The immense promise of these early studies led to the conception of the Archive’s “virtual reading room” in which scholars could submit data mining algorithms to be run on the Archive’s servers to enable at-scale non-consumptive scholarship over these vast holdings of television.

No comparable public interest video archive of this scale had ever been made accessible beyond summary form to scholars and journalists, so there was no existing roadmap to follow. How could petabytes of video and millions of hours of closed captioning transcripts be made non-consumptively accessible to researchers? Would scholars just need a login to a secure computing cluster at the Archive to run non-consumptive computing codes? Or would translating all of this data to scholarship require the creation of entirely new kinds of interfaces and interaction methodologies?

In time, we discovered that what scholars and journalists needed most of all was ready-made interfaces where they could simply search and browse the collection through precomputed metadata and search tools. This led in 2015 to the Campaign 2016 tracker hosted by The Atlantic that tracked television news coverage of the major presidential candidates. This captured the attention of the journalism world in showcasing the potential of the Archive as a data source for creating a more informed electorate.

That same year we released three other interactive visualizations, one cataloging political advertisements in selected markets, another tracking how President Obama’s State of the Union address was excerpted by television news channels around the world and the third repeating that same tracking process for the First Republican Debate. Uniquely, these visualizations took datasets the Archive had already produced and made them accessible to journalists, scholars and the general public by creating easy-to-use interfaces around them.

The combination of these three efforts showcased the importance of interface. It wasn’t enough for the Archive to simply publish datasets or make secure computing infrastructure available: to truly serve researchers, the Archive would need to create easy-to-use interactive interfaces to all of this data. Importantly, these experiences emphasized that the extreme size and scale of the Archive placed it beyond the reach even of the rarefied subset of the data science community accustomed to working with large datasets.

That finding led the following year to the Television Explorer, which offered advanced keyword search and a range of premade instant analytic visualizations designed specifically for the needs of journalists and scholars and refined over time based on their feedback. This proved a pivotal moment in unlocking the power of the Archive for journalists and fact checkers, leading to a flood of research and articles enabled by this new interface.

In 2020, the AI TV Explorer launched, allowing keyword search of all onscreen text for selected channels, as well as visual search for around 20,000 common objects and activities. In 2021, the complementary Radio Ngrams collection launched, transforming the Internet Archive’s holdings of 26 billion words of radio news spanning 550 stations over half a decade into a resource for scholars.

With the COVID-19 pandemic in 2020, public health authorities, fact checkers, journalists and researchers quickly became overwhelmed with the “infodemic” of falsehoods that filled the information vacuum and struggled with how best to communicate science’s ever-evolving understanding of the pandemic to a weary and frightened public. In August of that year, the Media-Data Research Consortium (M-DRC), with whom GDELT works closely to analyze television news, was awarded a Google Cloud COVID-19 Research Grant of Google Cloud Platform credits to support “Quantifying the Covid-19 Public Health Media Narrative Through TV & Radio News Analysis,” enabling the application of computer vision, OCR, ASR and NLP tools to a decade of past disease outbreaks captured in the Archive, the results of which were then integrated into the AI TV Explorer.

Earlier in this 10th anniversary year, the Archive explored for the first time the archiving of a major global event in realtime: Russia’s invasion of Ukraine. A selection of Belarusian, Russian and Ukrainian television news channels was rapidly added to the Archive’s collections in conjunction with subject matter experts and made available in near-realtime to journalists and scholars from around the world to understand how the war was being told through the medium of television news. Importantly, by capturing domestic Russian television news, journalists could examine the parallel narrative universe being constructed for its domestic audience.

This unprecedented new collection carried with it similarly unprecedented new challenges. Unlike US television news, none of this new content was closed captioned, making it inaccessible to the keyword captioning search at the center of the Archive’s public interface. No comparable public access television news archives had ever experimented with multilingual speech recognition at this scale, meaning there were no blueprints for collection-scale transcription across multiple languages of this magnitude.

Most importantly, however, war coverage is fundamentally different from the kinds of research that the Archive had been used for in the past: the visual narrative of war is often as important or even more important than the spoken one. This meant that we needed a fundamentally new kind of interface to the Archive that would allow journalists and scholars to rapidly visually search months of coverage across multiple channels looking for subtle nuances and highly contextualized portrayals that are beyond the capabilities of current AI tools. In the end, we created a fundamentally new kind of visual interface to television news called the TV Visual Explorer that for the first time made television news “skimmable.” In the months since its release, the Visual Explorer has been expanded to an ever-growing portion of the Archive, reaching today more than 885,000 broadcasts from 34 channels spanning 15 countries and territories in 12 languages and dialects over the past decade.

All of this pioneering work has come over a decade of collaborations between the Television News Archive and GDELT, and more recently the M-DRC, creating fundamentally new kinds of research, infrastructure and interfaces that have opened the Archive's vast holdings to journalists and scholars, in the process helping forge entirely new understandings of how to make data of this size and scale accessible to researchers.

Unlike the keyword search engines and Excel-sized datasets journalists and scholars have long been familiar with, these new global-scale datasets and fundamentally new kinds of AI and visual annotations represent a profound departure from how most reporters and researchers think about media. The idea of visually searching television or opening up onscreen text as a rich new modality requires not just new ways of thinking about media, but fundamentally new tooling and workflows that require hand-in-hand collaborations with journalists and scholars. Our 2021 collaboration with First Draft in studying the role of cable television news in amplifying falsehoods involved hand-in-hand collaboration to translate their research question (how presidential tweets drive the news cycle) into an entirely new technical and methodological workflow of combining video AI annotations with social media archives to enable them to quantify for the first time the degree to which Donald Trump was able to drive the news cycle by tweet. The New York Times’ landmark Tucker Carlson analysis was enabled through new researcher-specific pipelines that permitted them to conduct a media analysis of unprecedented scale. That collaboration helped inform key design principles for the Visual Explorer that will now make it possible for other journalism teams to embark upon similar analyses.

Through all of this, GDELT has worked closely with the Archive and M-DRC to collaborate hand-in-hand with journalists and scholars from across the world to help them ask previously impossible questions, working together to innovate new interfaces, tools, datasets, methodologies and workflows to answer them, then translating those lessons into an easy template for others to follow, taking once-impossible questions and making them routine, much as we took the once-unimaginable idea of television captioning search and turned it into the intuitive and non-technical TV Explorer. Through these collaborations, GDELT has transformed what were once bespoke endeavors requiring massive computing resources and months of research into established tools used every day by scholars and journalists following in those footsteps.

GDELT's fundamental work in exploring the application of video, speech, still image and textual AI approaches to annotating and enriching global television news at collection scale has helped forge new opportunities and seed new research possibilities, proving both the feasibility of applying audiovisual AI at scale to television news and the immense potential of such annotations for scholars and journalists. Most recently, following the same template pioneered by GDELT with the Television News Archive, the Vanderbilt Television News Archive, the very first scholarly television news archive, announced its own new initiative to create automated transcripts of its archives, expanding GDELT's model to add customized speech recognition models tailored by year. A growing number of other major audiovisual archives have also cited GDELT's work in launching their own AI-powered innovation programs for their archives. In this way, GDELT increasingly serves as a centralized knowledge facilitator and innovation center driving new approaches to television news research that are already leading to major new research opportunities.

In short, over the past decade, GDELT has not only fundamentally reimagined the possible in media research but has also seeded new innovation for the field at large, both conducting pioneering new research that has transformed what is achievable and created entirely new visions for audiovisual research, and creating the tools, interfaces and technical workflows that have made it possible for journalists and scholars across the world to harness these immense leaps forward in their own work.

As the web increasingly transitions to a video-first medium, the TV Archive contains a wealth of insights and lessons into how stories have been told through video over the past nearly quarter-century and how those visual and spoken storytelling approaches differ across the world and over time. Much as emblem books offer a window into how complex and nuanced topics were concisely conveyed centuries ago, television news offers a window into how the world has used the complex medium of video to tell stories from the complex and technical to the lighthearted and aspirational to the political and divisive all across the world.

Within this massive combined archive of global news coverage spanning text, imagery, audio and video lie the stories and narratives that have defined the 21st century. From the global events that captivated the world to the infinite stories that received more local attention, to existential changes in news consumption like the rise of social media, the digital revolution and algorithmic filter bubbles, this combined archive is essentially a catalog of planet Earth itself. Yet its immense size and scale mean much remains to be done to unlock the full potential of this vast archive for scholars and journalists.

As the Television News Archive takes its first steps into its second decade, it is enabling journalism and scholarship of the fourth estate which forms not only the lens through which the public sees the world, but the very bedrock of the democratic system increasingly under attack across the world. In its first decade, the Television News Archive built the infrastructure to collect, archive and preserve television news from around the world and unlocked the power of this vast and incredibly unique archive for journalists and researchers from across the world.

Given all we've accomplished together over the past decade, the TV Archive's second decade promises to fundamentally transform and reimagine how journalists, scholars, policymakers and the public are able to make sense of the world around us.