![](https://blog.gdeltproject.org/wp-content/uploads/2022-tve-visual-channel-comparer-all-1064x410.png)
Yesterday we previewed some initial statistics from our work identifying and removing advertisements from closed captioning transcripts across the TV News Archive. While the majority of the Archive consists of MPEG2 streams that contain the captioning mode information necessary to delineate advertisements, a subset of broadcasts was collected through alternate technical workflows that yielded standalone SRT files lacking such captioning information. This subset consists especially of local channels collected across the nation during major election cycles to capture a fuller picture of the nation's communities. Thus, the most complete representation of human captioning in the Archive comes in the form of these SRT files, which we are supplementing with the newly compiled TTXT files identifying advertising. What does this massive archive look like in its totality?
In total, there are 2.068 million captioned broadcasts totaling 6.795 billion seconds (113M minutes / 1.89M hours) of airtime over the Archive's 17 years of captioning. Since around 60% of advertisements are uncaptioned, only 5.54 billion seconds of that airtime (92.3M minutes / 1.54M hours) contain at least one word in the captioning. Altogether the captioning contains 74 billion characters, 13.8 billion words and 3 billion captioning lines. The average captioned word over the past 17 years is 5.34 characters long, the average speaking rate is 2.5 words per second (150 words per minute), and the average captioning line contains 4.6 words. While we are still working on language statistics, the vast majority of the captioned archive is in English, with a small subset in Spanish.
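As a rough sanity check, the derived figures above can be approximately reproduced from the rounded totals. The sketch below is illustrative only: the variable and function names are hypothetical, and it assumes the speaking rate is measured against the seconds that contain captioned words rather than against total airtime, which is the reading that matches the quoted 2.5 words per second.

```python
# Rounded totals as quoted in the text; names here are illustrative only.
TOTAL_SECONDS = 6.795e9      # airtime across all captioned broadcasts
CAPTIONED_SECONDS = 5.54e9   # seconds containing at least one captioned word
TOTAL_CHARS = 74e9
TOTAL_WORDS = 13.8e9
TOTAL_LINES = 3e9


def derive_stats(total_seconds, captioned_seconds, chars, words, lines):
    """Derive averages and unit conversions from the headline totals."""
    return {
        "total_minutes": total_seconds / 60,
        "total_hours": total_seconds / 3600,
        "captioned_minutes": captioned_seconds / 60,
        "captioned_hours": captioned_seconds / 3600,
        "chars_per_word": chars / words,
        # Assumption: rate is words divided by captioned (not total) seconds.
        "words_per_second": words / captioned_seconds,
        "words_per_minute": 60 * words / captioned_seconds,
        "words_per_line": words / lines,
    }


stats = derive_stats(
    TOTAL_SECONDS, CAPTIONED_SECONDS, TOTAL_CHARS, TOTAL_WORDS, TOTAL_LINES
)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Because the inputs are rounded, the results only approximate the published figures. For example, the rounded character and word totals give roughly 5.36 characters per word rather than the quoted 5.34, a gap that is presumably due to rounding in the headline totals.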
While the TV News Archive itself collected its first broadcasts in 2001, the Archive's closed captioning data begins in 2009. Volume ramped up rapidly through 2012 and has largely remained at that level since, with the exception of a decrease in 2021, as seen in the timeline below. Blue represents the sum, by year, of the maximum timestamp of each SRT file. Orange represents the total seconds in which at least one word was recorded in the captioning (since around 60% of advertisements are not captioned and thus are blank in the captioning file):