Visual Explorer: OCR'ing A Year And A Half Of CSPAN Through Tesseract

The onscreen text of television news yields powerful insights into the content of its broadcasts, from identifying who is speaking to cataloging the day's major stories to editorializing and contextualizing stories as they are told. Extracting this text into a searchable and analyzable form requires OCR software, but television news presents a worst-case scenario for OCR, with its vast array of fonts, sizes, colors, backgrounds, textures, motion blurring and other artifacts and artistic choices that together form a rich tapestry of text in each broadcast frame. Modern commercial OCR tools yield nearly flawless results on this content, while open source tools like Tesseract still struggle significantly with it. To seed further research into the potential new kinds of insights that could be derived by searching and analyzing the onscreen text of our nation's governance using open OCR tools, today, in collaboration with the Internet Archive's TV News Archive and the multi-party Media-Data Research Consortium, we are releasing a new dataset of nearly a year and a half of Tesseract-OCR'd text from CSPAN, running January 1, 2022 through April 30, 2023, produced by applying Tesseract to each of the every-4-seconds Visual Explorer preview images. In all, 11,192 broadcasts totaling 10,375,897 images representing 41.5 million seconds of airtime were OCR'd by Tesseract, yielding 1.5GB of JSON containing 472MB of OCR'd text. In contrast to the nearly flawless results of commercial cloud OCR APIs, Tesseract's open source OCR ranges in quality from reasonable to usable to unusable, and we hope that by releasing this experimental dataset we can both spur new research into at-scale OCR analysis and seed new work by the open source OCR community on the needs of video OCR.

The results are made available as a JSONL (newline-delimited JSON) file for each broadcast, consisting of one row per Visual Explorer image. Each row contains two fields: "ocr" contains the Tesseract output for that image and "frame" identifies which of the broadcast's Visual Explorer images was OCR'd.
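Each row can be unpacked with any JSON reader; as a minimal sketch using jq (the "ROW" variable and "SHOWID" frame name below are hypothetical stand-ins mirroring the schema):

```shell
# A hypothetical row mirroring the two-field schema described above.
ROW='{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n","frame":"SHOWID-000009"}'

# printf is used rather than echo so the embedded \n escapes reach jq intact.
printf '%s' "$ROW" | jq -r .frame    # -> SHOWID-000009
printf '%s' "$ROW" | jq -r .ocr      # -> the decoded multi-line OCR text
```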

Take this CSPAN broadcast from April 29, 2023 titled "Secs. Blinken & Mayorkas Remarks on Expiration of Title 42" – here are the first few lines of the output:

{"ocr":"st e\n2X comcast dish\nniece Ae=BUCKEYE Mediacom)\n\nBROADBAND\n\nfios’ WOW! Charter\nby verizon x) nan /Sparklight\n\nU-verse C vyve ]\n\n","frame":"CSPAN_20230429_201500_Sec
{"ocr":"*\nIX comcast\nDIREC “@> BUCKEYE\nfios’ WC\nby verizon\n\\) mipco\noptimum. Usverse\n\n/Sparklight\n\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expir
{"ocr":"Me e\nDX comcast dish\n\nDIRECTV @SBUCKEYE Mediacom)\nfios’ WC\nby verizon\n\n\\} minco’ /Sparklight\n\nU-verse C vyve J\n\n","frame":"CSPAN_20230429_201500_Secs._Blinken__May
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000009"}
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000010"}
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000011"}
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000012"}
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000013"}
{"ocr":"NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000014"}
{"ocr":"i\nJ IMMIGRATION\n\nState Department\nWashi oc\n\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000021"}
{"ocr":"a | ut State Department Be cSPAN]\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000022"}
{"ocr":"ANTONY BLINKEN\n| Secretary of State\nSSS\n\n=\n","frame":"CSPAN_20230429_201500_Secs._Blinken__Mayorkas_Remarks_on_Expiration_of_Title_42-000027"}

Some frames, like "NEXT\n\nTitle 42 Ending & Border Security\n\nSec. Mayorkas Sec. Blinken\n", are completely accurate, while others contain errors like "Washi oc" that complicate search and analysis. Overall, proper names tend to be extracted accurately at least once in a given sequence, though not in every broadcast.

To download a given broadcast's Tesseract OCR file, use the following, replacing "SHOWID" with the given show identifier:


You can pair this with any JSON reader, such as the command-line "jq", to extract the text:

apt-get -y install jq
curl -s | jq -r .ocr > OUT.TXT

Skimming the file above, it is immediately clear that the phrases "ANTONY BLINKEN / Secretary of State" and "ALEJANDRO MAYORKAS / Homeland Security Secretary" dominate the output, giving a strong signal as to who is speaking in the broadcast. In fact, "Secretary of State" appears 166 times, with "Secretary of Stato" (an OCR misread of the final character) appearing an additional 14 times, while "Homeland Security Secretary" appears 265 times, giving an instant estimate of their respective speaking times. In other words, even with no visual or facial analysis whatsoever, from the OCR'd text alone we can estimate that Mayorkas received significantly more speaking time than Blinken in this press conference, assuming the chyron text was properly updated to reflect who was speaking.
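This counting approach can be sketched in a few lines of shell. The sample lines written to OUT.TXT below are a hypothetical stand-in for the real extracted text; the 4-second multiplier reflects the Visual Explorer's one-frame-every-4-seconds sampling:

```shell
# Stand-in for OUT.TXT produced by the jq extraction above (hypothetical lines).
printf 'Secretary of State\nHomeland Security Secretary\nHomeland Security Secretary\n' > OUT.TXT

# Estimate speaking time by counting chyron lines; each Visual Explorer
# frame represents roughly 4 seconds of airtime.
blinken=$(grep -c 'Secretary of State' OUT.TXT)
mayorkas=$(grep -c 'Homeland Security Secretary' OUT.TXT)
echo "Blinken: ~$((blinken * 4))s on screen  Mayorkas: ~$((mayorkas * 4))s on screen"
```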

You can download all available OCR files for a given date range using the code below and setting the start and end dates in the first line in YYYYMMDD format:

apt-get -y install parallel
apt-get -y install jq
start=20230305; end=20230311; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
mkdir INV
time cat DATES | parallel --eta 'wget -q{}.inventory.json -P ./INV/'
rm -f IDS; find ./INV/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS'
rm -rf INV
mkdir OCR
time cat IDS | parallel --eta 'wget -q{}.tocr.json -P ./OCR/'

If you don't care about what frame each block of text came from and just want a single massive text blob per broadcast, you can then run:

time find ./OCR/ -depth -name '*.tocr.json' | parallel --eta 'cat {} | jq -r .ocr > {.}.txt'

This will create a ".txt" file for each broadcast containing all of its OCR'd text from all frames concatenated together in frame order, which can then be imported into various text analysis tools. The results are often passable, but can suffer from OCR errors significant enough to make analysis difficult, such as the text below from this broadcast.

= | Wadley

—— eos

ne eon”

= |e ot ey
E:| PRESIDENT ° ie a |

2OT-ON-CRIME H jae REareied

Overall, Tesseract's OCR error rate on CSPAN is substantial enough to impede many forms of analysis, but there is typically sufficient recoverable text that, with proper experimental design, sufficiently robust tooling and creative workarounds, interesting insights may still be derived.
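One such workaround is to make searches tolerant of common character-level misreads. As a sketch, a character class recovers the "Secretary of Stato" variant noted earlier (SAMPLE.TXT below is a hypothetical stand-in for a broadcast's extracted text):

```shell
# Stand-in sample containing both the correct phrase and its OCR misread.
printf 'Secretary of State\nSecretary of Stato\n' > SAMPLE.TXT

# A character class absorbs the common e/o misread in a single pass.
grep -cE 'Secretary of Stat[eo]' SAMPLE.TXT
```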

For those interested in scaling these results to other time ranges and channels, the complete workflow we used appears below.

First, spin up a new VM. For this experiment we used a cluster of 64-core VMs to make parallelization easy, running the entire process in the RAM disk of each machine to mitigate IO limitations.

To install Tesseract, we followed these instructions to install the precompiled 5.3.1 binary from an alternative repository, since it was not yet available in the latest cloud Debian release used by our VMs (NOTE, however, that there are considerable security risks in installing from insecure repositories, and production applications would want to install the latest release from source if necessary):

apt-get install apt-transport-https
echo "deb$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
apt-get update -oAcquire::AllowInsecureRepositories=true
apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
apt-get install tesseract-ocr

Now we install any language packages we need (NOTE: you may get warnings about the dangers of insecure repositories when installing the packages below – production applications would want to install via trusted repositories):

apt-get install tesseract-ocr-eng
apt-get install tesseract-ocr-rus
apt-get install tesseract-ocr-fas
apt-get install tesseract-ocr-chi-sim
apt-get install tesseract-ocr-chi-tra
tesseract --list-langs

OCR'ing a broadcast requires downloading its Visual Explorer image ZIP, unpacking it, OCR'ing it across all available CPUs and compiling the OCR output into a single JSON file per broadcast. Here we do all of that in a single shell script.

Install the necessary prerequisites after installing Tesseract above:

apt-get -y install parallel
apt-get -y install jq
apt-get -y install unzip

Save the following file as "":

wget -q$ -P ./IMAGES/
unzip -n -q -d ./IMAGES/ "./IMAGES/$"
rm -f "./IMAGES/$"
find "./IMAGES/$1/" -depth -name "*.jpg" | parallel 'tesseract {} {.} -l eng --oem 1 --psm 3 > /dev/null 2>&1'
parallel -j 1 'jq --arg frame "{1/.}" -cRs "{\"ocr\": . , \"frame\": \$frame }" {1} >> "./OCR/{2}.tocr.json"' ::: $(find "./IMAGES/$1/" -depth -name "*.txt" | sort) ::: $1
rm -rf "./IMAGES/$1/"

Then make it executable:

chmod 755

You can then run this script with the ID of any English-language broadcast in the Visual Explorer:

time ./ CSPAN_20230430_030800_Guests_Arrive_for_2023_White_House_Correspondents_Dinner

You can switch to other languages by installing the necessary Tesseract language pack and then changing "-l eng" in the script to the respective language code.

By default, each Tesseract instance uses 4 threads, while the GNU parallel invocation above runs one Tesseract instance per core. When running a single broadcast, or even two broadcasts on a larger machine, this can be beneficial, overlapping Tesseract's CPU-hot stretches with IO and setup/teardown tasks. When pushing a machine closer to saturation, however, it incurs enormous context-switching and related overhead that can vastly slow down processing, to the point that a broadcast can take 40-60 minutes rather than 1-2 minutes or less.

When running large numbers of Tesseract processes it is far more efficient to limit each process to a single thread and run one process per core. This can be effected through an environment variable:
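Tesseract's internal parallelism comes from OpenMP, so (assuming an OpenMP-enabled build) the standard OpenMP thread cap that Tesseract's documentation points to can be exported before invoking it:

```shell
# Cap every Tesseract invocation launched from this shell at a single thread.
export OMP_THREAD_LIMIT=1
```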


On a 64-core VM, to OCR an entire month of CSPAN, we used:

start=20220101; end=20220131; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
rm -rf JSON; mkdir JSON
time cat DATES | parallel --eta 'wget -q{}.inventory.json -P ./JSON/'
rm -f IDS IDS.tmp; find ./JSON/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS.tmp'; sort IDS.tmp > IDS
wc -l IDS
rm -f PAR.LOG; cat IDS | parallel -j 8 --resume --joblog ./PAR.LOG './ {}'&

This runs 8 of our shell scripts in parallel, each of which runs 64 instances of Tesseract (one per core), for a total of 512 Tesseract instances on 64 cores. In theory, this heavy oversubscription should waste large amounts of CPU as the kernel manages a large number of threads. In practice, we found it yielded the highest throughput at scale on the specific CPU architecture of our 64-core VMs in GCE, because the 8 scripts tended to be in different states at any given time, with some downloading or unpacking a new image set, another compiling results into JSON, and so on, allowing us to overlap CPU-intensive, setup and IO tasks.

We are excited to see what kinds of fundamentally new analyses and OCR research this new dataset helps seed. Please reach out with any questions.