The GDELT Project

Experiments With Television News OCR: Tesseract Vs Cloud Vision

Television news across the world (though not in all countries) often contains significant amounts of onscreen text, from contextualizing chyrons to the names of interviewees to news scrolls of breaking events. This text can be OCR'd to generate a secondary searchable textual feed alongside traditional spoken word captioning. Unlike most other traditional OCR tasks, television news OCR tends to offer a worst-case scenario, with a pathological diversity of text colors, text fonts, text sizes, text orientations, background colors and gradients and even transparent overlays, motion blur, fade in/out and other animation effects and myriad other artifacts that are simply not a factor in traditional OCR. How well do state-of-the-art commercial and open source OCR systems perform?

For state-of-the-art commercial OCR, we'll examine GCP's Cloud Vision API, while for open source OCR we'll use Tesseract. Both tools support multilingual OCR, with Google supporting 133 languages today, along with handwritten OCR for 9 languages.

GCP's Cloud Vision API is a hosted API, so there is nothing to install locally. For Tesseract, we can install it like any other package:

apt-get -y install tesseract-ocr
tesseract --list-langs

At the moment, this installs version 4.1.1 on current Debian systems. For the purposes of testing the latest version of Tesseract, we created a temporary VM and followed these instructions to download a precompiled build of the latest 5.x release:

apt-get install apt-transport-https
echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
apt-get update -oAcquire::AllowInsecureRepositories=true
apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
apt-get install tesseract-ocr

This yields the latest version:

tesseract -v
tesseract 5.3.1
leptonica-1.79.0
libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
Found AVX512BW
Found AVX512F
Found AVX512VNNI
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3

We have to install Tesseract's language packs separately. To install support for Russian, we use:

apt-get -y install tesseract-ocr-rus

Note that unlike Cloud Vision, Tesseract requires us to specify the list of languages to search for in each image, meaning we have to know a priori the complete list of languages to expect. For television news we typically know the dominant language(s) of a given broadcaster, but this does mean that we will miss unexpected language occurrences, such as Urdu text appearing on an ABC Evening News broadcast.
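When a broadcaster mixes languages, Tesseract's -l option accepts several codes joined with "+" (as in the chi_tra+chi_sim example later on). The following is a minimal sketch, assuming hypothetical helper names join_langs and missing_langs, that builds that argument and flags any language pack that "tesseract --list-langs" does not report as installed:

```shell
# Hypothetical helpers (names are illustrative, not part of Tesseract).

# Join language codes with "+", the separator Tesseract's -l option expects.
join_langs() {
  (IFS=+; printf '%s\n' "$*")
}

# Print each requested language that `tesseract --list-langs` does not list.
# If tesseract itself is missing, every requested language is reported.
missing_langs() {
  installed=$(tesseract --list-langs 2>/dev/null)
  for lang in "$@"; do
    printf '%s\n' "$installed" | grep -qx "$lang" || printf '%s\n' "$lang"
  done
}

# Example use (commented out so that sourcing this file runs nothing):
#   missing_langs rus eng
#   tesseract ./IMAGE.jpg output -l "$(join_langs rus eng)" --oem 1 --psm 3
```

Checking for missing packs up front is one way to fail early on a frame set rather than partway through a long batch run.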

To analyze a single image through Cloud Vision we can use:

time gsutil -m -q cp "./IMAGE.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/IMAGE.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text

While to analyze a single image through Tesseract we can use:

time tesseract ./IMAGE.jpg output -l eng --oem 1 --psm 3; cat output.txt
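Since the same Cloud Vision request is repeated throughout with only the image name changing, it can be wrapped in a small helper. This is a sketch under stated assumptions: the function names vision_request_body and vision_ocr are hypothetical, [YOURPROJECTID] remains a placeholder as above, and the image is assumed to have already been copied to the bucket:

```shell
# Hypothetical helper names; the request body mirrors the curl command above.

# Print the images:annotate request body for one image already in GCS.
# $1 = full gs:// URI of the image.
vision_request_body() {
  printf '{ "requests": [ { "image": { "source": { "gcsImageUri": "%s" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' "$1"
}

# Submit the request and print the full-frame text.
# Requires gcloud, curl and jq; curl reads the body from stdin via "-d @-".
vision_ocr() {
  vision_request_body "$1" | curl -s \
    -H "Content-Type: application/json; charset=utf-8" \
    -H "x-goog-user-project:[YOURPROJECTID]" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    https://vision.googleapis.com/v1/images:annotate -d @- \
    | jq -r '.responses[].fullTextAnnotation.text'
}

# Example use (commented out so that sourcing this file runs nothing):
#   vision_ocr "gs://[YOURBUCKET]/IMAGE.jpg"
```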

To analyze an entire directory of images through Cloud Vision we can use:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
unzip RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
time gsutil -m -q cp "./RUSSIA24_20230306_123000_RIK_Rossiya_24/*.jpg" gs://[YOURBUCKET]/
mkdir TXT_CV
rm -f PAR.LOG
time find ./RUSSIA24_20230306_123000_RIK_Rossiya_24/ -depth -name "*.jpg" | parallel -j 7 --resume --joblog ./PAR.LOG --eta "[ ! -f ./TXT_CV/{/.}.json ] && curl -s -H \"Content-Type: application/json; charset=utf-8\" -H \"x-goog-user-project:[YOURPROJECTID]\" -H \"Authorization: Bearer $(gcloud auth print-access-token)\" https://vision.googleapis.com/v1/images:annotate -d '{ \"requests\": [ { \"image\": { \"source\": { \"gcsImageUri\": \"gs://[YOURBUCKET]/{/}\" } }, \"features\": [ {\"type\":\"TEXT_DETECTION\"} ] } ] }' > ./TXT_CV/{/.}.json"

Using the limit above of 7 images in flight at any moment (you can increase this immensely depending on your quota), this takes just 41 seconds.

While for Tesseract we can use:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
unzip RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
mkdir TXT_TES
rm -f PAR.LOG
time find ./RUSSIA24_20230306_123000_RIK_Rossiya_24/ -depth -name "*.jpg" | parallel --resume --joblog ./PAR.LOG --eta 'tesseract {} ./TXT_TES/{/.} -l rus --oem 1 --psm 3'

This takes around 16 minutes on a 16-core C2 VM.

Let's take a typical frame from a March 6th Russia 24 broadcast, in which there is a top text scroll of blue text over yellow background, bordered by black text over white, white text over dark blue, white text over light blue and white text over red, with two levels of bottom text, including white on medium-dark blue and black on white, with a bold red textual stamp towards the lower-left of the frame with transparent background over textured blue. This diverse assortment of text colors, background colors, fonts and point sizes is typical for the broadcast medium and presents a complicated scenario for OCR, with multiple zones with very different recognition requirements.

Let's extract Cloud Vision's transcription of this frame:

cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.json | jq -r .responses[].fullTextAnnotation.text

This yields (here we're extracting the final full-frame text as a blob, without incorporating location information):

15:54 КРЕМЛЬ: и Раиси позитивно оценили уровень и динамику отношений РФ и Ирана
РОССИЯ 24
UA
СТОП ФЕЙК
"РОСКОСМОС": ГРУЗОВИК "ПРОГРЕСС МС-22" УВЕЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА
М.Мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Overall, this is an extremely accurate transcription of the frame despite the pathological conditions. Google Translate translates as:

15:54 KREMLIN: and Raisi positively assessed the level and dynamics of relations between Russia and Iran
RUSSIA 24
U.A.
STOP FAKE
ROSCOSMOS: THE PROGRESS MS-22 TRUCK SEEKED THE INTERNATIONAL SPACE STATION FROM SPACE DEBRIS
Mikhail Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

Now let's try Tesseract using its default settings:

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3; cat output.txt

This yields:

15:54 УЗИ = Раиси позитивно оценили уровеньи динамику отношений РФ и Ирана [8.1

тд

Ш.
"| СТОП ФЕЙ

т ь .
РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА.
М. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

This is much less accurate and translates via Google Translate as:

15:54 UZI = Raisi positively assessed the level and dynamics of relations between the Russian Federation and Iran [8.1
td Sh. "| STOP FEI t .
ROSCOSMOS *: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE GARBAGE.
M. Mishustin: Tourism sector should become the backbone of the development of the North Caucasus Federal District

By default, Tesseract uses its own Otsu thresholding. What if we instead use Leptonica's Otsu implementation via the thresholding_method parameter?

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=1; cat output.txt

This yields slightly different results:

№
но оценили уровеньи динамику отношений РФ и Ирана РОССИЯ. 2а
= | : Е НР —

_ ие

и

*РОСКОСМОС": ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА
м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Г г. Е
СТОП ФЕЙК
8

Which translates as:

No.
but assessed the level and dynamics of relations between the Russian Federation and Iran RUSSIA. 2a
= | : E HP -

_ no

And

*ROSCOSMOS": TRUCKS "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS
M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

G G. E
STOP FAKE
8

Or Sauvola thresholding:

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

Which yields:

15:54 УЗИ = Раиси позитивно оценили уровеньи динамику отношений РФ и Ирана [8.1

тд

Ш.
"| СТОП ФЕЙ

т ь .
РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА.
М. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Which translates to:

15:54 UZI = Raisi positively assessed the level and dynamics of relations between Russia and Iran [8.1

td

Sh.
"| STOP FEY

t b .
ROSCOSMOS*: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS.
M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

This yields the closest match to Cloud Vision's OCR (though still with key errors), demonstrating the critical importance of thresholding when using Tesseract.
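To compare the three thresholding modes side by side on any frame, the individual runs above can be collapsed into a loop. This is a rough sketch: the helper names threshold_name and ocr_all_thresholds and the output naming scheme are assumptions, while the numeric values follow the usage above (0 is the default Otsu, 1 is Leptonica's Otsu, 2 is Sauvola):

```shell
# Hypothetical helpers for comparing Tesseract 5 thresholding modes.

# Map a thresholding_method value to a readable label for file names.
threshold_name() {
  case "$1" in
    0) echo "otsu" ;;
    1) echo "leptonica-otsu" ;;
    2) echo "sauvola" ;;
    *) echo "unknown" ;;
  esac
}

# OCR one frame under each method, writing <prefix>-<label>.txt per method.
# $1 = image path, $2 = language(s), $3 = output prefix (all caller-supplied).
ocr_all_thresholds() {
  for method in 0 1 2; do
    tesseract "$1" "$3-$(threshold_name "$method")" -l "$2" --oem 1 --psm 3 \
      -c thresholding_method="$method"
  done
}

# Example use (commented out so that sourcing this file runs nothing):
#   ocr_all_thresholds ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg rus frame382
```

Keeping one output file per method makes it easy to diff the results instead of overwriting output.txt on each run.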

What if we preprocess using ImageMagick and convert to greyscale before handing to Tesseract?

apt-get -y install imagemagick
convert ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg -type Grayscale ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png output -l rus --oem 1 --psm 3; cat output.txt

This yields:

[15:54] КРЕМЛ иси позитивно оценили уровеньи динамику отношений РФ и Ирана ИИ 8/2

|
*РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА

м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Translating to:

[15:54] The KREMLIS positively assessed the level and dynamics of relations between Russia and Iran AI 8/2

|
*ROSCOSMOS*: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS

M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

This yields even better results than our best native Tesseract results, again reinforcing the sensitivity of Tesseract to the thresholding process.
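The comparisons above are made by eye. As a rough way to put a number on them, the sketch below treats a reference text (for example, Cloud Vision's output saved to a file) as ground truth and reports the fraction of its distinct whitespace-separated tokens that also appear in a Tesseract output file. The function name word_recall and the file names are illustrative assumptions, and exact token matching is only a crude proxy: it penalizes a single wrong character as heavily as a missing word and is not a standard OCR metric such as character error rate.

```shell
# Hypothetical helper: fraction of distinct reference tokens found in the hypothesis.
# $1 = reference text file, $2 = hypothesis text file (e.g. Tesseract's output.txt).
word_recall() {
  awk '
    # First file (hypothesis): remember every token seen.
    NR == FNR { for (i = 1; i <= NF; i++) hyp[$i] = 1; next }
    # Second file (reference): count each distinct token once, and whether it was found.
    {
      for (i = 1; i <= NF; i++) {
        if (!($i in seen)) {
          seen[$i] = 1
          total++
          if ($i in hyp) found++
        }
      }
    }
    END { printf "%.2f\n", (total ? found / total : 0) }
  ' "$2" "$1"
}

# Example use (commented out so that sourcing this file runs nothing):
#   word_recall reference.txt output.txt
```

The printed value runs from 0.00 (no reference tokens recovered) to 1.00 (all of them recovered), which at least allows preprocessing variants to be ranked consistently on the same frame.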

What if we try a Gaussian filter and resizing to 150%, which we found in previous versions of Tesseract to noticeably improve recognition accuracy?

convert ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg -type Grayscale -filter Gaussian -resize 150% ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png output -l rus --oem 1 --psm 3; cat output.txt

This yields:

КРЕМЛЬ: ›аиси позитивно оценили уровень и динамику отношений РФ и Ирана [8:7
щи

=> №. |
*РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22* УВЕЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА

м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Which translates to:

KREMLIN: ›AISI positively assessed the level and dynamics of relations between Russia and Iran [8:7
cabbage soup

=> No. |
*ROSCOSMOS*: TRUCK "PROGRESS MS-22* DRIVED THE INTERNATIONAL SPACE STATION FROM SPACE DEBRIS"

M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

This time it captures Kremlin properly, but at the cost of adding some additional noise. If we don't convert to greyscale first, the results are even worse.

What about a different frame from that broadcast with more text?

Let's look at Cloud Vision's transcription:

cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.json | jq -r .responses[].fullTextAnnotation.text

Which yields:

15.39 КРЕМЛЬ: дальнейших контактов
Состоялся телефонный разговор Владимир РОССИЯ 24
ВОДОРОДНАЯ ПАЛИКОМ СОЗДАЮТ
И
ЭНЕРГЕТИКА
КОНСОРЦИУМ ДЛЯ
ПРОДВИЖЕНИЯ
ОБОРУДОВАНИЯ В СФЕРЕ
ВОДОРОДНОЙ
ЭНЕРГЕТИКИ
НОВЫЕ УСТАНОВКИ БУДУТ
ПРИМЕНЯТЬСЯ
В МИКРОЭЛЕКТРОНИКЕ,
МЕТАЛЛУРГИИ
И ДРУГИХ НАПРАВЛЕНИЯХ
М. Мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Translated to:

15.39 KREMLIN: further contacts
A telephone conversation took place Vladimir RUSSIA 24
HYDROGEN PALICOM CREATE
AND
ENERGY
CONSORTIUM FOR
PROMOTIONS
EQUIPMENT IN THE SPHERE
HYDROGEN
ENERGY
NEW INSTALLATIONS WILL BE
APPLY
IN MICROELECTRONICS,
METALLURGY
AND OTHER DIRECTIONS
M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

Let's try Tesseract with its default settings:

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.jpg output -l rus --oem 1 --psm 3; cat output.txt

Which yields just a small fraction of the text:

15.39 АЯ ЗМ" альнейших контактов

я,

м. мишустин: Туристическ ектор должен стать о

щ
* ,) 2

Состоялся телефонный разговор Владимир: |:{18(8%}] 2а

ВОДОРОДНАЯ
ЭНЕРГЕТИКА

сы НОВЫЕ УСТАНОВКИ БУДУТ
- ПРИМЕНЯТЬСЯ
* ВМИКРОЭЛЕКТРОНИКЕ,
"* МЕТАЛЛУРГИИ
И ДРУГИХ НАПРАВЛЕНИЯХ

ным направлением развития СКФО

Even Sauvola thresholding fails to improve the results much:

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

Which yields:

№ 15.9 ЗАЗ ШЕ альнейших контактов Состоялся телефомный разговор Владимир риы

вод ОР од АЯ "РУСАТОМ ОВЕРСИЗ"
И "ПОЛИКОМ" СОЗДАЮТ
КОНСОРЦИУМ ДЛЯ
_ ЭН НЕРГЕТИКА ПРОДВИЖЕНИЯ
ОБОРУДОВАНИЯ В СФЕРЕ

ВОДОРОДНОЙ
ЭНЕРГЕТИКИ

УРГИИ
И ДРУГИХ НАПРАВЛЕНИЯХ

м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО

Translated as:

No. 15.9 ZAZ SHEE of further contacts A telephone conversation took place Vladimir riy

water OR OD AYA "RUSATOM OVERSEAS"
AND "POLICOM" CREATE
CONSORTIUM FOR
_ ENERGY PROMOTION
EQUIPMENT IN THE SPHERE

HYDROGEN
ENERGY

URGII
AND OTHER DIRECTIONS

M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District

Grayscale conversion similarly fails to improve the results by much.

What about this example of a large single block of text, with part of it highlighted?

Once again, Cloud Vision recovers it extremely well:

cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000245.json | jq -r .responses[].fullTextAnnotation.text

Yielding:

15:45 КРЕМЛь: состоялся телефонный разговор Владимира Путина с Президентом Ислам РОССИЯ 24
раненным в результате обстрела жителям, попала под повторный обстрел с
украинской стороны.
СТОП ФЕЙК
Ранее представитель силовых структур ДНР сообщал о троих погибших сотрудниках
бригады.
Как следует из данных представительства ДНР в Совместном центре контроля и
координации вопросов, связанных с военными преступлениями Украины,
украинские войска в четверг дважды обстреливали Петровский район с
применением реактивных систем залпового огня, выпустив по нему в общей
сложности 23 снаряда. Так же они открывали по нему огонь из артиллерийских
орудий натовского калибра 155 мм. В администрации района сообщали об одной
погибшей женщине и раненом мужчине из числа мирных жителей.
Теги: Украина Россия Военная операция на Украине
МО РФ: Средства ПВО России за сутки в ходе СВо сбили 15 снарядов РСЗО HIMARS И <<УparaH>>

Translated as:

15:45 KREMLIN: Vladimir Putin had a telephone conversation with the President Islam RUSSIA 24
wounded as a result of the shelling of residents, came under repeated shelling from
Ukrainian side.
STOP FAKE
Earlier, a representative of the power structures of the DPR reported about three dead employees.
brigades.
As follows from the data of the DPR representation in the Joint Center for Control and
coordination of issues related to war crimes in Ukraine,
Ukrainian troops on Thursday shelled the Petrovsky district twice from
the use of multiple launch rocket systems, firing at him in total
difficulty 23 rounds. They also opened fire on him from artillery
155 mm NATO caliber guns. The district administration reported one
a dead woman and a wounded civilian man.
Tags: Ukraine Russia Military operation in Ukraine
Russian Defense Ministry: Russian air defense systems shot down 15 MLRS HIMARS AND <<УparaH>> shells per day during the SVO

In contrast, once again Tesseract fails to recover much of the text even with the improved thresholding:

time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000245.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

Yielding:

15:45 УАНЗУЛИЕЯ состоялся телефонный разговор Владимира Путина с Президентом Исла! "РОССИЯ 2А

раненным в результате обстрела жителям, попала под повторный обстрел с
украинской стороны.

Ранее представитель силовых структур ДНР сообщал о троих погибших сотрудника;
бригады.

Как следует из данных представительства ДНР в Совместном центре контроля и

В администрации района сообщали об одной
кенщине и раненом мужчине из числа мирных жителей. 6

погибшей

Теги: Украина Россия Военная операция на Украине

№
МОРФ: Средства ПВО России за сутки в ходе СВО сбили 15 снарядов РСЗО Н!МАЯ$ и «Ураган»

Translated as:

15:45 UANZULIEYA Vladimir Putin had a telephone conversation with the President of Isla! "RUSSIA 2A

wounded as a result of the shelling of residents, came under repeated shelling from
Ukrainian side.

Earlier, a representative of the power structures of the DPR reported three dead employees;
brigades.

As follows from the data of the DPR representation in the Joint Center for Control and

The district administration reported one
a woman and a wounded civilian man. 6

deceased

Tags: Ukraine Russia Military operation in Ukraine

No.
MORF: Russian air defense systems shot down 15 MLRS N!MAY$ and Uragan shells per day during the NMD

What about this frame from a Persian-language IRINN broadcast?

First let's see how Cloud Vision transcribes the image:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/IRINN_20230504_033000.zip
unzip IRINN_20230504_033000.zip
time gsutil -m -q cp "./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/IRINN_20230504_033000-000084.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text

Which yields:

TW
Masih Alinejad
@Alinejad Masih
#حمیدرضا_الداعی که چند روز پیش در شهر سبزوار (شهر سربداران)، برای نجات دو دختر از
دست مزاحمین و گروهی از اراذل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داده
نه بسیجی بود و نه اصلا شباهتی به بسیجی ها داشت.
به شهادت نزدیکانش او قبلا در جریان انقلاب #زن_زندگی_آزادی توسط همین بسیجی ها که امروز
جسم بی جانش را مصادره کرده اند، با شوکر مورد حمله و ضرب و شتم قرار گرفته بود.
Translate Tweet
خبر ۷
۰۷:۰۵ رئیسان جمهور برزیل و آرژانتین توافق کردند، دلار را از مبادلات تجاری خود حذف کنند

Translated to:

TW
Masih Alinejad
@Alinejad Masih
#Hamidreza_al-Da'ee who a few days ago in Sabzevar city (Sarbdaran city), to save two girls from
The hands of intruders and a group of thugs got into a fight with them and he lost his life due to a knife injury.
It was not a Basiji, nor was it similar to the Basijis at all.
According to the testimony of his relatives, he was earlier during the revolution of #Zen_Zandagi_Azadi by the same Basijs that today
His lifeless body has been confiscated, he was attacked and beaten with a stun gun.
Translate Tweet
News 7
07:05 The presidents of Brazil and Argentina agreed to remove the dollar from their trade

For Tesseract we have to install the Persian language pack:

apt-get -y install tesseract-ocr-fas

With the default settings:

time tesseract ./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg output -l fas --oem 1 --psm 3; cat output.txt

We get:

۲ 4زهصنا۸ طنو۱۸۵
6۸۵۲

ر , که چند روز پیش در شهر سبزوار (شهر سربداران)» برای نجات دو دختر از
دست مزاحمین و گروهی از اراثل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داد

ار

Translated to:

2 4 Zahesna 8 Tanu 185
6852

A few days ago in Sabzevar city (Sarbdaran city) to save two girls from
The intruders and a group of Arathal and mobs got into a fight with them and he died due to a knife injury.

Using improved thresholding:

time tesseract ./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg output -l fas --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

We get:

2

۱

#حمیدرضا_الداغی که چند روز پیش در شهر سبزوار (شهر سربداران)» برای نجات دو دختر از
دست مزاحمین و گروهی از اراثل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داد

نه بسیجی بود و نه اصلا شباهتی به بسیجی‌ها داشت.

به شهادت نزدیکانش او قبلا در جریان انقلاب #زن_زندگی_آزادی توسط همین بسیجی ها که امروز
جسم بی جانش را مصادره کرده انده با شرکر مورد حمله و ضرب و شتم قرار گرفته بود,

) رنیسان‌جمهور برزیل و آرژانتین توا

۳

کردند. دلار را از میادلات تجاری خود حذف کنند

7777 ۲ ۲۲

Translated to:

2

1

#Hamidreza_Aldaghi who a few days ago in Sabzevar city (Sarbdaran city) to save two girls from
The intruders and a group of Arathal and mobs got into a fight with them and he died due to a knife injury.

It was not a Basiji, nor was it similar to the Basijis at all.

According to the testimony of his relatives, he was earlier during the revolution of #Zen_Zandagi_Azadi by the same Basijs that today
His lifeless body was confiscated and he was assaulted and beaten with a thug.

) Renaissance of the Republic of Brazil and Argentina

3

they did Remove the dollar from their business transactions

7777 2 22

What about this Taiwanese broadcast?

We'll first try Cloud Vision:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/CTV_20230504_010000.zip
unzip CTV_20230504_010000.zip
time gsutil -m -q cp "./CTV_20230504_010000/CTV_20230504_010000-000234.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/CTV_20230504_010000-000234.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text

Which yields:

黃子佼槓天下雜誌
黃子佼
00-21更新
一直以來的筆耕。 顯示更多
|≡ 天下
獨家 · 深入星宇「心殿」 再填錢也要飛北美,張
Off婆文化
下單後價格竟五級跳!黃子
佼:網購黑膠的瞎事與驚喜
【黃子佼專欄】網路競標珍貴黑膠和
CD,其中往往暗藏陷阱。下標前要仔
細看清,避免以為賺到最後卻傷心。
文章語音、03:38
打牌天下
Sky 幫你讀文章
台北
GMA
Vore 20
encore!
字嘉蜜藏人千一年
中視新聞HD
藝人黃子佼抗議!
沒問過就擅自
改了文章的標題
甚至連內容 都大幅刪減
你放400個字跟放4000字
其實真的沒差耶
上海
寶 專欄遭天下雜誌大改!黃子佼請辭 怒轟:寫給鬼看嗎?
21-24
09:15:52 活動訊息 2023臺南國際綠色產業展 將於5/24-26日舉行

Translated to:

Huang Zi leads the world magazine
Huang Zijiao
00-21 update
All along the pen. display more
|≡ world
Exclusive · Go deep into Xingyu's "Heart Palace" and fly to North America after filling in the money, Zhang
Off-law culture
After the order was placed, the price jumped five levels! Huang Zi
Outstanding: The Stories and Surprises of Online Vinyl Shopping
[Huang Zijiao Column] Online bidding for precious vinyl and
CD, which often hides traps. Be careful before bidding
Take a closer look and avoid thinking that you will be sad in the end.
Article Voice, 03:38
playing cards
Sky reads articles for you
Taipei
GMA
Vore 20
encore!
Zijia honey Tibetan people for a thousand years
CTV News HD
Artist Huang Zijiao protests!
without asking
Changed the title of the article
Even the content has been greatly reduced
You put 400 words and put 4000 words
Actually it's not bad
Shanghai
Treasure column has been greatly changed by Tianxia Magazine! Huang Zijiao resigned and raged: Are you writing for ghosts?
21-24
09:15:52 Event Information 2023 Tainan International Green Industry Exhibition will be held on 5/24-26

For Tesseract we'll install the Chinese Simplified and Traditional packs:

apt-get -y install tesseract-ocr-chi-sim
apt-get -y install tesseract-ocr-chi-tra

And analyze the image using enhanced thresholding:

time tesseract ./CTV_20230504_010000/CTV_20230504_010000-000234.jpg output -l chi_tra+chi_sim --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

Which yields:

1抗議

NM    沒間過就擅自
讓了 改了文章的标题
一~ 誠對 甚至連內容 部大幅則減。
大< 上 1]你放400個字跟放4000字-
全僻有“=a 其實真的沒差耶
全时   日4                      于,
二三 专桶遗天下杂读大改!其子佼清赂 她嘉'寅给中看呈?

09:15:52 性本月WE 2023喜南国际和绿色产业展 將於5/24-26日舉行

Translated to:

1 protest

NM did not pass without authorization
Let me change the title of the article
One~ Honestly, even the content will be greatly reduced.
Big < 1] You put 400 words and 4000 words-
There are "=a in fact, it's really not bad.
Full-time day 4 at,
23. A major revision of the miscellaneous readings left by the world! His son Jiaoqing bribed her, Jia'yin, to Zhonghua?

09:15:52 This month WE 2023 Xinan International and Green Industry Exhibition will be held on 5/24-26

What about extracting the chyrons from this CSPAN2 broadcast? This features relatively low resolution video with blurry text and a color gradient under the text.

Using Cloud Vision:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit.zip
unzip CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit.zip
time gsutil -m -q cp "./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text

Yields a flawless transcription:

ANTI-DEFAMATION LEAGUE NATIONAL LEADERSHIP SUMMIT
EVELYN FARKAS
Former Deputy Assistant Defense Secretary for
Russia, Ukraine & Eurasia, Obama Administration
Monday
C-SPAN2

For Tesseract, we have to install the English pack:

apt-get -y install tesseract-ocr-eng

And then OCR:

time tesseract ./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg output -l eng --oem 1 --psm 3; cat output.txt

Which yields:

ANTI-DEFAMATION LEAGUE NATIONAL LEADERSHIP SUMMIT

EVELYN FARKAS
Former Deputy Assistant Defense Secretary for

Russia, Ukraine & Eurasia, Obama Administration
a Ut

Strangely, the enhanced thresholding yields a blank response, with no recognized text:

time tesseract ./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg output -l eng --oem 1 --psm 3 -c thresholding_method=2; cat output.txt

Finally, let's look at text-laden business news from this Bloomberg broadcast:

Let's look at Cloud Vision's transcript:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia.zip
unzip BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia.zip
time gsutil -m -q cp "./BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text

Which yields:

Bloomberg EUR-USD
1.1086
11:12 ET MAY 3
@BUSINESS
+0.0024 0.24%
Bloomberg Markets
Asia
NEXT
Roger
Bacon
Citi Global Wealth Investments
Head of UHNW Investments Asia
Fed rates outlook
Bloomberg Television Bloomberg.com
TV <GO>
USD-JPY
GBP-USD
134.49
1.2589
-0.22 0.22% +0.0025 0.24%
TOP NEWS
Bill Ackman Warns US
Regional Banking
System Is at Risk
PAGE 2 OF 3
First Republic Bank
was the second-biggest
bank failure in US
history, and the
fourth regional lender
to collapse since
early March after
Silvergate Capital
Corp., SVB Financial
Group's Silicon Valley
Bank and Signature
Bank.
EUR-NOK
EUR-SEK
11.8781
11.3310
-0.0177 0.14% -0.0206 0.24%
HB-EUR
).0267
UNC

Whereas Tesseract:

time tesseract ./BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg output -l eng --oem 1 --psm 3; cat output.txt

Yields:

_
TOP NEWS
Bloomberg Markets NEXT /
Asia_ Bill Ackman Warns US
| Regional Banking
/ System Is at Risk
PAGE 20F3
R ra) r >First Republic Bank
O g was the second-biggest
bank failure in US

history, and the
rs | Cc Oo n fourth regional lender
to collapse since

<e early March after
Citi Global Wealth Investments Footishseltntsl hae
: Orp., inanci
Head of UHNW Investments Asia Group's Silicon Valley

Fed rates outlook Bank and Signature

Bloomberg Television Bloomberg.com
TVG»

Bloomberg EUR-USD USD-JPY GBP-" i= EUR- NOK EUR-SEK HB-EUR

oe 1.1086 a 134.49 ' 1.2589 | 11.8781 J 11.3310 aun icy Sree

Putting this all together, we see that Google's Cloud Vision OCR performs effectively flawlessly across all of our example images. No preprocessing or language selection is needed: simply hand it an image and let it handle the rest. In contrast, even the latest version of Tesseract struggles considerably to extract usable text, even under relatively optimal conditions. The experiments above reinforce the criticality of preprocessing to Tesseract's accuracy, but also how channel-specific that preprocessing is: Sauvola thresholding yields the best results for some channels, while for others it prevents any text from being recognized at all.

Based on these experiments, one potential solution might be to use Tesseract in an initial pass to identify all of the textual zones in each frame in which text appearance is relatively consistent (similar font and background colors and similar font family, size and style), extract these as separate image files via ImageMagick, then OCR each independently. Alternatively, there are myriad customized Tesseract workflows on the web in which researchers have crafted bespoke thresholding algorithms using ImageMagick, Python scientific imaging libraries and other tools to carefully optimize text extraction for a specific domain. This suggests that with sufficient effort, it might be possible to create bespoke thresholding pipelines for each individual television news channel that sufficiently boost OCR accuracy for that channel. This in turn runs the risk that changes to the channel's layout over time (such as a shift in color schemes or fonts) could actually result in accuracy below the default baseline if the changes ran afoul of the bespoke customizations.
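The zone-based idea described above can be sketched roughly as follows. This is an untested outline rather than a pipeline we have evaluated: the helper names tsv_block_geometries and ocr_by_zone are hypothetical, it assumes that Tesseract's own block segmentation (the level-2 rows of its TSV output) is a usable approximation of visually consistent zones, and the choice of --psm 6 (a single uniform block of text) for the second pass is likewise an assumption.

```shell
# Hypothetical helpers for a two-pass, per-zone OCR workflow.

# Convert Tesseract TSV (read on stdin) into one ImageMagick crop geometry per block.
# Level-2 rows are blocks; columns 7-10 are left, top, width and height.
tsv_block_geometries() {
  awk -F '\t' '$1 == 2 { printf "%dx%d+%d+%d\n", $9, $10, $7, $8 }'
}

# First pass locates blocks; each block is then cropped and OCR'd on its own.
# $1 = image, $2 = language(s), $3 = output directory (all caller-supplied).
ocr_by_zone() {
  mkdir -p "$3"
  # The trailing "tsv" config makes Tesseract write layout.tsv with bounding boxes.
  tesseract "$1" "$3/layout" -l "$2" --oem 1 --psm 3 tsv
  n=0
  tsv_block_geometries < "$3/layout.tsv" | while read -r geom; do
    n=$((n + 1))
    # Crop the zone, reset the canvas offset and convert to grayscale.
    convert "$1" -crop "$geom" +repage -type Grayscale "$3/zone-$n.png"
    tesseract "$3/zone-$n.png" "$3/zone-$n" -l "$2" --oem 1 --psm 6
  done
}

# Example use (commented out so that sourcing this file runs nothing):
#   ocr_by_zone ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg rus ./ZONES_000382
```

Because each zone becomes its own image, thresholding or resizing could in principle be tuned per zone rather than per frame, though whether that recovers enough accuracy is exactly the open question raised here.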

In the end, Cloud Vision offers human-level OCR accuracy across the range of languages Visual Explorer currently monitors, while further research will be required to explore different kinds of customized preprocessing workflows for Tesseract to boost its accuracy to a more usable level for most channels. For chyron extraction on CSPAN, however, Tesseract did perform well, suggesting it could at least be used for chyron extraction at scale for English-language channels.