Television news across the world (though not in all countries) often contains significant amounts of onscreen text, from contextualizing chyrons to the names of interviewees to news scrolls of breaking events. This text can be OCR'd to generate a secondary searchable textual feed alongside traditional spoken word captioning. Unlike most other traditional OCR tasks, television news OCR tends to offer a worst-case scenario, with a pathological diversity of text colors, text fonts, text sizes, text orientations, background colors and gradients and even transparent overlays, motion blur, fade in/out and other animation effects and myriad other artifacts that are simply not a factor in traditional OCR. How well do state-of-the-art commercial and open source OCR systems perform?
For state-of-the-art commercial OCR, we'll examine GCP's Cloud Vision API, while for open source OCR we'll use Tesseract. Both tools support multilingual OCR across a range of languages, with Google supporting 133 languages today, along with handwritten OCR for 9 languages.
GCP's Cloud Vision API is a hosted API, so there is nothing to install locally. For Tesseract, we can install like any other package:
apt-get -y install tesseract-ocr
tesseract --list-langs
At the moment, this installs version 4.1.1 on current Debian systems. For the purposes of testing the latest version of Tesseract, we created a temporary VM and followed these instructions to download a precompiled version of the latest 5.x version:
apt-get install apt-transport-https
echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
apt-get update -oAcquire::AllowInsecureRepositories=true
apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
apt-get install tesseract-ocr
This yields the latest version:
tesseract -v
tesseract 5.3.1
leptonica-1.79.0
libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
Found AVX512BW
Found AVX512F
Found AVX512VNNI
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3
We have to install Tesseract's language packs separately. To install support for Russian, we use:
apt-get -y install tesseract-ocr-rus
Note that unlike Cloud Vision, Tesseract requires us to specify the list of languages to search for in each image, meaning we have to know a priori the complete list of languages to expect. For television news we typically know the dominant language(s) of a given broadcaster, but this does mean that we will miss unexpected language occurrences, such as Urdu text appearing on an ABC Evening News broadcast.
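Because the language list has to be chosen up front, one way to manage it is a small per-channel lookup that builds the Tesseract command for each frame. The following is a minimal Python sketch: the `CHANNEL_LANGS` table, the `tesseract_command` helper and the English fallback are our own illustrative assumptions, while the `-l`, `--oem` and `--psm` flags are the ones used throughout this post.

```python
# Hypothetical per-channel language table (an assumption for illustration);
# the language codes mirror the packs installed in this post.
CHANNEL_LANGS = {
    "RUSSIA24": ["rus"],
    "IRINN": ["fas"],
    "CTV": ["chi_tra", "chi_sim"],
    "CSPAN2": ["eng"],
}


def tesseract_command(image_path, channel, output_base="output"):
    """Build the argument list for one Tesseract run on a single frame."""
    # Assumed fallback: channels missing from the table are treated as English.
    langs = CHANNEL_LANGS.get(channel, ["eng"])
    return [
        "tesseract", image_path, output_base,
        "-l", "+".join(langs),  # multiple languages are joined with '+'
        "--oem", "1",           # same engine mode as the commands in this post
        "--psm", "3",           # same page segmentation mode as in this post
    ]


if __name__ == "__main__":
    print(" ".join(tesseract_command("./IMAGE.jpg", "CTV")))
```

Any text in a language missing from the channel's entry will still be missed, which is exactly the limitation described above.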
To analyze a single image through Cloud Vision we can use:
time gsutil -m -q cp "./IMAGE.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/IMAGE.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text
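For readers who prefer to assemble the request body and unpack the response programmatically, the same JSON structures can be sketched in Python. The helper names `build_annotate_request` and `extract_full_text` are our own; the field names are taken from the curl command and jq filter above. This sketch only builds and parses JSON and does not make the HTTP call itself.

```python
import json


def build_annotate_request(gcs_uri):
    """Return the images:annotate request body for one image stored in GCS."""
    return {
        "requests": [
            {
                "image": {"source": {"gcsImageUri": gcs_uri}},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }


def extract_full_text(response):
    """Return one string per response entry, mirroring the jq filter above.

    A frame with no detected text may have no fullTextAnnotation at all,
    so a missing annotation is mapped to an empty string here.
    """
    texts = []
    for item in response.get("responses", []):
        annotation = item.get("fullTextAnnotation") or {}
        texts.append(annotation.get("text", ""))
    return texts


if __name__ == "__main__":
    body = build_annotate_request("gs://[YOURBUCKET]/IMAGE.jpg")
    print(json.dumps(body, indent=2))
```

Unlike the jq filter, which prints null for a frame without text, this version yields an empty string, which is easier to handle when processing an entire directory of frames.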
While to analyze a single image through Tesseract we can use:
time tesseract ./IMAGE.jpg output -l eng --oem 1 --psm 3; cat output.txt
To analyze an entire directory of images through Cloud Vision we can use:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
unzip RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
time gsutil -m -q cp "./RUSSIA24_20230306_123000_RIK_Rossiya_24/*.jpg" gs://[YOURBUCKET]/
mkdir TXT_CV
rm PAR.LOG
time find ./RUSSIA24_20230306_123000_RIK_Rossiya_24/ -depth -name "*.jpg" | parallel -j 7 --resume --joblog ./PAR.LOG --eta "[ ! -f ./TXT_CV/{/.}.json ] && curl -s -H \"Content-Type: application/json; charset=utf-8\" -H \"x-goog-user-project:[YOURPROJECTID]\" -H \"Authorization: Bearer $(gcloud auth print-access-token)\" https://vision.googleapis.com/v1/images:annotate -d '{ \"requests\": [ { \"image\": { \"source\": { \"gcsImageUri\": \"gs://[YOURBUCKET]/{/}\" } }, \"features\": [ {\"type\":\"TEXT_DETECTION\"} ] } ] }' > ./TXT_CV/{/.}.json"
Using the limit above of 7 images in flight at any moment (you can increase this immensely depending on your quota), this takes just 41 seconds.
While for Tesseract we can use:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
unzip RUSSIA24_20230306_123000_RIK_Rossiya_24.zip
mkdir TXT_TES
rm PAR.LOG
time find ./RUSSIA24_20230306_123000_RIK_Rossiya_24/ -depth -name "*.jpg" | parallel --resume --joblog ./PAR.LOG --eta 'tesseract {} ./TXT_TES/{/.} -l rus --oem 1 --psm 3'
This takes around 16 minutes on a 16-core C2 VM.
Let's take a typical frame from a March 6th Russia 24 broadcast, in which there is a top text scroll of blue text over yellow background, bordered by black text over white, white text over dark blue, white text over light blue and white text over red, with two levels of bottom text, including white on medium-dark blue and black on white, with a bold red textual stamp towards the lower-left of the frame with transparent background over textured blue. This diverse assortment of text colors, background colors, fonts and point sizes is typical for the broadcast medium and presents a complicated scenario for OCR, with multiple zones with very different recognition requirements.
Let's extract Cloud Vision's transcription of this frame:
cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.json | jq -r .responses[].fullTextAnnotation.text
This yields (here we're extracting the final full-frame text as a blob, without incorporating location information):
15:54 КРЕМЛЬ: и Раиси позитивно оценили уровень и динамику отношений РФ и Ирана РОССИЯ 24 UA СТОП ФЕЙК "РОСКОСМОС": ГРУЗОВИК "ПРОГРЕСС МС-22" УВЕЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА М.Мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Overall, this is an extremely accurate transcription of the frame despite the pathological conditions. Google Translate translates as:
15:54 KREMLIN: and Raisi positively assessed the level and dynamics of relations between Russia and Iran RUSSIA 24 U.A. STOP FAKE ROSCOSMOS: THE PROGRESS MS-22 TRUCK SEEKED THE INTERNATIONAL SPACE STATION FROM SPACE DEBRIS Mikhail Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
Now let's try Tesseract using its default settings:
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3; cat output.txt
This yields:
15:54 УЗИ = Раиси позитивно оценили уровеньи динамику отношений РФ и Ирана [8.1 тд Ш. "| СТОП ФЕЙ т ь . РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА. М. мишустин: Туристический сектор должен стать опорным направлением развития СКФО
This is much less accurate and translates via Google Translate as:
15:54 UZI = Raisi positively assessed the level and dynamics of relations between the Russian Federation and Iran [8.1 td Sh. "| STOP FEI t . ROSCOSMOS *: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE GARBAGE. M. Mishustin: Tourism sector should become the backbone of the development of the North Caucasus Federal District
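To put a rough number on "much less accurate", one option is to score each Tesseract output against the Cloud Vision output for the same frame. The sketch below uses Python's standard `difflib` module; note the loud assumption that Cloud Vision's text is treated as the reference, which is a convenience rather than verified ground truth. The `similarity` and `normalize` helpers are our own names.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse all runs of whitespace so line breaks do not affect the score."""
    return " ".join(text.split())


def similarity(reference, candidate):
    """Return a 0..1 character-level similarity ratio between two OCR outputs."""
    matcher = SequenceMatcher(
        None, normalize(reference), normalize(candidate), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    cloud_vision = "СТОП ФЕЙК"  # fragment of the Cloud Vision output above
    tesseract = "СТОП ФЕЙ"      # corresponding fragment of the Tesseract output
    print(round(similarity(cloud_vision, tesseract), 3))
```

This ratio is sensitive to the order in which each engine emits its text zones, so it is best read as a coarse indicator rather than a true character error rate.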
By default, Tesseract uses its own Otsu thresholding – what if we instead use Leptonica's Otsu implementation via the thresholding_method parameter?
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=1; cat output.txt
This yields slightly different results:
№ но оценили уровеньи динамику отношений РФ и Ирана РОССИЯ. 2а = | : Е НР — _ ие и *РОСКОСМОС": ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО Г г. Е СТОП ФЕЙК 8
Which translates as:
No. but assessed the level and dynamics of relations between the Russian Federation and Iran RUSSIA. 2a = | : E HP - _ no And *ROSCOSMOS": TRUCKS "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District G G. E STOP FAKE 8
Or Sauvola thresholding:
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
Which yields:
15:54 УЗИ = Раиси позитивно оценили уровеньи динамику отношений РФ и Ирана [8.1 тд Ш. "| СТОП ФЕЙ т ь . РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА. М. мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Which translates to:
15:54 UZI = Raisi positively assessed the level and dynamics of relations between Russia and Iran [8.1 td Sh. "| STOP FEY t b . ROSCOSMOS*: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS. M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
This yields the closest match to Cloud Vision's OCR (though still with key errors), demonstrating the critical importance of thresholding when using Tesseract.
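To make concrete what a thresholding step does, here is a minimal pure-Python sketch of global Otsu thresholding over a flat list of 8-bit grayscale values. It illustrates the general technique only and is not the code used inside Tesseract or Leptonica; Sauvola thresholding, by contrast, computes a local threshold per neighborhood and is not shown here.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_low = 0  # number of pixels at or below the candidate threshold
    sum_low = 0     # sum of intensities at or below the candidate threshold
    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map each pixel to 0 or 255 depending on which side of the threshold it is."""
    return [255 if value > threshold else 0 for value in pixels]


if __name__ == "__main__":
    # Toy "frame": dark text pixels (10) on a bright background (200).
    frame = [10] * 50 + [200] * 50
    t = otsu_threshold(frame)
    print(t, binarize(frame, t)[:3], binarize(frame, t)[-3:])
```

A single global threshold like this assumes one foreground and one background brightness for the whole frame, which a broadcast frame with many differently colored text zones is unlikely to satisfy.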
What if we preprocess using ImageMagick and convert to greyscale before handing to Tesseract?
apt-get -y install imagemagick
convert ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg -type Grayscale ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png output -l rus --oem 1 --psm 3; cat output.txt
This yields:
[15:54] КРЕМЛ иси позитивно оценили уровеньи динамику отношений РФ и Ирана ИИ 8/2 | *РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22" УВЁЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Translating to:
[15:54] The KREMLIS positively assessed the level and dynamics of relations between Russia and Iran AI 8/2 | *ROSCOSMOS*: TRUCK "PROGRESS MS-22" SEEKED INTERNATIONAL SPACE STATION FROM SPACE DEBRIS M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
This yields even better results than our best native Tesseract results, again reinforcing the sensitivity of Tesseract to the thresholding process.
What if we try a Gaussian filter and resizing to 150%, which we found in previous versions of Tesseract to noticeably improve recognition accuracy?
convert ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382.jpg -type Grayscale -filter Gaussian -resize 150% ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000382-thresh.png output -l rus --oem 1 --psm 3; cat output.txt
This yields:
КРЕМЛЬ: ›аиси позитивно оценили уровень и динамику отношений РФ и Ирана [8:7 щи => №. | *РОСКОСМОС*: ГРУЗОВИК "ПРОГРЕСС МС-22* УВЕЛ МЕЖДУНАРОДНУЮ КОСМИЧЕСКУЮ СТАНЦИЮ ОТ КОСМИЧЕСКОГО МУСОРА м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Which translates to:
KREMLIN: ›AISI positively assessed the level and dynamics of relations between Russia and Iran [8:7 cabbage soup => No. | *ROSCOSMOS*: TRUCK "PROGRESS MS-22* DRIVED THE INTERNATIONAL SPACE STATION FROM SPACE DEBRIS" M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
This time it captures Kremlin properly, but at the cost of adding some additional noise. If we don't convert to greyscale first, the results are even worse.
What about a different frame from that broadcast with more text?
Let's look at Cloud Vision's transcription:
cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.json | jq -r .responses[].fullTextAnnotation.text
Which yields:
15.39 КРЕМЛЬ: дальнейших контактов Состоялся телефонный разговор Владимир РОССИЯ 24 ВОДОРОДНАЯ ПАЛИКОМ СОЗДАЮТ И ЭНЕРГЕТИКА КОНСОРЦИУМ ДЛЯ ПРОДВИЖЕНИЯ ОБОРУДОВАНИЯ В СФЕРЕ ВОДОРОДНОЙ ЭНЕРГЕТИКИ НОВЫЕ УСТАНОВКИ БУДУТ ПРИМЕНЯТЬСЯ В МИКРОЭЛЕКТРОНИКЕ, МЕТАЛЛУРГИИ И ДРУГИХ НАПРАВЛЕНИЯХ М. Мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Translated to:
15.39 KREMLIN: further contacts A telephone conversation took place Vladimir RUSSIA 24 HYDROGEN PALICOM CREATE AND ENERGY CONSORTIUM FOR PROMOTIONS EQUIPMENT IN THE SPHERE HYDROGEN ENERGY NEW INSTALLATIONS WILL BE APPLY IN MICROELECTRONICS, METALLURGY AND OTHER DIRECTIONS M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
Let's try Tesseract with its default settings:
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.jpg output -l rus --oem 1 --psm 3; cat output.txt
Which yields just a small fraction of the text:
15.39 АЯ ЗМ" альнейших контактов я, м. мишустин: Туристическ ектор должен стать о щ * ,) 2 Состоялся телефонный разговор Владимир: |:{18(8%}] 2а ВОДОРОДНАЯ ЭНЕРГЕТИКА сы НОВЫЕ УСТАНОВКИ БУДУТ - ПРИМЕНЯТЬСЯ * ВМИКРОЭЛЕКТРОНИКЕ, "* МЕТАЛЛУРГИИ И ДРУГИХ НАПРАВЛЕНИЯХ ным направлением развития СКФО
Even thresholding fails to improve the results much:
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000154.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
Which yields:
№ 15.9 ЗАЗ ШЕ альнейших контактов Состоялся телефомный разговор Владимир риы вод ОР од АЯ "РУСАТОМ ОВЕРСИЗ" И "ПОЛИКОМ" СОЗДАЮТ КОНСОРЦИУМ ДЛЯ _ ЭН НЕРГЕТИКА ПРОДВИЖЕНИЯ ОБОРУДОВАНИЯ В СФЕРЕ ВОДОРОДНОЙ ЭНЕРГЕТИКИ УРГИИ И ДРУГИХ НАПРАВЛЕНИЯХ м. мишустин: Туристический сектор должен стать опорным направлением развития СКФО
Translated as:
No. 15.9 ZAZ SHEE of further contacts A telephone conversation took place Vladimir riy water OR OD AYA "RUSATOM OVERSEAS" AND "POLICOM" CREATE CONSORTIUM FOR _ ENERGY PROMOTION EQUIPMENT IN THE SPHERE HYDROGEN ENERGY URGII AND OTHER DIRECTIONS M. Mishustin: The tourism sector should become the backbone of the development of the North Caucasus Federal District
Grayscale conversion similarly fails to improve the results by much.
What about this example of a large single block of text, with part of it highlighted?
Once again, Cloud Vision recovers it extremely well:
cat RUSSIA24_20230306_123000_RIK_Rossiya_24-000245.json | jq -r .responses[].fullTextAnnotation.text
Yielding:
15:45 КРЕМЛь: состоялся телефонный разговор Владимира Путина с Президентом Ислам РОССИЯ 24 раненным в результате обстрела жителям, попала под повторный обстрел с украинской стороны. СТОП ФЕЙК Ранее представитель силовых структур ДНР сообщал о троих погибших сотрудниках бригады. Как следует из данных представительства ДНР в Совместном центре контроля и координации вопросов, связанных с военными преступлениями Украины, украинские войска в четверг дважды обстреливали Петровский район с применением реактивных систем залпового огня, выпустив по нему в общей сложности 23 снаряда. Так же они открывали по нему огонь из артиллерийских орудий натовского калибра 155 мм. В администрации района сообщали об одной погибшей женщине и раненом мужчине из числа мирных жителей. Теги: Украина Россия Военная операция на Украине МО РФ: Средства ПВО России за сутки в ходе СВо сбили 15 снарядов РСЗО HIMARS И <<УparaH>>
Translated as:
15:45 KREMLIN: Vladimir Putin had a telephone conversation with the President Islam RUSSIA 24 wounded as a result of the shelling of residents, came under repeated shelling from Ukrainian side. STOP FAKE Earlier, a representative of the power structures of the DPR reported about three dead employees. brigades. As follows from the data of the DPR representation in the Joint Center for Control and coordination of issues related to war crimes in Ukraine, Ukrainian troops on Thursday shelled the Petrovsky district twice from the use of multiple launch rocket systems, firing at him in total difficulty 23 rounds. They also opened fire on him from artillery 155 mm NATO caliber guns. The district administration reported one a dead woman and a wounded civilian man. Tags: Ukraine Russia Military operation in Ukraine Russian Defense Ministry: Russian air defense systems shot down 15 MLRS HIMARS AND <<УparaH>> shells per day during the SVO
In contrast, once again Tesseract fails to recover much of the text even with the improved thresholding:
time tesseract ./RUSSIA24_20230306_123000_RIK_Rossiya_24-000245.jpg output -l rus --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
Yielding:
15:45 УАНЗУЛИЕЯ состоялся телефонный разговор Владимира Путина с Президентом Исла! "РОССИЯ 2А раненным в результате обстрела жителям, попала под повторный обстрел с украинской стороны. Ранее представитель силовых структур ДНР сообщал о троих погибших сотрудника; бригады. Как следует из данных представительства ДНР в Совместном центре контроля и В администрации района сообщали об одной кенщине и раненом мужчине из числа мирных жителей. 6 погибшей Теги: Украина Россия Военная операция на Украине № МОРФ: Средства ПВО России за сутки в ходе СВО сбили 15 снарядов РСЗО Н!МАЯ$ и «Ураган»
Translated as:
15:45 UANZULIEYA Vladimir Putin had a telephone conversation with the President of Isla! "RUSSIA 2A wounded as a result of the shelling of residents, came under repeated shelling from Ukrainian side. Earlier, a representative of the power structures of the DPR reported three dead employees; brigades. As follows from the data of the DPR representation in the Joint Center for Control and The district administration reported one a woman and a wounded civilian man. 6 deceased Tags: Ukraine Russia Military operation in Ukraine No. MORF: Russian air defense systems shot down 15 MLRS N!MAY$ and Uragan shells per day during the NMD
What about this frame from a Persian-language IRINN broadcast?
First let's see how Cloud Vision transcribes the image:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/IRINN_20230504_033000.zip
unzip IRINN_20230504_033000.zip
time gsutil -m -q cp "./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/IRINN_20230504_033000-000084.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text
Which yields:
TW Masih Alinejad @Alinejad Masih #حمیدرضا_الداعی که چند روز پیش در شهر سبزوار (شهر سربداران)، برای نجات دو دختر از دست مزاحمین و گروهی از اراذل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داده نه بسیجی بود و نه اصلا شباهتی به بسیجی ها داشت. به شهادت نزدیکانش او قبلا در جریان انقلاب #زن_زندگی_آزادی توسط همین بسیجی ها که امروز جسم بی جانش را مصادره کرده اند، با شوکر مورد حمله و ضرب و شتم قرار گرفته بود. Translate Tweet خبر ۷ ۰۷:۰۵ رئیسان جمهور برزیل و آرژانتین توافق کردند، دلار را از مبادلات تجاری خود حذف کنند
Translated to:
TW Masih Alinejad @Alinejad Masih #Hamidreza_al-Da'ee who a few days ago in Sabzevar city (Sarbdaran city), to save two girls from The hands of intruders and a group of thugs got into a fight with them and he lost his life due to a knife injury. It was not a Basiji, nor was it similar to the Basijis at all. According to the testimony of his relatives, he was earlier during the revolution of #Zen_Zandagi_Azadi by the same Basijs that today His lifeless body has been confiscated, he was attacked and beaten with a stun gun. Translate Tweet News 7 07:05 The presidents of Brazil and Argentina agreed to remove the dollar from their trade
For Tesseract we have to install the Persian language pack:
apt-get -y install tesseract-ocr-fas
With the default settings:
time tesseract ./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg output -l fas --oem 1 --psm 3; cat output.txt
We get:
۲ 4زهصنا۸ طنو۱۸۵ 6۸۵۲ ر , که چند روز پیش در شهر سبزوار (شهر سربداران)» برای نجات دو دختر از دست مزاحمین و گروهی از اراثل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داد ار
Translated to:
2 4 Zahesna 8 Tanu 185 6852 A few days ago in Sabzevar city (Sarbdaran city) to save two girls from The intruders and a group of Arathal and mobs got into a fight with them and he died due to a knife injury.
Using improved thresholding:
time tesseract ./IRINN_20230504_033000/IRINN_20230504_033000-000084.jpg output -l fas --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
We get:
2 ۱ #حمیدرضا_الداغی که چند روز پیش در شهر سبزوار (شهر سربداران)» برای نجات دو دختر از دست مزاحمین و گروهی از اراثل و اوباش با آنها درگیر شد و بر اثر اصابت چاقو جانش را از دست داد نه بسیجی بود و نه اصلا شباهتی به بسیجیها داشت. به شهادت نزدیکانش او قبلا در جریان انقلاب #زن_زندگی_آزادی توسط همین بسیجی ها که امروز جسم بی جانش را مصادره کرده انده با شرکر مورد حمله و ضرب و شتم قرار گرفته بود, ) رنیسانجمهور برزیل و آرژانتین توا ۳ کردند. دلار را از میادلات تجاری خود حذف کنند 7777 ۲ ۲۲
Translated to:
2 1 #Hamidreza_Aldaghi who a few days ago in Sabzevar city (Sarbdaran city) to save two girls from The intruders and a group of Arathal and mobs got into a fight with them and he died due to a knife injury. It was not a Basiji, nor was it similar to the Basijis at all. According to the testimony of his relatives, he was earlier during the revolution of #Zen_Zandagi_Azadi by the same Basijs that today His lifeless body was confiscated and he was assaulted and beaten with a thug. ) Renaissance of the Republic of Brazil and Argentina 3 they did Remove the dollar from their business transactions 7777 2 22
What about this Taiwanese broadcast?
We'll first try Cloud Vision:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/CTV_20230504_010000.zip
unzip CTV_20230504_010000.zip
time gsutil -m -q cp "./CTV_20230504_010000/CTV_20230504_010000-000234.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/CTV_20230504_010000-000234.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text
Which yields:
黃子佼槓天下雜誌 黃子佼 00-21更新 一直以來的筆耕。 顯示更多 |≡ 天下 獨家 · 深入星宇「心殿」 再填錢也要飛北美,張 Off婆文化 下單後價格竟五級跳!黃子 佼:網購黑膠的瞎事與驚喜 【黃子佼專欄】網路競標珍貴黑膠和 CD,其中往往暗藏陷阱。下標前要仔 細看清,避免以為賺到最後卻傷心。 文章語音、03:38 打牌天下 Sky 幫你讀文章 台北 GMA Vore 20 encore! 字嘉蜜藏人千一年 中視新聞HD 藝人黃子佼抗議! 沒問過就擅自 改了文章的標題 甚至連內容 都大幅刪減 你放400個字跟放4000字 其實真的沒差耶 上海 寶 專欄遭天下雜誌大改!黃子佼請辭 怒轟:寫給鬼看嗎? 21-24 09:15:52 活動訊息 2023臺南國際綠色產業展 將於5/24-26日舉行
Translated to:
Huang Zi leads the world magazine Huang Zijiao 00-21 update All along the pen. display more |≡ world Exclusive · Go deep into Xingyu's "Heart Palace" and fly to North America after filling in the money, Zhang Off-law culture After the order was placed, the price jumped five levels! Huang Zi Outstanding: The Stories and Surprises of Online Vinyl Shopping [Huang Zijiao Column] Online bidding for precious vinyl and CD, which often hides traps. Be careful before bidding Take a closer look and avoid thinking that you will be sad in the end. Article Voice, 03:38 playing cards Sky reads articles for you Taipei GMA Vore 20 encore! Zijia honey Tibetan people for a thousand years CTV News HD Artist Huang Zijiao protests! without asking Changed the title of the article Even the content has been greatly reduced You put 400 words and put 4000 words Actually it's not bad Shanghai Treasure column has been greatly changed by Tianxia Magazine! Huang Zijiao resigned and raged: Are you writing for ghosts? 21-24 09:15:52 Event Information 2023 Tainan International Green Industry Exhibition will be held on 5/24-26
For Tesseract we'll install the Chinese Simplified and Traditional packs:
apt-get -y install tesseract-ocr-chi-sim
apt-get -y install tesseract-ocr-chi-tra
And analyze the image using enhanced thresholding:
time tesseract ./CTV_20230504_010000/CTV_20230504_010000-000234.jpg output -l chi_tra+chi_sim --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
Which yields:
1抗議 NM 沒間過就擅自 讓了 改了文章的标题 一~ 誠對 甚至連內容 部大幅則減。 大< 上 1]你放400個字跟放4000字- 全僻有“=a 其實真的沒差耶 全时 日4 于, 二三 专桶遗天下杂读大改!其子佼清赂 她嘉'寅给中看呈? 09:15:52 性本月WE 2023喜南国际和绿色产业展 將於5/24-26日舉行
Translated to:
1 protest NM did not pass without authorization Let me change the title of the article One~ Honestly, even the content will be greatly reduced. Big < 1] You put 400 words and 4000 words- There are "=a in fact, it's really not bad. Full-time day 4 at, 23. A major revision of the miscellaneous readings left by the world! His son Jiaoqing bribed her, Jia'yin, to Zhonghua? 09:15:52 This month WE 2023 Xinan International and Green Industry Exhibition will be held on 5/24-26
What about extracting the chyrons from this CSPAN2 broadcast? This features relatively low resolution video with blurry text and a color gradient under the text.
Using Cloud Vision:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit.zip
unzip CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit.zip
time gsutil -m -q cp "./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text
Yields a flawless transcription:
ANTI-DEFAMATION LEAGUE NATIONAL LEADERSHIP SUMMIT EVELYN FARKAS Former Deputy Assistant Defense Secretary for Russia, Ukraine & Eurasia, Obama Administration Monday C-SPAN2
For Tesseract, we have to install the English pack:
apt-get -y install tesseract-ocr-eng
And then OCR:
time tesseract ./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg output -l eng --oem 1 --psm 3; cat output.txt
Which yields:
ANTI-DEFAMATION LEAGUE NATIONAL LEADERSHIP SUMMIT EVELYN FARKAS Former Deputy Assistant Defense Secretary for Russia, Ukraine & Eurasia, Obama Administration a Ut
Strangely, the enhanced thresholding yields a blank response, with no recognized text:
time tesseract ./CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit/CSPAN2_20230504_110200_Susan_Rice_Speaks_at_Anti-Defamation_League_Leadership_Summit-000238.jpg output -l eng --oem 1 --psm 3 -c thresholding_method=2; cat output.txt
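Since a thresholding method that helps one channel can return nothing at all on another, a simple safeguard is to try several methods in turn and keep the first non-empty result. The sketch below is one hedged way to do that in Python: the `ocr_with_fallback` helper and the default order `(2, 0)` are our own assumptions, and `run_tesseract` simply shells out to the same command line used above, writing to stdout instead of a file.

```python
import subprocess


def run_tesseract(image_path, lang, thresholding_method):
    """Run Tesseract once and return the recognized text from stdout."""
    result = subprocess.run(
        [
            "tesseract", image_path, "stdout",
            "-l", lang, "--oem", "1", "--psm", "3",
            "-c", f"thresholding_method={thresholding_method}",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_with_fallback(image_path, lang, methods=(2, 0), runner=run_tesseract):
    """Try each thresholding method in order; return (method, text).

    The order (2, 0), Sauvola first and then the default, is an assumption
    for illustration. Returns (None, "") if every method yields no text.
    """
    for method in methods:
        text = runner(image_path, lang, method).strip()
        if text:
            return method, text
    return None, ""


if __name__ == "__main__":
    # Stand-in runner so the sketch can be tried without Tesseract installed:
    # it mimics the blank Sauvola response seen for this frame.
    def fake_runner(path, lang, method):
        return "" if method == 2 else "EVELYN FARKAS\n"

    print(ocr_with_fallback("frame.jpg", "eng", runner=fake_runner))
```

A non-empty result is not necessarily an accurate one, so this only guards against the blank-output failure seen here rather than selecting the best transcription.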
Finally, let's look at text-laden business news from this Bloomberg broadcast:
Let's look at Cloud Vision's transcript:
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia.zip
unzip BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia.zip
time gsutil -m -q cp "./BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg" gs://[YOURBUCKET]/
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[YOURBUCKET]/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' | jq -r .responses[].fullTextAnnotation.text
Which yields:
Bloomberg EUR-USD 1.1086 11:12 ET MAY 3 @BUSINESS +0.0024 0.24% Bloomberg Markets Asia NEXT Roger Bacon Citi Global Wealth Investments Head of UHNW Investments Asia Fed rates outlook Bloomberg Television Bloomberg.com TV <GO> USD-JPY GBP-USD 134.49 1.2589 -0.22 0.22% +0.0025 0.24% TOP NEWS Bill Ackman Warns US Regional Banking System Is at Risk PAGE 2 OF 3 First Republic Bank was the second-biggest bank failure in US history, and the fourth regional lender to collapse since early March after Silvergate Capital Corp., SVB Financial Group's Silicon Valley Bank and Signature Bank. EUR-NOK EUR-SEK 11.8781 11.3310 -0.0177 0.14% -0.0206 0.24% HB-EUR ).0267 UNC
Whereas Tesseract:
time tesseract ./BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia/BLOOMBERG_20230504_030000_Bloomberg_Markets_Asia-000191.jpg output -l eng --oem 1 --psm 3; cat output.txt
Yields:
_ TOP NEWS Bloomberg Markets NEXT / Asia_ Bill Ackman Warns US | Regional Banking / System Is at Risk PAGE 20F3 R ra) r >First Republic Bank O g was the second-biggest bank failure in US history, and the rs | Cc Oo n fourth regional lender to collapse since <e early March after Citi Global Wealth Investments Footishseltntsl hae : Orp., inanci Head of UHNW Investments Asia Group's Silicon Valley Fed rates outlook Bank and Signature Bloomberg Television Bloomberg.com TVG» Bloomberg EUR-USD USD-JPY GBP-" i= EUR- NOK EUR-SEK HB-EUR oe 1.1086 a 134.49 ' 1.2589 | 11.8781 J 11.3310 aun icy Sree
Putting this all together, we see that Google's Cloud Vision OCR performs effectively flawlessly across all of our example images. No preprocessing or language selection is needed – simply hand it an image and let it handle the rest. In contrast, even the latest version of Tesseract struggles considerably to extract usable text, even under relatively optimal conditions. The experiments above reinforce how critical preprocessing is to Tesseract's accuracy, but also how channel-specific that preprocessing is: Sauvola thresholding yields the best results for some channels, while for others it prevents any text from being recognized at all.
Based on these experiments, one potential solution might be to use Tesseract in an initial pass to identify all of the textual zones in each frame in which text appearance is relatively consistent (similar text and background colors and similar font family, size and style), extract these as separate image files via ImageMagick, then OCR each independently. Alternatively, there are myriad customized Tesseract workflows on the web in which researchers have crafted bespoke thresholding algorithms using ImageMagick, Python scientific imaging libraries and other tools to carefully optimize text extraction for a specific domain. This suggests that with sufficient effort, it might be possible to create bespoke thresholding pipelines for each individual television news channel that sufficiently boost OCR accuracy for that channel. This in turn runs the risk that changes to the channel's layout over time (such as a shift in color schemes or fonts) could actually result in accuracy below the default baseline if the changes ran afoul of the bespoke customizations.
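The zone-based idea above can be sketched in Python under a few assumptions: that the first pass is run with Tesseract's `tsv` output option, that block-level rows (level 2) are a reasonable proxy for visually consistent zones, and that each zone is cut out with ImageMagick's `-crop`. The helper names and the `zone` filename prefix are hypothetical, and we have not evaluated whether this approach actually improves accuracy.

```python
import csv
import io


def block_boxes(tsv_text):
    """Return (left, top, width, height) for each block-level row (level 2)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        if row["level"] == "2":
            boxes.append(
                tuple(int(row[key]) for key in ("left", "top", "width", "height"))
            )
    return boxes


def crop_commands(image_path, boxes, prefix="zone"):
    """Build one ImageMagick convert command per zone (not executed here)."""
    commands = []
    for index, (left, top, width, height) in enumerate(boxes):
        geometry = f"{width}x{height}+{left}+{top}"
        commands.append(
            ["convert", image_path, "-crop", geometry, "+repage",
             f"{prefix}-{index:03d}.png"]
        )
    return commands


if __name__ == "__main__":
    # Tiny hand-written TSV in Tesseract's column layout; values are made up.
    header = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    sample = header + "1\t1\t0\t0\t0\t0\t0\t0\t1280\t720\t-1\t\n" \
                    + "2\t1\t1\t0\t0\t0\t40\t20\t600\t48\t-1\t\n"
    for command in crop_commands("frame.jpg", block_boxes(sample)):
        print(" ".join(command))
```

Each cropped zone could then be passed through its own Tesseract run, potentially with a different thresholding method per zone.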
In the end, Cloud Vision offers human-level OCR accuracy across the range of languages Visual Explorer currently monitors. Further research will be required to explore different kinds of customized preprocessing workflows that could boost Tesseract's accuracy to a more usable level for most channels. Tesseract did perform well on chyron extraction for CSPAN, however, suggesting it could at least be used for chyron extraction at scale for English-language channels.