More TV News Experiments: Applying AI Visual Analysis & OCR To Television News Via Non-Consumptive Thumbnail Sheets

Could thumbnail contact sheets enable non-consumptive visual assessment of television news, such as applying at-scale AI visual analysis and OCR to broadcasts? Let's take the 500-pixel contact sheet for the 1TV broadcast from April 26, 2022 3:30-4:01AM MSK that we created earlier. Skimming through the sheet, we see that thumbnail #88 is a screen capture of an RIA Novosti online news article, offering an ideal test of how feasible it is to recover at least full-screen text like this through OCR.

Thankfully, FFMPEG makes it trivial to decompose one of our contact sheets into a sequence of individual image files using its "untile" filter.

First, we have to convert the JPEG thumbnail sheet into a PNG file, since for unknown reasons FFMPEG will error if we attempt to untile the original JPEG image. We'll use ImageMagick for the conversion, but since these are large images, they will typically exceed its default limits. Thus, you will need to edit ImageMagick's "policy.xml" file on your machine to allow it to open much larger images, changing the "width" and "height" lines to "100KP" or larger:

<policy domain="resource" name="width" value="100KP"/>
<policy domain="resource" name="height" value="100KP"/>

And verify the changes were recorded:

identify -list resource

Now convert to a PNG image:

time convert ./1TV_20220426_003000-fps4-500.jpg ./1TV_20220426_003000-fps4-500.png

Now we can simply run FFMPEG in untiling mode:

mkdir FRAMES
time ffmpeg -r 1 -i ./1TV_20220426_003000-fps4-500.png -vf "untile=6x120" "./FRAMES/1TV_20220426_003000-%6d.png"

Black frames aren't useful, so let's remove them by eliminating images that are less than 1K in size:

find ./FRAMES/ -name "*.png" -type 'f' -size -1000c -delete
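
Because the untiled files are numbered in order, a frame number can be mapped back to its cell in the contact sheet and to an approximate point in the broadcast. The following is a minimal Python sketch of that arithmetic, not part of the original workflow. It assumes the tiles are emitted in row-major order (left to right, then top to bottom), that numbering starts at 1, and that the first thumbnail was sampled at 0 seconds; the helper name locate_frame is made up for illustration.

```python
from datetime import timedelta

COLUMNS = 6            # from the "untile=6x120" layout used above
SECONDS_PER_FRAME = 4  # one thumbnail every 4 seconds


def locate_frame(frame_number, columns=COLUMNS,
                 seconds_per_frame=SECONDS_PER_FRAME):
    """Return (row, column, offset) for a 1-based frame number.

    Assumes row-major tile order and that frame 1 was sampled at 0 seconds.
    Row and column are 1-based; offset is a timedelta from the broadcast start.
    """
    if frame_number < 1:
        raise ValueError("frame numbers start at 1")
    index = frame_number - 1
    row, col = divmod(index, columns)
    offset = timedelta(seconds=index * seconds_per_frame)
    return row + 1, col + 1, offset


row, col, offset = locate_frame(88)
print(f"frame 88: row {row}, column {col}, about {offset} after the start")
```

Under those assumptions, a frame's filename alone is enough to find roughly where in the broadcast it came from, which is useful when reviewing OCR hits later.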

Now take the 88th thumbnail from that directory ("1TV_20220426_003000-000088.png"), download it to your computer and drag-drop it onto Google Cloud Vision's online demo page. You can see everything Cloud Vision sees about the image.

For example, logo detection recognizes the RIA Novosti logo.

Similarly, OCR recognizes the onscreen text and extracts it as:

Материал дополняется
12:42 25.04.2022
ВС России и силы ДНР прекращают
боевые действия рядом с
ВС РФ и силы ДНР с 14:00 мск 25 апреля в
ностороннем порядке прекращают любые боевые
действия рядом с <<Азовсталью».
Для выхода мирных жителей из <<Азовстали>>
препятствий нет, за исключением решения Киева и
в удерживать мирных как «живой щит».

There are noticeable typos, but overall it extracted a reasonable facsimile of the onscreen text, which Google Translate translates as:

The material is complemented
12:42 25.04.2022
Russian Armed Forces and DNR forces stop
fighting near
The Armed Forces of the Russian Federation and the forces of the DPR from 14:00 Moscow time on April 25 to
sideways stop any combat
actions near "Azovstal".
For the exit of civilians from <<Azovstal>>
there are no obstacles, except for the decision of Kyiv and
to hold the civilians as a "human shield".
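
The demo page is convenient for a single frame, but processing every thumbnail would mean calling the Cloud Vision API directly. As a rough sketch (not the workflow used here), the Python below builds the JSON body for the REST "images:annotate" method and pulls the full OCR text out of a response. The helper names are hypothetical, authentication and the HTTP call itself are deliberately left out, and the placeholder bytes stand in for the contents of a real frame file.

```python
import base64

# Feature type names accepted by the Cloud Vision REST method images:annotate.
FEATURES = ("TEXT_DETECTION", "LOGO_DETECTION", "FACE_DETECTION")


def build_annotate_request(image_bytes, features=FEATURES):
    """Build a JSON-serializable request body for a single image."""
    return {
        "requests": [
            {
                "image": {
                    "content": base64.b64encode(image_bytes).decode("ascii")
                },
                "features": [{"type": name} for name in features],
            }
        ]
    }


def extract_full_text(response):
    """Return the full OCR text from a response dict, or "" if none.

    For TEXT_DETECTION, the first textAnnotations entry holds the whole
    recognized text; later entries are individual words.
    """
    responses = response.get("responses") or [{}]
    annotations = responses[0].get("textAnnotations", [])
    return annotations[0]["description"] if annotations else ""


# Placeholder bytes; in practice these would be read from a frame PNG.
body = build_annotate_request(b"placeholder image bytes")
print([f["type"] for f in body["requests"][0]["features"]])
```

The resulting body would be POSTed to https://vision.googleapis.com/v1/images:annotate with suitable credentials, and the parsed JSON reply passed to extract_full_text.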

What about a chyron example? Frame #158 contains a mixed chyron: a top two-level text container with a transparent background and a second two-level text block on a semi-transparent background.

Once again, we download it and run it through the Google Cloud Vision online demo.

It recognizes that there is a human face at 94% confidence, estimating it to have a 1 degree roll, -2 degree tilt and -22 degree pan.

It recognizes that the image contains a person wearing glasses.

It recognizes the faint 1TV watermarked logo at top right.

It recognizes the onscreen text as well.

It extracts the text as:


Which Google Translate translates as:


Scanning for chyron text that appears while a person is visible onscreen could be one option to catalog onscreen interviewees.
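
One way such a scan might look is sketched below in Python. It is purely illustrative: FrameResult is a hypothetical, simplified per-frame summary rather than Cloud Vision's actual response format, the sample records are invented, and the idea that chyron text sits in the lower part of the frame (here, below 60% of its height) is an assumption that would need tuning per channel.

```python
from dataclasses import dataclass, field


@dataclass
class FrameResult:
    """Hypothetical, simplified summary of what was detected in one frame."""
    frame: int
    face_count: int = 0
    # Each block is (text, top_y, bottom_y); y is a fraction of frame
    # height, with 0.0 at the top edge and 1.0 at the bottom edge.
    text_blocks: list = field(default_factory=list)


def chyron_candidates(results, lower_band_start=0.6):
    """Yield (frame, text) where a face is visible and text sits low in frame."""
    for result in results:
        if result.face_count < 1:
            continue  # no person onscreen, so not an interview candidate
        lower = [text for (text, top, _bottom) in result.text_blocks
                 if top >= lower_band_start]
        if lower:
            yield result.frame, " ".join(lower)


# Invented sample data, for illustration only.
sample = [
    FrameResult(frame=1, face_count=0, text_blocks=[("HEADLINE", 0.20, 0.30)]),
    FrameResult(frame=2, face_count=1,
                text_blocks=[("EXAMPLE NAME", 0.70, 0.78),
                             ("EXAMPLE TITLE", 0.80, 0.88)]),
    FrameResult(frame=3, face_count=1, text_blocks=[]),
]
for frame, text in chyron_candidates(sample):
    print(frame, text)
```

Frames with text but no face (such as a full-screen article) and frames with a face but no low-placed text are skipped, leaving only the likely name-and-title captions.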

Thus, from a simple non-consumptive thumbnail contact sheet, we can perform fairly advanced AI visual analysis and OCR text recovery!

While far less accurate than native video OCR and suffering from all of the compression artifacts of starting from the original compressed MPEG video, converting its frames to JPEG and then converting that JPEG to PNG, the end result is quite serviceable. Importantly, since it involves simple static image OCR, its computational demands are just a small fraction of those of full-resolution video OCR, and there is a far wider range of off-the-shelf OCR tools available for images than for video.

A half-hour broadcast yields 1,800 seconds of airtime and, since we sample it at one frame every 4 seconds, there are 450 resulting thumbnails. Using a state-of-the-art image OCR tool like Cloud Vision that charges $1.50 per 1,000 images, a half-hour broadcast using this workflow would cost roughly $0.68, versus $4.50 for native video OCR.
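
The arithmetic behind that comparison can be checked with a few lines of Python. This is only a sketch using the figures quoted above: the per-minute video rate is not a quoted price but is back-derived here from the $4.50 figure for 30 minutes, and volume discounts or free tiers are ignored. At the stated rate, 450 images come to $0.675.

```python
BROADCAST_SECONDS = 30 * 60      # half-hour broadcast
SECONDS_PER_FRAME = 4            # one thumbnail every 4 seconds
IMAGE_PRICE_PER_1000 = 1.50      # USD per 1,000 images, as quoted above
# Assumption: implied per-minute rate, derived from $4.50 for 30 minutes.
VIDEO_PRICE_PER_MINUTE = 4.50 / 30


def thumbnail_count(seconds=BROADCAST_SECONDS,
                    seconds_per_frame=SECONDS_PER_FRAME):
    """Number of thumbnails sampled from a broadcast of the given length."""
    return seconds // seconds_per_frame


def image_ocr_cost(frames, price_per_1000=IMAGE_PRICE_PER_1000):
    """Cost in USD of running image OCR on the given number of frames."""
    return frames * price_per_1000 / 1000


def video_ocr_cost(seconds=BROADCAST_SECONDS,
                   price_per_minute=VIDEO_PRICE_PER_MINUTE):
    """Cost in USD of native video OCR at the implied per-minute rate."""
    return seconds / 60 * price_per_minute


frames = thumbnail_count()
print(f"{frames} thumbnails, image OCR ${image_ocr_cost(frames):.3f}, "
      f"video OCR ${video_ocr_cost():.2f}")
```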