Advanced OCR Similarity Metrics Using BigQuery + Video AI To Segment Television Evening News

Yesterday we showed how a single SQL query in BigQuery can be used to segment a television news show by measuring the Levenshtein edit distance between the OCR'd text of the first second of each shot, using BigQuery's persistent UDF capability. Today we're excited to show how BigQuery's UDF capability can offer even more advanced text similarity metrics, following Felipe Hoffa's UDF examples of applying the fuzzball.js library at BigQuery scale!

To recap, BigQuery's UDFs allow you to load arbitrary JavaScript libraries into BigQuery and invoke them as if they were a built-in function, applying them at BigQuery scale across your data with relatively few limitations. Returning to yesterday's television news example, our goal is to split a television news broadcast into camera "shots" using the Video AI API and then take all of the OCR'd onscreen text in the first second of each shot and compare it to that of the previous shot, measuring the textual similarity in order to estimate whether each new shot represents a continuation of the previous story or the start of a new story.

Yesterday's example computed similarity as the Levenshtein edit distance, which is not ideal, since if the onscreen text is a subset of the previous text (such as "Latest Covid-19 Updates" and "Latest Covid-19 Updates: The Latest Infection Numbers"), the Levenshtein distance will correctly treat them as distinct strings, whereas for the purpose of story segmentation we'd prefer a metric that doesn't penalize substrings.

It turns out that the fuzzball.js library that provides the Levenshtein edit distance function we used yesterday also has a wealth of additional metrics, including "Partial Ratio" that computes the "Highest scoring substring of the longer string vs. the shorter string" and "Token Sort Ratio" defined as "Tokenized, sorted, and then recombined before scoring" and "Token Set Ratio" defined as "Highest of 3 scores comparing the set intersection, intersection + difference 1 to 2, and intersection + difference 2 to 1."

Invoking these additional metrics is as simple as adding an extra bit of UDF code to the top of our query:

CREATE TEMP FUNCTION fuzz_tokensetratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.token_set_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");
CREATE TEMP FUNCTION fuzz_tokensortratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.token_sort_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");
CREATE TEMP FUNCTION fuzz_partialratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.partial_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");

select showOffset,
fhoffa.x.levenshtein(LastOCRText, OCRText) Levenshtein,
fuzz_tokensetratio(LastOCRText, OCRText) Fuzz_TokenSetRatio,
fuzz_tokensortratio(LastOCRText, OCRText) Fuzz_TokenSortRatio,
fuzz_partialratio(LastOCRText, OCRText) Fuzz_PartialRatio,
from (
  SELECT showOffset, LAG(OCRText) OVER (PARTITION BY iaShowId order by showOffset asc) LastOCRText, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) = "2020-02-02" and iaShowId='KGO_20200202_000000_ABC_World_News_Tonight_With_David_Muir' and numShotChanges>1 order by date asc 

Each of the four metrics computes similarity slightly differently, offering different views onto the data! A closer look at the results suggests with a bit of tweaking, this could rapidly triage a broadcast into component stories!

You can see the final results in the output spreadsheet.