Yesterday we showed how a single SQL query in BigQuery can be used to segment a television news show by measuring the Levenshtein edit distance between the OCR'd text of the first second of each shot, using BigQuery's persistent UDF capability. Today we're excited to show how BigQuery's UDF capability can offer even more advanced text similarity metrics, following Felipe Hoffa's UDF examples of applying the fuzzball.js library at BigQuery scale!
Yesterday's example computed similarity as the Levenshtein edit distance, which is not ideal, since if the onscreen text is a subset of the previous text (such as "Latest Covid-19 Updates" and "Latest Covid-19 Updates: The Latest Infection Numbers"), the Levenshtein distance will correctly treat them as distinct strings, whereas for the purpose of story segmentation we'd prefer a metric that doesn't penalize substrings.
It turns out that the fuzzball.js library that provides the Levenshtein edit distance function we used yesterday also has a wealth of additional metrics, including "Partial Ratio" that computes the "Highest scoring substring of the longer string vs. the shorter string" and "Token Sort Ratio" defined as "Tokenized, sorted, and then recombined before scoring" and "Token Set Ratio" defined as "Highest of 3 scores comparing the set intersection, intersection + difference 1 to 2, and intersection + difference 2 to 1."
Invoking these additional metrics is as simple as adding an extra bit of UDF code to the top of our query:
CREATE TEMP FUNCTION fuzz_tokensetratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.token_set_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js"); CREATE TEMP FUNCTION fuzz_tokensortratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.token_sort_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js"); CREATE TEMP FUNCTION fuzz_partialratio(a string, b string) RETURNS INT64 LANGUAGE js AS """ return fuzzball.partial_ratio(a,b); """ OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js"); select showOffset, fhoffa.x.levenshtein(LastOCRText, OCRText) Levenshtein, fuzz_tokensetratio(LastOCRText, OCRText) Fuzz_TokenSetRatio, fuzz_tokensortratio(LastOCRText, OCRText) Fuzz_TokenSortRatio, fuzz_partialratio(LastOCRText, OCRText) Fuzz_PartialRatio, LastOCRText, OCRText from ( SELECT showOffset, LAG(OCRText) OVER (PARTITION BY iaShowId order by showOffset asc) LastOCRText, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) = "2020-02-02" and iaShowId='KGO_20200202_000000_ABC_World_News_Tonight_With_David_Muir' and numShotChanges>1 order by date asc )
Each of the four metrics computes similarity slightly differently, offering different views onto the data! A closer look at the results suggests with a bit of tweaking, this could rapidly triage a broadcast into component stories!
You can see the final results in the output spreadsheet.