Global Similarity Graph Document Embeddings & BigQuery UDFs: Semantic Multilingual Search Over The News

The new Global Similarity Graph Document Embeddings dataset uses the Universal Sentence Encoder V4 to compute document-level embeddings for each news article we monitor in realtime across 65 languages using machine translation. Since each document is represented as an immutable 512-dimension vector, we can search it semantically by converting a natural language query into the same embedding space and then comparing its cosine similarity against every article's embedding to identify the most similar coverage. Of course, a production search application would not perform such a brute-force search at query time. It would preindex content using locality hashing or other approaches, but for the purposes of demonstration, this brute-force approach yields a gold standard result set.
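To make the brute-force idea concrete, here is a minimal NumPy sketch of the same computation on made-up data. The random "documents" and the helper name cosine_top_k are purely illustrative assumptions and are not part of the dataset or its tooling.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return (row index, cosine similarity) pairs for the k closest rows."""
    # Scale everything to unit length so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                       # one similarity score per document
    order = np.argsort(-sims)[:k]      # indices of the highest scores first
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 512))              # stand-in for article embeddings
query = docs[42] + 0.1 * rng.normal(size=512)    # a query lying close to document 42
top = cosine_top_k(query, docs, k=3)
print(top)
```

Every document is scored against the query, which is exactly why this approach is accurate but slow at scale.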

First, we have to convert our query into the USE vector space. Create a new free Colab notebook and run the following code:

#load libraries...
!pip install tensorflow_text
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text as text # registers custom TF text ops (needed by some TF Hub text models such as universal-sentence-encoder-cmlm/multilingual-preprocess)
import numpy as np

#normalize each embedding to unit L2 length...
def normalization(embeds):
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds/norms

sent = tf.constant(["vaccine blood clots"])
embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sente = embed_use(sent)
sente = normalization(sente)
print(repr(sente))

This loads the necessary libraries, converts our query "vaccine blood clots" into its USE vector representation and prints the resulting 512-dimension array:

<tf.Tensor: shape=(1, 512), dtype=float32, numpy= array([[ 0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419, -0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046, 0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452, 0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352, 0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737, 0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548, -0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502, -0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146, 0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787, 0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435, -0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488, -0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581, -0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826, -0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358, -0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587, 0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555, 0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085, -0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955, -0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 , -0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063, 0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686, 0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078, 0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833, -0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118, -0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086, 0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494, 0.02617082, 0.07789769, 0.05098371, -0.02491225, -0.06794736, 0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955, 0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221, 0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439, 0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572, 
-0.05767199, -0.07086307, -0.04779347, -0.00144571, 0.06937918, -0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683, -0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927, 0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977, -0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044, -0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998, -0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786, -0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803, 0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675, -0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969, -0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011, -0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472, -0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657, -0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516, -0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309, -0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306, 0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944, -0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216, 0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648, 0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209, 0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291, -0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886, -0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947, 0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142, 0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847, 0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025, -0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896, 0.07650524, -0.01776609, 0.06789573, -0.00531551, -0.04188078, 0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784, -0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424, 0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767, 0.06983862, 0.03068741, -0.04121724, -0.06307439, 
0.06602945, 0.05159619, 0.00031473, 0.06603036, 0.05710867, -0.07667553, 0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 , 0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534, -0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663, 0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209, -0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978, -0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765, 0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077, 0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717, 0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889, -0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776, -0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851, -0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239, 0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085, 0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712, -0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286, 0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124, -0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231, 0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642, 0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503, 0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376, 0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001, -0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575, -0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247, 0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662, 0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259, -0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643, 0.06099452, -0.02337616, -0.01288495, -0.06582094, -0.01495991, -0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612, -0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 , -0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391, -0.0467257 , 0.01790263, 0.05152426, 0.05766447, 
0.01373304, -0.04750573, -0.05672399, 0.0641381 , -0.04198284, -0.01444486, 0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 , -0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976, 0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729, -0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696, 0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153, -0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713, -0.05306999, -0.02551358]], dtype=float32)>
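Copying the printed tensor into SQL by hand is error-prone. As a convenience, a small helper along the following lines could format the vector as an array literal ready to paste into a BigQuery query. The name to_bigquery_array and the eight-decimal formatting are our own illustrative choices, not part of any library.

```python
import numpy as np

def to_bigquery_array(vec, name="docembed"):
    """Format a vector as a SQL array literal such as '[0.1, -0.2] AS docembed'."""
    # Flatten so that a (1, 512) embedding and a (512,) embedding both work.
    values = ", ".join(f"{float(x):.8f}" for x in np.asarray(vec).ravel())
    return f"[{values}] AS {name}"

# Tiny stand-in for the real 512-dimension embedding.
example = np.array([[0.5, -0.25, 0.125]], dtype=np.float32)
literal = to_bigquery_array(example)
print(literal)
```

In the Colab notebook above, calling to_bigquery_array(sente.numpy()) should produce the full 512-value literal for the query.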

Now we need to compute the cosine similarity of this vector against all of the articles in the Global Similarity Graph Document Embeddings dataset. First, we'll modify this pure-SQL example from the Google Cloud Architecture Center (to search for a different query, just run the code above in Colab, replacing the "vaccine blood clots" string with your own query, and then replace the vector below with the results):

WITH data AS (
select [0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419,
-0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046,
0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452,
0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352,
0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737,
0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548,
-0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502,
-0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146,
0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787,
0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435,
-0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488,
-0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581,
-0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826,
-0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358,
-0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587,
0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555,
0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085,
-0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955,
-0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 ,
-0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063,
0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686,
0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078,
0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833,
-0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118,
-0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086,
0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494,
0.02617082, 0.07789769, 0.05098371, -0.02491225, -0.06794736,
0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955,
0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221,
0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439,
0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572,
-0.05767199, -0.07086307, -0.04779347, -0.00144571, 0.06937918,
-0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683,
-0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927,
0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977,
-0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044,
-0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998,
-0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786,
-0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803,
0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675,
-0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969,
-0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011,
-0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472,
-0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657,
-0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516,
-0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309,
-0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306,
0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944,
-0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216,
0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648,
0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209,
0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291,
-0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886,
-0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947,
0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142,
0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847,
0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025,
-0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896,
0.07650524, -0.01776609, 0.06789573, -0.00531551, -0.04188078,
0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784,
-0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424,
0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767,
0.06983862, 0.03068741, -0.04121724, -0.06307439, 0.06602945,
0.05159619, 0.00031473, 0.06603036, 0.05710867, -0.07667553,
0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 ,
0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534,
-0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663,
0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209,
-0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978,
-0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765,
0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077,
0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717,
0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889,
-0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776,
-0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851,
-0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239,
0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085,
0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712,
-0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286,
0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124,
-0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231,
0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642,
0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503,
0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376,
0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001,
-0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575,
-0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247,
0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662,
0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259,
-0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643,
0.06099452, -0.02337616, -0.01288495, -0.06582094, -0.01495991,
-0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612,
-0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 ,
-0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391,
-0.0467257 , 0.01790263, 0.05152426, 0.05766447, 0.01373304,
-0.04750573, -0.05672399, 0.0641381 , -0.04198284, -0.01444486,
0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 ,
-0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976,
0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729,
-0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696,
0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153,
-0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713,
-0.05306999, -0.02551358] as docembed
)
SELECT
c.k2 as match_title,
SUM(vv1*vv2) / (SQRT(SUM(POW(vv1,2))) * SQRT(SUM(POW(vv2,2)))) AS similarity,
ANY_VALUE(c.u2) as match_url
FROM
(
SELECT
a.key k1, a.val v1, b.key k2, b.val v2, a.url u1, b.url u2
FROM
(
SELECT '' key, 'query' url, docembed val FROM data limit 1
) a
CROSS JOIN
(
SELECT title key, url url, docembed val FROM `gdelt-bq.gdeltv2.gsg_docembed` WHERE DATE(date) = "2021-07-30" 
) b
) c
, UNNEST(c.v1) vv1 with offset ind1 JOIN UNNEST(c.v2) vv2 with offset ind2 ON (ind1=ind2)
GROUP BY c.k1, c.k2
ORDER BY similarity DESC
LIMIT 100

This query takes 21 minutes to complete and yields the following results:

 

Row | match_title | similarity | match_url
1 | Risk of blood clots in Pfizer COVID-19 vaccine as likely as AstraZeneca jab: Study | 0.5096007905750278 | https://freerepublic.com/focus/f-bloggers/3980674/posts
2 | Manitoba sends back over 5,000 AstraZeneca vaccines, slowing supersites – Classic107: Winnipeg's only dedicated classical and jazz radio station. | 0.4431825337510251 | https://classic107.com/articles/manitoba-sends-back-over-5000-astrazeneca-vaccines-slowing-supersites
3 | Manitoba sends back over 5,000 AstraZeneca vaccines, slowing supersites – CHVNRadio: Southern Manitoba's hub for local and Christian news, and adult contemporary Christian programming. | 0.4431825337510251 | https://www.chvnradio.com/articles/manitoba-sends-back-over-5000-astrazeneca-vaccines-slowing-supersites
4 | الصحة اليابانية توافق على الاستخدام المحلي "استرازينيكا" البريطاني | 0.39529954016380403 | https://www.elbalad.news/4907638
5 | Warning issued over vaccine appointment scam | 0.3915273355127886 | https://www.rte.ie/news/coronavirus/2021/0730/1238296-vaccine-scam/
6 | Torrington Area Health District outreach staff urge parents to update immunizations | 0.3866629025599832 | https://www.registercitizen.com/news/article/Torrington-Area-Health-District-outreach-staff-16353387.php
7 | HPV Vaccination and Cancer Prevention | 0.3840681933710871 | https://www.cancer.org/healthy/hpv-vaccine.html
8 | Manitoba sends 5,500 doses of AstraZeneca-Oxford vaccine back to Ottawa | 0.3788478573936616 | https://www.cbc.ca/news/canada/manitoba/astra-zeneca-manitoba-returned-covid-19-1.6124203
9 | Statystyki szczepień Covid-19 w Polsce 30.07.2021 | 0.37761767354597 | https://dziennikbaltycki.pl/statystyki-szczepien-covid-19-w-polsce-30072021/ar/c14p1-21791559
10 | Szczepienia w Krakowie 30.07.2021. Ile jest zaszczepionych osób przeciwko koronawirusowi? | 0.3772241320142252 | https://krakow.naszemiasto.pl/szczepienia-w-krakowie-30072021-ile-jest-zaszczepionych-osob-przeciwko-koronawirusowi/ar/c14p1-21786221
11 | Szczepienia przeciwko koronawirusowi w Olsztynie 30.07.2021 | 0.37299377586439314 | https://olsztyn.naszemiasto.pl/szczepienia-przeciwko-koronawirusowi-w-olsztynie-30072021/ar/c14p1-21786279
12 | #EndorseThis: Watch Former Anti-Vaxxers Who Survived COVID Plead For Sanity | 0.3698139804191956 | https://www.nationalmemo.com/anti-vaxxer-regret
13 | Roscommon Herald — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://roscommonherald.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
14 | Carlow Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://carlow-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
15 | Laois Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://laois-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
16 | Waterford News and Star — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://waterford-news.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
17 | Kildare Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://kildare-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
18 | Вакцина "Спутник Лайт" поступила в 109 прививочных пунктов в Петербурге | 0.36977288199435243 | https://www.dp.ru/a/2021/07/30/Vakcina_Sputnik_Lajt_po?hash=775837
19 | Szczepienia w Warszawie 30.07.2021. Ile jest zaszczepionych osób przeciwko koronawirusowi? | 0.36926685371966017 | https://warszawa.naszemiasto.pl/szczepienia-w-warszawie-30072021-ile-jest-zaszczepionych-osob-przeciwko-koronawirusowi/ar/c14p1-21786219
20 | Szczepienia we Wrocławiu 30.07.2021. Jak wygląda sytuacja ze szczepieniami przeciwko koronawirusowi w Twoim powiecie? | 0.3658809090640407 | https://wroclaw.naszemiasto.pl/szczepienia-we-wroclawiu-30072021-jak-wyglada-sytuacja-ze-szczepieniami-przeciwko-koronawirusowi-w-twoim-powiecie/ar/c14p1-21786227

Note that several of the results above are in languages other than English, reflecting the potency of combining machine translation with monolingual document-level embeddings. Note in particular that some of the results only mention blood clots later in the text, rather than in the lead paragraph, reflecting the importance of document-level embeddings over traditional "lead+last" paragraph embeddings.

The low similarity scores reflect the fact that on this particular day (July 30, 2021) there were few articles about blood clots, so articles that discuss vaccination without mentioning blood clots are returned as well. These would typically be filtered out by thresholding the similarity scores, but we've left them in for this example.
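As a rough sketch of what such thresholding might look like once the results are back in Python, consider the snippet here. The cutoff of 0.45 is a hypothetical value chosen only for illustration; a suitable threshold would have to be tuned for each application.

```python
def filter_by_similarity(rows, threshold):
    """Keep only the result rows whose similarity meets the threshold."""
    return [row for row in rows if row["similarity"] >= threshold]

# Three rows taken from the result list above.
rows = [
    {"title": "Risk of blood clots in Pfizer COVID-19 vaccine as likely as AstraZeneca jab: Study",
     "similarity": 0.5096007905750278},
    {"title": "Warning issued over vaccine appointment scam",
     "similarity": 0.3915273355127886},
    {"title": "HPV Vaccination and Cancer Prevention",
     "similarity": 0.3840681933710871},
]

# 0.45 is an assumed cutoff, not a recommended value.
kept = filter_by_similarity(rows, threshold=0.45)
for row in kept:
    print(f'{row["similarity"]:.4f}  {row["title"]}')
```

With this assumed cutoff only the article that is actually about blood clots survives.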

The GSG Document Embeddings dataset was launched late in the day on July 30th, so there are only 111,619 documents on that particular day. Despite this low number of documents, the query above takes 21 minutes to return. This is because it flattens the document embeddings to be able to process them in native SQL. Could we speed this up by keeping them as native arrays?
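To get a feel for the cost of flattening, a quick back-of-the-envelope calculation helps. It assumes that unnesting yields one row per embedding dimension per document, which is a simplification of how the query engine actually plans the work.

```python
num_docs = 111_619   # documents available for July 30, 2021
dims = 512           # dimensions per embedding

# Assumption: flattening produces one (document, dimension) row per array element.
flattened_rows = num_docs * dims
print(f"{flattened_rows:,} element pairs to join and aggregate")
```

That is roughly 57 million element pairs that must be joined on their offsets and grouped back together, just to score a single query.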

BigQuery supports User Defined Functions written in JavaScript, which would allow us to retain our embeddings as arrays, drastically reducing the pressure on the Join stage of the query. The resulting SQL becomes vastly simpler as well:

CREATE TEMPORARY FUNCTION cossim(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS '''
  var sumt=0, suma=0, sumb=0;
  // accumulate the dot product and the squared norms of both vectors
  for(var i=0;i<a.length;i++) {
    sumt += (a[i]*b[i]);
    suma += (a[i]*a[i]);
    sumb += (b[i]*b[i]);
  }
  suma = Math.sqrt(suma);
  sumb = Math.sqrt(sumb);
  return sumt/(suma*sumb);
''';

WITH query AS (
select [0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419,
-0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046,
0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452,
0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352,
0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737,
0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548,
-0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502,
-0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146,
0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787,
0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435,
-0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488,
-0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581,
-0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826,
-0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358,
-0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587,
0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555,
0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085,
-0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955,
-0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 ,
-0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063,
0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686,
0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078,
0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833,
-0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118,
-0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086,
0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494,
0.02617082, 0.07789769, 0.05098371, -0.02491225, -0.06794736,
0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955,
0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221,
0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439,
0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572,
-0.05767199, -0.07086307, -0.04779347, -0.00144571, 0.06937918,
-0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683,
-0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927,
0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977,
-0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044,
-0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998,
-0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786,
-0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803,
0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675,
-0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969,
-0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011,
-0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472,
-0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657,
-0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516,
-0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309,
-0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306,
0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944,
-0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216,
0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648,
0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209,
0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291,
-0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886,
-0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947,
0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142,
0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847,
0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025,
-0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896,
0.07650524, -0.01776609, 0.06789573, -0.00531551, -0.04188078,
0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784,
-0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424,
0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767,
0.06983862, 0.03068741, -0.04121724, -0.06307439, 0.06602945,
0.05159619, 0.00031473, 0.06603036, 0.05710867, -0.07667553,
0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 ,
0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534,
-0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663,
0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209,
-0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978,
-0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765,
0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077,
0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717,
0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889,
-0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776,
-0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851,
-0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239,
0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085,
0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712,
-0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286,
0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124,
-0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231,
0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642,
0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503,
0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376,
0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001,
-0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575,
-0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247,
0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662,
0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259,
-0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643,
0.06099452, -0.02337616, -0.01288495, -0.06582094, -0.01495991,
-0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612,
-0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 ,
-0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391,
-0.0467257 , 0.01790263, 0.05152426, 0.05766447, 0.01373304,
-0.04750573, -0.05672399, 0.0641381 , -0.04198284, -0.01444486,
0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 ,
-0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976,
0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729,
-0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696,
0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153,
-0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713,
-0.05306999, -0.02551358] as docembed
)
SELECT cossim(doc.docembed, query.docembed) sim, title, url FROM `gdelt-bq.gdeltv2.gsg_docembed` doc, query WHERE DATE(date) = "2021-07-30" order by sim desc limit 100

Here, we simply define our cosine similarity computation as a UDF and then invoke it via a trivial one-line SQL statement (the bulk of the SQL above is simply the copy-pasted query vector embedding). This query takes just 46 seconds to complete! Comparing the results, the two queries return identical result lists.
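For readers who want to sanity-check the UDF's logic outside of BigQuery, here is a line-for-line Python port of the same loop, compared against a vectorized NumPy computation on random test vectors. The test data is invented for illustration.

```python
import math
import numpy as np

def cossim(a, b):
    """Python port of the UDF: dot product divided by the product of the norms."""
    sumt = suma = sumb = 0.0
    for x, y in zip(a, b):
        sumt += x * y   # running dot product
        suma += x * x   # running squared norm of a
        sumb += y * y   # running squared norm of b
    return sumt / (math.sqrt(suma) * math.sqrt(sumb))

rng = np.random.default_rng(1)
a = rng.normal(size=512)
b = rng.normal(size=512)

loop_result = cossim(a, b)
vector_result = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(loop_result, vector_result)
```

The two values should agree to within floating point rounding, which is a quick way to confirm that the loop computes the standard cosine similarity.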

Why is the UDF query so much faster? The execution details for the two queries shed some light. The native SQL query takes 20 minutes 57 seconds to complete, consumes 1 hour 45 minutes of slot time and shuffles 17.38MB, and during the Join stage its workers take 689 seconds on average and 1,891 seconds at most. In contrast, the UDF-based query takes just 45.7 seconds to complete, consumes just 1 minute 51 seconds of slot time and shuffles just 39KB, and during the Join stage its workers take 29 seconds on average and 82 seconds at most.
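Putting those figures side by side makes the gap easier to see. The short script here simply divides the native SQL timings by the UDF timings reported above; the dictionary keys are our own labels.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Timings as reported in the execution details of each query.
sql = {
    "elapsed": to_seconds(minutes=20, seconds=57),
    "slot_time": to_seconds(hours=1, minutes=45),
    "join_avg": 689,
    "join_max": 1891,
}
udf = {
    "elapsed": 45.7,
    "slot_time": to_seconds(minutes=1, seconds=51),
    "join_avg": 29,
    "join_max": 82,
}

speedups = {name: sql[name] / udf[name] for name in sql}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The UDF version comes out roughly 27 times faster in elapsed time and uses roughly 57 times less slot time, with the Join stage itself running about 23 to 24 times faster.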

Of course, 46-second latency is still far too slow for production querying, so a real-world application would use locality hashing or a standalone indexing platform like Vertex Matching Engine, but these queries allow you to see how semantic natural language querying works using embeddings!
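For the curious, here is a toy sketch of one common form of locality hashing, random hyperplane hashing, run on made-up vectors. It is an assumption on our part that a production system would look anything like this; the function names, the 8-bit signature and the random data are all illustrative.

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(doc_matrix, num_bits=8, seed=0):
    """Bucket documents by which side of each random hyperplane they fall on."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_bits, doc_matrix.shape[1]))
    signs = (doc_matrix @ planes.T) > 0          # shape: (num_docs, num_bits)
    buckets = defaultdict(list)
    for doc_id, bits in enumerate(signs):
        buckets[bits.tobytes()].append(doc_id)   # the sign pattern is the bucket key
    return planes, buckets

def lsh_candidates(query_vec, planes, buckets):
    """Return the documents that share the query's bucket."""
    key = ((query_vec @ planes.T) > 0).tobytes()
    return buckets.get(key, [])

rng = np.random.default_rng(2)
docs = rng.normal(size=(1000, 512))   # stand-in for article embeddings
query = docs[7].copy()                # a query identical to document 7

planes, buckets = build_lsh_index(docs)
candidates = lsh_candidates(query, planes, buckets)
print(len(candidates), "candidates instead of", len(docs))
```

Only the documents in the query's bucket would then need an exact cosine similarity check, trading a little recall for a large reduction in work.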