The new Global Similarity Graph Document Embeddings dataset uses the Universal Sentence Encoder V4 to compute document-level embeddings for each news article we monitor in realtime across 65 languages using machine translation. Since each document is represented as an immutable 512-dimension vector, we can search the dataset semantically by translating a natural language human query into the same embedding space and then comparing its cosine similarity against every article's embedding to identify the most similar coverage. Of course, a production search application would not perform such a brute-force search at query time – it would preindex content using locality-sensitive hashing or other approaches – but for the purposes of demonstration, this brute-force approach yields a gold standard result set.
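To make the brute-force idea concrete before touching the real dataset, here is a minimal NumPy sketch of the comparison described above. The random vectors are only stand-ins for real document embeddings, and the function name `top_k_cosine` is our own for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_matrix, dtype=np.float64)
    # Normalize both sides so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    # Brute force: every document is scored, then the best k are kept.
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy stand-in data (random vectors, NOT real GSG embeddings).
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 512))
query = docs[42] + 0.1 * rng.normal(size=512)  # a slightly perturbed copy of document 42

idx, sims = top_k_cosine(query, docs, k=3)
print(idx, sims)
```

Because the query is a lightly perturbed copy of document 42, that document should come back as the top hit, which is a quick sanity check that the scoring behaves as expected.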
First, we have to convert our query into the USE vector space. Create a free new Colab notebook and run the following code:
    # Load libraries
    import tensorflow_hub as hub
    import tensorflow as tf
    !pip install tensorflow_text
    import tensorflow_text as text  # Needed for loading universal-sentence-encoder-cmlm/multilingual-preprocess
    import numpy as np

    # Normalize each embedding to unit length
    def normalization(embeds):
        norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
        return embeds / norms

    # Embed the query with USE v4 and print the normalized vector
    sent = tf.constant(["vaccine blood clots"])
    embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    sente = embed_use(sent)
    sente = normalization(sente)
    print(repr(sente))
It loads the necessary libraries and converts our query "vaccine blood clots" into the USE vector representation, printing a 512-dimension array:
<tf.Tensor: shape=(1, 512), dtype=float32, numpy= array([[ 0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419, -0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046, 0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452, 0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352, 0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737, 0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548, -0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502, -0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146, 0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787, 0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435, -0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488, -0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581, -0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826, -0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358, -0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587, 0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555, 0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085, -0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955, -0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 , -0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063, 0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686, 0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078, 0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833, -0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118, -0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086, 0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494, 0.02617082, 0.07789769, 0.05098371, -0.02491225, -0.06794736, 0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955, 0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221, 0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439, 0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572, 
-0.05767199, -0.07086307, -0.04779347, -0.00144571, 0.06937918, -0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683, -0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927, 0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977, -0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044, -0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998, -0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786, -0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803, 0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675, -0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969, -0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011, -0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472, -0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657, -0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516, -0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309, -0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306, 0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944, -0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216, 0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648, 0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209, 0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291, -0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886, -0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947, 0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142, 0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847, 0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025, -0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896, 0.07650524, -0.01776609, 0.06789573, -0.00531551, -0.04188078, 0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784, -0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424, 0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767, 0.06983862, 0.03068741, -0.04121724, -0.06307439, 
0.06602945, 0.05159619, 0.00031473, 0.06603036, 0.05710867, -0.07667553, 0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 , 0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534, -0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663, 0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209, -0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978, -0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765, 0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077, 0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717, 0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889, -0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776, -0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851, -0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239, 0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085, 0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712, -0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286, 0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124, -0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231, 0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642, 0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503, 0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376, 0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001, -0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575, -0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247, 0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662, 0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259, -0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643, 0.06099452, -0.02337616, -0.01288495, -0.06582094, -0.01495991, -0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612, -0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 , -0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391, -0.0467257 , 0.01790263, 0.05152426, 0.05766447, 
0.01373304, -0.04750573, -0.05672399, 0.0641381 , -0.04198284, -0.01444486, 0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 , -0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976, 0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729, -0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696, 0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153, -0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713, -0.05306999, -0.02551358]], dtype=float32)>
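The printed tensor cannot be pasted into SQL as-is, since it carries the `tf.Tensor` wrapper and nested brackets. A small helper like the one below can flatten it into a single bracketed list. This is only a sketch: the function name `to_sql_array_literal` and the 8-decimal precision are our own choices, and in the notebook above you would pass it the `sente` result.

```python
import numpy as np

def to_sql_array_literal(embedding, precision=8):
    """Format a (1, N) or (N,) embedding as a bracketed list such as [0.1, -0.2, ...]."""
    flat = np.asarray(embedding, dtype=np.float64).reshape(-1)
    return "[" + ", ".join(f"{v:.{precision}f}" for v in flat) + "]"

# Three values from the start of the vector above, in the same (1, N) shape.
example = np.array([[0.02818347, 0.02490096, -0.03208419]])
print(to_sql_array_literal(example))
```

The resulting string can then be substituted for the array literal in a query.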
Now we need to compute the cosine similarity of this vector against all of the articles in the Global Similarity Graph Document Embeddings dataset. First, we'll modify this pure-SQL example from the Google Cloud Architecture Center (to search for a different query, just run the code above in Colab, replacing the "vaccine blood clots" string with your own query, and then replace the vector below with the results):
WITH data AS ( select [0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419, -0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046, 0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452, 0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352, 0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737, 0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548, -0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502, -0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146, 0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787, 0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435, -0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488, -0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581, -0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826, -0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358, -0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587, 0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555, 0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085, -0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955, -0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 , -0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063, 0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686, 0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078, 0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833, -0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118, -0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086, 0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494, 0.02617082, 0.07789769, 0.05098371, -0.02491225, -0.06794736, 0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955, 0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221, 0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439, 0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572, -0.05767199, -0.07086307, 
-0.04779347, -0.00144571, 0.06937918, -0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683, -0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927, 0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977, -0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044, -0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998, -0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786, -0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803, 0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675, -0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969, -0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011, -0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472, -0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657, -0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516, -0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309, -0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306, 0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944, -0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216, 0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648, 0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209, 0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291, -0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886, -0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947, 0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142, 0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847, 0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025, -0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896, 0.07650524, -0.01776609, 0.06789573, -0.00531551, -0.04188078, 0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784, -0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424, 0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767, 0.06983862, 0.03068741, -0.04121724, -0.06307439, 0.06602945, 0.05159619, 
0.00031473, 0.06603036, 0.05710867, -0.07667553, 0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 , 0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534, -0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663, 0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209, -0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978, -0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765, 0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077, 0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717, 0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889, -0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776, -0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851, -0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239, 0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085, 0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712, -0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286, 0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124, -0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231, 0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642, 0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503, 0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376, 0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001, -0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575, -0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247, 0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662, 0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259, -0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643, 0.06099452, -0.02337616, -0.01288495, -0.06582094, -0.01495991, -0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612, -0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 , -0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391, -0.0467257 , 0.01790263, 0.05152426, 0.05766447, 0.01373304, -0.04750573, 
-0.05672399, 0.0641381 , -0.04198284, -0.01444486, 0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 , -0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976, 0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729, -0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696, 0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153, -0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713, -0.05306999, -0.02551358] as docembed
)
SELECT
  c.k2 AS match_title,
  SUM(vv1*vv2) / (SQRT(SUM(POW(vv1,2))) * SQRT(SUM(POW(vv2,2)))) AS similarity,
  ANY_VALUE(c.u2) AS match_url
FROM (
  SELECT a.key k1, a.val v1, b.key k2, b.val v2, a.url u1, b.url u2
  FROM (
    SELECT '' key, 'query' url, docembed val FROM data LIMIT 1
  ) a
  CROSS JOIN (
    SELECT title key, url url, docembed val
    FROM `gdelt-bq.gdeltv2.gsg_docembed`
    WHERE DATE(date) = "2021-07-30"
  ) b
) c,
UNNEST(c.v1) vv1 WITH OFFSET ind1
JOIN UNNEST(c.v2) vv2 WITH OFFSET ind2 ON (ind1=ind2)
GROUP BY c.k1, c.k2
ORDER BY similarity DESC
LIMIT 100
This query takes 21 minutes to complete and yields the following results:
Row | match_title | similarity | match_url
---|---|---|---
1 | Risk of blood clots in Pfizer COVID-19 vaccine as likely as AstraZeneca jab: Study | 0.5096007905750278 | https://freerepublic.com/focus/f-bloggers/3980674/posts
2 | Manitoba sends back over 5,000 AstraZeneca vaccines, slowing supersites – Classic107: Winnipeg's only dedicated classical and jazz radio station. | 0.4431825337510251 | https://classic107.com/articles/manitoba-sends-back-over-5000-astrazeneca-vaccines-slowing-supersites
3 | Manitoba sends back over 5,000 AstraZeneca vaccines, slowing supersites – CHVNRadio: Southern Manitoba's hub for local and Christian news, and adult contemporary Christian programming. | 0.4431825337510251 | https://www.chvnradio.com/articles/manitoba-sends-back-over-5000-astrazeneca-vaccines-slowing-supersites
4 | الصحة اليابانية توافق على الاستخدام المحلي "استرازينيكا" البريطاني | 0.39529954016380403 | https://www.elbalad.news/4907638
5 | Warning issued over vaccine appointment scam | 0.3915273355127886 | https://www.rte.ie/news/coronavirus/2021/0730/1238296-vaccine-scam/
6 | Torrington Area Health District outreach staff urge parents to update immunizations | 0.3866629025599832 | https://www.registercitizen.com/news/article/Torrington-Area-Health-District-outreach-staff-16353387.php
7 | HPV Vaccination and Cancer Prevention | 0.3840681933710871 | https://www.cancer.org/healthy/hpv-vaccine.html
8 | Manitoba sends 5,500 doses of AstraZeneca-Oxford vaccine back to Ottawa | 0.3788478573936616 | https://www.cbc.ca/news/canada/manitoba/astra-zeneca-manitoba-returned-covid-19-1.6124203
9 | Statystyki szczepień Covid-19 w Polsce 30.07.2021 | 0.37761767354597 | https://dziennikbaltycki.pl/statystyki-szczepien-covid-19-w-polsce-30072021/ar/c14p1-21791559
10 | Szczepienia w Krakowie 30.07.2021. Ile jest zaszczepionych osób przeciwko koronawirusowi? | 0.3772241320142252 | https://krakow.naszemiasto.pl/szczepienia-w-krakowie-30072021-ile-jest-zaszczepionych-osob-przeciwko-koronawirusowi/ar/c14p1-21786221
11 | Szczepienia przeciwko koronawirusowi w Olsztynie 30.07.2021 | 0.37299377586439314 | https://olsztyn.naszemiasto.pl/szczepienia-przeciwko-koronawirusowi-w-olsztynie-30072021/ar/c14p1-21786279
12 | #EndorseThis: Watch Former Anti-Vaxxers Who Survived COVID Plead For Sanity | 0.3698139804191956 | https://www.nationalmemo.com/anti-vaxxer-regret
13 | Roscommon Herald — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://roscommonherald.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
14 | Carlow Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://carlow-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
15 | Laois Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://laois-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
16 | Waterford News and Star — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://waterford-news.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
17 | Kildare Nationalist — Warning over text scam for Covid vaccine appointments | 0.36979538041565135 | https://kildare-nationalist.ie/2021/07/30/warning-over-text-scam-for-covid-vaccine-appointments/
18 | Вакцина "Спутник Лайт" поступила в 109 прививочных пунктов в Петербурге | 0.36977288199435243 | https://www.dp.ru/a/2021/07/30/Vakcina_Sputnik_Lajt_po?hash=775837
19 | Szczepienia w Warszawie 30.07.2021. Ile jest zaszczepionych osób przeciwko koronawirusowi? | 0.36926685371966017 | https://warszawa.naszemiasto.pl/szczepienia-w-warszawie-30072021-ile-jest-zaszczepionych-osob-przeciwko-koronawirusowi/ar/c14p1-21786219
20 | Szczepienia we Wrocławiu 30.07.2021. Jak wygląda sytuacja ze szczepieniami przeciwko koronawirusowi w Twoim powiecie? | 0.3658809090640407 | https://wroclaw.naszemiasto.pl/szczepienia-we-wroclawiu-30072021-jak-wyglada-sytuacja-ze-szczepieniami-przeciwko-koronawirusowi-w-twoim-powiecie/ar/c14p1-21786227
Note that several of the results above are in languages other than English, reflecting the potency of combining machine translation with monolingual document-level embeddings. Note in particular that some of the results only mention blood clots later in the text, rather than in the lead paragraph, reflecting the importance of document-level embeddings over traditional "lead+last" paragraph embeddings.
The low similarity scores reflect the fact that on this particular day (July 30, 2021), there were few articles about blood clots, so articles about vaccination that do not mention blood clots are returned as well. These would typically be filtered out by thresholding the similarity scores, but we've left them in for this example.
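Thresholding of this kind is a one-line filter once the results are in hand. The sketch below applies it to a few rows from the table above (one title shortened). The cutoff of 0.45 is an arbitrary value chosen purely for illustration, not a recommended setting, and a sensible threshold would need to be tuned per application.

```python
# (similarity, title) pairs for rows 1, 2, 5 and 7 of the table above.
results = [
    (0.5096007905750278, "Risk of blood clots in Pfizer COVID-19 vaccine as likely as AstraZeneca jab: Study"),
    (0.4431825337510251, "Manitoba sends back over 5,000 AstraZeneca vaccines, slowing supersites"),
    (0.3915273355127886, "Warning issued over vaccine appointment scam"),
    (0.3840681933710871, "HPV Vaccination and Cancer Prevention"),
]

MIN_SIMILARITY = 0.45  # hypothetical cutoff, for illustration only

# Keep only the rows whose similarity meets the cutoff.
kept = [(score, title) for score, title in results if score >= MIN_SIMILARITY]

for score, title in kept:
    print(f"{score:.3f}  {title}")
```

With this particular cutoff only the first article, the one actually about blood clots, survives.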
The GSG Document Embeddings dataset was launched late in the day on July 30th, so there are only 111,619 documents on that particular day. Despite this low number of documents, the query above takes 21 minutes to return. This is because it flattens the document embeddings to be able to process them in native SQL. Could we speed this up by keeping them as native arrays?
BigQuery supports User Defined Functions written in JavaScript, which would allow us to retain our embeddings as arrays, drastically reducing the pressure on the Join stage of the query. The resulting SQL becomes vastly simpler as well:
CREATE TEMPORARY FUNCTION cossim(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
RETURNS FLOAT64
LANGUAGE js AS '''
  // Accumulate the dot product and both squared norms in a single pass.
  var sumt = 0, suma = 0, sumb = 0;
  for (var i = 0; i < a.length; i++) {
    sumt += (a[i] * b[i]);
    suma += (a[i] * a[i]);
    sumb += (b[i] * b[i]);
  }
  suma = Math.sqrt(suma);
  sumb = Math.sqrt(sumb);
  return sumt / (suma * sumb);
''';
WITH query AS (
select [0.02818347, 0.02490096, 0.02558763, 0.05044514, -0.03208419, -0.0747378 , 0.05282925, -0.03030587, -0.06323209, 0.03501046, 0.07779447, -0.00064674, -0.01432726, -0.04528232, -0.00013452, 0.01319955, -0.07789714, -0.07262603, -0.04428149, -0.06154352, 0.02736403, 0.05429258, 0.00757298, 0.02407369, 0.06519737, 0.06762105, 0.06284684, -0.04060154, 0.00296181, -0.00768548, -0.06485997, 0.07785969, -0.06720406, 0.04895552, -0.00339502, -0.04219737, -0.06142973, 0.05685892, -0.05649175, 0.00050146, 0.03036945, 0.0344508 , -0.06390501, -0.05346889, 0.02882787, 0.0236582 , -0.05156105, 0.03758312, -0.06801498, 0.04185435, -0.05849047, 0.03997547, 0.00219149, 0.03601582, -0.06715488, -0.01793374, 0.04198214, 0.02073287, 0.02610434, -0.04187581, -0.05638166, -0.03530644, -0.0397791 , 0.03841254, 0.04033826, -0.04213821, -0.04512496, 0.0199297 , -0.02139533, -0.00252358, -0.01885297, -0.01272553, -0.06285991, 0.01819749, 0.04675587, 0.01443767, 0.06732985, -0.00535832, -0.05227207, -0.01604555, 0.04312786, -0.02390673, -0.05696457, 0.05386935, -0.04538085, -0.04626585, 0.03936378, -0.0427114 , -0.05459984, -0.06518955, -0.01438342, -0.00492483, 0.01457008, 0.03740922, -0.0532975 , -0.00163596, 0.02512594, -0.01469385, 0.00970429, 0.07029063, 0.0614634 , -0.01581539, 0.03156977, 0.00409102, 0.00903686, 0.03794029, 0.07164939, -0.07418469, -0.04830661, -0.01725078, 0.01502168, 0.04706262, 0.01221037, -0.05155681, 0.01581833, -0.03036154, -0.07328957, 0.03214511, -0.01459734, -0.01128118, -0.06304299, -0.0584773 , -0.06644399, 0.05438589, -0.03069086, 0.03843722, -0.06924088, 0.05682436, -0.04269011, 0.00858494, 0.02617082, 0.07789769, 0.05098371, 
-0.02491225, -0.06794736, 0.06156508, 0.00908578, -0.03028868, 0.03840445, 0.02134955, 0.03145362, -0.01892534, -0.02466874, -0.02372607, 0.02861221, 0.03111894, -0.03854304, -0.0773613 , -0.02113795, 0.02595439, 0.04956857, -0.0180923 , 0.02314183, -0.06816887, 0.06035572, -0.05767199, -0.07086307, -0.04779347, -0.00144571, 0.06937918, -0.01462709, -0.02401111, 0.05321732, -0.01356249, 0.00459683, -0.02109838, -0.06687555, -0.0690091 , -0.00606009, 0.06245927, 0.02045952, -0.06919497, -0.00014548, 0.04921691, -0.05208977, -0.05162445, 0.0299241 , 0.04238597, -0.02582334, -0.06225044, -0.0520192 , -0.01448018, 0.04463071, 0.02302869, 0.04366998, -0.01353321, -0.04836373, 0.04309433, -0.05087637, -0.03325786, -0.07186168, -0.07468725, 0.00795705, 0.04991673, 0.01285803, 0.00494267, -0.0638302 , -0.02696754, 0.00817872, 0.06097675, -0.00566234, 0.00751329, -0.04395932, 0.04099512, -0.05165969, -0.03580604, -0.01097626, 0.06958028, -0.07252809, -0.02024011, -0.0524432 , -0.04691889, -0.01265663, 0.00021553, -0.03097472, -0.0273429 , 0.03665897, -0.02845487, -0.03754009, 0.07597657, -0.05234715, 0.01551333, 0.06670195, -0.00533213, 0.03951516, -0.03683328, 0.03466643, 0.05461149, -0.04580925, -0.01130309, -0.05620241, -0.05523331, 0.04891247, 0.06945612, 0.05329306, 0.03032769, -0.02979039, 0.0025692 , 0.02193681, -0.06778944, -0.00227038, -0.05235931, 0.04479021, -0.04634066, 0.03355216, 0.01576688, -0.04026456, -0.03271901, -0.01752607, -0.07638648, 0.04337133, -0.05835104, -0.02573615, 0.04349405, -0.02281209, 0.01962399, 0.05747679, -0.05053102, 0.01988601, -0.05545291, -0.05873819, 0.03166746, 0.04656564, 0.0153602 , 0.01361886, -0.06163072, 0.05065327, 0.07169119, 0.06954852, 0.06025947, 0.05153174, -0.07366416, -0.0365998 , -0.04009761, 0.05625142, 0.06446043, 0.03201801, 0.02971405, 0.0737622 , 0.06616847, 0.00582288, -0.0525829 , -0.04169655, -0.01577994, 0.01071025, -0.05016847, -0.04691961, -0.0317309 , -0.05048965, 0.05053896, 0.07650524, -0.01776609, 
0.06789573, -0.00531551, -0.04188078, 0.04051043, 0.0736462 , 0.06111282, -0.01588431, 0.06295784, -0.04965852, -0.06756961, -0.00998105, 0.00980487, 0.00649424, 0.02120406, -0.07384725, 0.02591657, -0.04592149, -0.05354767, 0.06983862, 0.03068741, -0.04121724, -0.06307439, 0.06602945, 0.05159619, 0.00031473, 0.06603036, 0.05710867, -0.07667553, 0.01808936, -0.00217794, 0.03378501, -0.02782313, -0.0585771 , 0.07633144, 0.05268069, 0.0635586 , 0.05886192, 0.02527534, -0.00726158, -0.04635018, -0.02846958, 0.06894683, 0.03781663, 0.04901429, -0.01094481, -0.02215855, -0.02272323, 0.04048209, -0.01064503, 0.05737942, 0.07013195, -0.02945651, 0.04803978, -0.07765157, 0.02616347, 0.04886578, -0.04843407, 0.05140765, 0.06862908, -0.02428162, -0.00615881, -0.01507739, 0.02190077, 0.02378325, -0.07355367, 0.0724471 , -0.01313125, -0.02851717, 0.02268659, -0.03548731, -0.04809987, 0.05358881, 0.02318889, -0.05063765, 0.06249962, 0.01871757, -0.00529746, 0.05262776, -0.067226 , 0.02853401, -0.01622482, -0.07752634, 0.03784851, -0.00392051, -0.01120823, -0.04157882, 0.04765187, -0.02162239, 0.0558276 , -0.03292911, 0.0056406 , 0.0571976 , -0.02646085, 0.00437396, 0.0516505 , -0.04328375, 0.03608196, 0.05058712, -0.01735051, -0.06220594, -0.01035582, 0.02820573, -0.06567286, 0.04494439, -0.04865711, 0.03783672, -0.00416228, -0.05124124, -0.05889187, 0.0672591 , -0.05184856, -0.03336031, -0.00189231, 0.04726206, -0.0611569 , -0.00453743, -0.0029412 , -0.05767642, 0.05269921, 0.02825682, -0.01825115, -0.06266699, 0.06990503, 0.05130588, 0.07483746, 0.03357929, -0.0204674 , 0.05995376, 0.02700124, 0.00525981, 0.04424716, 0.00055878, -0.04075001, -0.01280485, -0.04521654, 0.01661577, 0.02164675, 0.05205575, -0.00765765, -0.01064626, 0.06603251, 0.04269373, -0.00468247, 0.008081 , 0.01047275, 0.0424873 , 0.03780128, -0.01339662, 0.04674398, 0.02245212, -0.01063377, 0.04146469, -0.04531259, -0.03408718, 0.02279444, -0.05073286, 0.04562168, 0.06423643, 0.06099452, -0.02337616, 
-0.01288495, -0.06582094, -0.01495991, -0.01404245, -0.00909756, 0.05820715, -0.01316472, -0.03580612, -0.06935831, 0.03990944, 0.02947431, -0.03525035, 0.0144262 , -0.03238249, 0.05131285, 0.02010029, -0.04254121, 0.05531391, -0.0467257 , 0.01790263, 0.05152426, 0.05766447, 0.01373304, -0.04750573, -0.05672399, 0.0641381 , -0.04198284, -0.01444486, 0.01031578, -0.05345995, 0.05150107, 0.0200979 , -0.0052268 , -0.03236465, -0.04926898, 0.06556525, 0.04177241, 0.00885976, 0.00389759, 0.06858263, 0.0283224 , -0.05872786, 0.02647729, -0.01962265, -0.02361701, -0.04401072, -0.01043554, 0.07544696, 0.04094814, 0.01860246, 0.05765727, -0.06587842, 0.05198153, -0.07789869, 0.03400592, -0.06168717, 0.02568811, -0.04552713, -0.05306999, -0.02551358] as docembed
)
SELECT cossim(doc.docembed, query.docembed) sim, title, url
FROM `gdelt-bq.gdeltv2.gsg_docembed` doc, query
WHERE DATE(date) = "2021-07-30"
ORDER BY sim DESC
LIMIT 100
Here, we simply define our cosine similarity computation as a UDF and then invoke it via a trivial one-line SQL statement (the bulk of the SQL above is simply the copy-pasted query vector embedding). This query takes just 46 seconds to complete! The two queries return identical result lists.
Why is the UDF query so much faster? The execution details for the two queries shed some light. The native SQL query takes 20 minutes 57 seconds to complete, consumes 1 hour 45 minutes of slot time, shuffles 17.38MB and, during the Join stage, workers take 689 seconds on average and 1,891 seconds at most. In contrast, the UDF-based query takes just 45.7 seconds to complete, consumes just 1 minute 51 seconds of worker time, shuffles just 39KB and, during the Join stage, workers take 29 seconds on average and 82 seconds at most.
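A little arithmetic on the figures quoted above puts the gap in perspective. The sketch below assumes the flattened join ends up with one row per document per embedding dimension, which is our reading of the UNNEST-based query rather than something the execution details report directly. It also treats the "worker time" figure for the UDF query as the same metric as the slot time reported for the SQL query.

```python
DOCS = 111_619   # documents available for July 30, 2021
DIMS = 512       # dimensions per USE embedding

# Assumed size of the flattened join: one row per (document, dimension) pair.
flattened_rows = DOCS * DIMS

def seconds(hours=0, minutes=0, secs=0.0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + secs

sql_elapsed = seconds(minutes=20, secs=57)
udf_elapsed = seconds(secs=45.7)
sql_slot = seconds(hours=1, minutes=45)
udf_slot = seconds(minutes=1, secs=51)   # reported as "worker time"

elapsed_speedup = round(sql_elapsed / udf_elapsed, 1)
slot_speedup = round(sql_slot / udf_slot, 1)
join_avg_speedup = round(689 / 29, 1)

print(flattened_rows)     # 57148928
print(elapsed_speedup)    # 27.5
print(slot_speedup)       # 56.8
print(join_avg_speedup)   # 23.8
```

In other words, the pure-SQL version has to push on the order of 57 million scalar rows through the join and aggregation, while the UDF version scores each of the 111,619 documents in a single function call.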
Of course, a 46-second latency is still far too slow for production querying, so a real-world application would use locality-sensitive hashing or a standalone indexing platform like Vertex Matching Engine, but these queries allow you to see how semantic natural language querying works using embeddings!
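For readers curious what such preindexing might look like, here is a toy sketch of one common family of locality-sensitive hashing, random hyperplane hashing, which suits cosine similarity. It is an illustration only: it is not how GDELT or Vertex Matching Engine index content, the data is random, and the helper name `make_hasher` and the 8-bit signature length are our own choices.

```python
import numpy as np
from collections import defaultdict

def make_hasher(dim, n_bits, seed=0):
    """Build a signature function from n_bits random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, dim))

    def signature(vec):
        # Each bit records which side of a hyperplane the vector falls on.
        bits = (planes @ np.asarray(vec, dtype=np.float64)) > 0
        return tuple(bits.tolist())

    return signature

# Toy stand-in data (random vectors, NOT real GSG embeddings).
rng = np.random.default_rng(1)
docs = rng.normal(size=(2000, 512))
signature = make_hasher(dim=512, n_bits=8)

# Index time: bucket every document by its signature.
buckets = defaultdict(list)
for i, doc in enumerate(docs):
    buckets[signature(doc)].append(i)

# Query time: only documents in the query's bucket are candidates.
query = docs[7] * 3.0   # same direction as document 7, different length
candidates = buckets[signature(query)]
print(len(candidates), 7 in candidates)
```

Vectors pointing in similar directions tend to share a signature, so exact cosine similarity only needs to be computed for the handful of candidates in the matching bucket rather than for every document. Real systems use multiple hash tables and longer signatures to trade recall against speed.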