Earlier today we unveiled the Global Similarity Graph Television News Sentence Embeddings, a massive new dataset of 189 million sentence-level Universal Sentence Encoder embeddings over television news, covering BBC News London (2017-present), CNN, MSNBC, Fox News and the ABC/CBS/NBC evening news broadcasts spanning more than a decade. How might we use this immense dataset to scan television news for known fact check claims?
Today you can use the Television Explorer to keyword search the closed captioning of these stations and selections from more than 150 others for exact keyword matches. For example, to find references to vaccines and microchips together, you can search for "(vaccine OR vaccines) (microchip OR microchips)". However, this will only return captioning clips that contain your exact keywords. A mention of "semiconductor tracking" in vaccines or "chipped vaccines" won't be returned. Moreover, using keyword searches to identify references to fact check claims requires distilling each fact check down to a set of representative keywords that fully encapsulate it, which may be difficult for more complex claims. Put another way, a typical fact check will summarize the claim it is investigating in a sentence or a few sentences of text – to keyword search for this claim requires taking those sentences of text and reducing them to a handful of searchable keywords.
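To make the limitation concrete, here is a minimal sketch of exact keyword matching. It is not the Television Explorer's actual implementation: `matches_query` is a hypothetical helper and the caption snippets are made up for illustration.

```python
import re

def matches_query(text, groups):
    """True if every OR-group has at least one exact whole-word match in text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(any(term in words for term in group) for group in groups)

# Mirrors the search (vaccine OR vaccines) (microchip OR microchips)
QUERY = [("vaccine", "vaccines"), ("microchip", "microchips")]

# Hypothetical caption snippets (illustrative only, not from the dataset)
captions = [
    "they claim the vaccine contains a microchip",
    "rumors of semiconductor tracking in vaccines",
    "people worried about chipped vaccines",
]

for caption in captions:
    print(matches_query(caption, QUERY), "-", caption)
```

Only the first snippet matches. The other two express the same idea but never use the word "microchip", so an exact keyword search passes over them.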
In contrast, our embeddings dataset represents each sentence of closed captioning as a fixed-length 512-dimension vector that attempts to capture its topical focus. To identify references to a known fact check claim, we can simply convert the entire sentence- or paragraph-long fact check summary verbatim into an embedding and compute its cosine similarity against every one of the sentence embeddings in our dataset to identify potential references to it. A production application would likely use locality-sensitive hashing or other approximate nearest neighbor (ANN) methods to avoid having to perform 189 million brute-force similarity comparisons, but for the purposes of this simple demonstration, we're going to use BigQuery to do a simple brute-force comparison.
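For readers curious what the approximate approach might look like, here is a minimal sketch of one common technique, random-hyperplane hashing, using NumPy. It is not part of this demonstration: the corpus is random stand-in data rather than real embeddings, and `lsh_key`, `candidates`, `N_BITS` and the corpus size are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_BITS = 512, 16

# Stand-in data: random unit vectors in place of real sentence embeddings.
corpus = rng.normal(size=(10_000, DIM))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Each random hyperplane contributes one bit: which side the vector falls on.
planes = rng.normal(size=(N_BITS, DIM))

def lsh_key(vectors):
    """Pack the sign pattern of each vector into one integer bucket key."""
    bits = (vectors @ planes.T) > 0
    return bits.astype(np.int64) @ (1 << np.arange(N_BITS))

corpus_keys = lsh_key(corpus)

def candidates(query):
    """Indices of corpus vectors that share the query's bucket."""
    return np.flatnonzero(corpus_keys == lsh_key(query[None, :])[0])

# A query that is a lightly perturbed copy of corpus item 42.
query = corpus[42] + 0.001 * rng.normal(size=DIM)
query /= np.linalg.norm(query)

shortlist = candidates(query)
sims = corpus[shortlist] @ query  # exact cosine only on the shortlist
print(len(shortlist), "candidates instead of", len(corpus))
```

A single hash table like this can miss true neighbors that fall just across a hyperplane, which is why practical systems typically query several independent tables and then rerank the merged shortlist with exact cosine similarity.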
Let's start with this Quick Take summary from a FactCheck.org fact check relating to the false claim that Covid-19 vaccines embed microchips in recipients for tracking:
- "A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that “tracks the location of the patient.” The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people." (FactCheck.org)
How could we search television news for mentions of this claim?
A traditional approach would be to take these three sentences and attempt to distill them down to a set of searchable keywords using statistical information about word usage like TF-IDF distributions to identify "statistically significant phrases." In this case that approach might yield a collection of statistically significant words and phrases like "video, social media, vaccines, COVID-19, microchip, tracks … location, location … patient, chip, plastic vial, vaccine dose, track people." Performing a giant "AND" search for all of these keywords yields no results, so you would have to search for them in combination. The problem is that not all of these keywords/phrases are related to the central claim of tracking. A search for "vaccine + plastic vial" might yield a number of results, but those are not central to the claim of microchip-based tracking. Similarly, a search for "social media + track people" is not related to the core claim. More advanced NLP techniques could help narrow which phrases are most central to the claim, but even those approaches may not be able to completely distill the claim into a set of searchable keywords.
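To give a sense of how quickly "searching in combination" grows, here is a small sketch that enumerates the candidate keyword combinations from the phrase list above. The variable names are illustrative, and nothing here ranks which combinations are actually relevant to the claim.

```python
from itertools import combinations

# The phrases listed above, copied as written.
phrases = [
    "video", "social media", "vaccines", "COVID-19", "microchip",
    "tracks … location", "location … patient", "chip",
    "plastic vial", "vaccine dose", "track people",
]

pairs = list(combinations(phrases, 2))
triples = list(combinations(phrases, 3))

print(len(pairs))    # 55 two-phrase searches
print(len(triples))  # 165 three-phrase searches
print(("vaccines", "plastic vial") in pairs)  # True, yet not central to the claim
```

Even for this short summary there are 55 possible two-phrase searches, and many of them, like "vaccines + plastic vial", would surface clips unrelated to microchip-based tracking.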
Enter the power of embeddings.
First, we take the three-sentence quick take summary above and convert it as-is into an embedding in the same Universal Sentence Encoder vector space. To do so, create a free new Colab notebook and run the following code:
    #load libraries...
    import tensorflow_hub as hub
    import tensorflow as tf
    !pip install tensorflow_text
    import tensorflow_text as text # Needed for loading universal-sentence-encoder-cmlm/multilingual-preprocess
    import numpy as np

    #normalize...
    def normalization(embeds):
        norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
        return embeds/norms

    sent = tf.constant(["A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that “tracks the location of the patient.” The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people."])

    embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    sente = embed_use(sent)
    sente = normalization(sente)
    print(repr(sente))
This loads the necessary libraries and converts the verbatim three-sentence fact check claim summary as-is into its USE vector representation, printing a 512-dimension array:
<tf.Tensor: shape=(1, 512), dtype=float32, numpy= array([[-6.76356838e-04, 3.35783325e-02, -6.67930841e-02, -7.14375153e-02, -3.08108889e-02, 1.07741086e-02, 2.11005341e-02, 6.49546608e-02, 4.30178866e-02, -6.82501197e-02, 8.06940496e-02, 4.91752811e-02, 6.29389361e-02, 1.66880507e-02, -5.67101641e-03, -3.70305106e-02, -7.97090977e-02, -7.23296637e-03, -7.27019385e-02, 3.61410975e-02, -1.81389842e-02, 3.56576918e-03, -7.38827288e-02, 4.35053706e-02, -3.35104694e-03, 6.37104064e-02, 1.92584172e-02, -5.36565781e-02, 3.81646678e-02, 3.99802737e-02, -5.32769971e-02, 8.15957189e-02, -2.47750692e-02, 4.34365049e-02, 7.18429685e-02, 7.98831061e-02, -8.14278647e-02, 7.30962753e-02, -3.90970185e-02, -2.42321957e-02, 8.85859481e-04, 1.59769729e-02, 1.73619445e-02, -6.09335937e-02, -5.77191003e-02, -5.14351763e-03, 4.95849364e-02, 5.45341708e-02, 3.67251299e-02, 5.23556210e-03, -7.26118013e-02, -1.82263218e-02, 2.83526741e-02, 7.12847263e-02, -7.51046464e-02, -3.59605625e-02, -4.63198870e-02, -5.76271117e-02, 3.94778773e-02, -7.19692186e-02, -1.35769062e-02, 6.82483837e-02, -3.86933982e-02, -4.68094014e-02, 3.57298478e-02, -6.87625632e-02, 3.24299969e-02, 6.28880039e-02, -7.18246251e-02, 3.15887854e-02, -5.05154385e-05, 4.15558480e-02, -5.05858241e-03, -2.23924946e-02, 6.13835901e-02, -3.54572162e-02, 4.43822816e-02, -4.83701378e-02, -5.33702224e-02, 4.93556038e-02, -8.73594370e-04, -7.18877092e-02, -2.08738167e-02, -2.28899792e-02, 1.76748831e-03, -3.10755782e-02, 1.93924904e-02, 5.02826758e-02, 4.10971418e-03, -5.37503473e-02, 5.57333715e-02, 2.84079444e-02, -1.86564196e-02, 2.70265304e-02, -5.42857102e-04, 8.57702456e-03, 6.09128438e-02, -5.45500219e-02, -4.38415771e-03, -4.50687436e-03, -5.93304411e-02, 5.69677241e-02, -5.57250157e-02, -5.81391938e-02, 6.88386261e-02, 7.11042359e-02, 6.56928271e-02, 1.66122485e-02, 6.30950481e-02, -4.72512022e-02, -6.91004544e-02, 1.08533464e-02, 3.22929490e-03, -9.46690096e-04, 6.15085438e-02, -4.10078615e-02, -1.52225317e-02, 
7.38443527e-03, -2.84125120e-03, 3.94875668e-02, -5.78187183e-02, -3.46644572e-03, 1.43317459e-02, -5.20273345e-03, -4.29149270e-02, -5.53261004e-02, -1.87431239e-02, -4.35798094e-02, -1.23266215e-02, 6.26963973e-02, 6.76322281e-02, 8.16316083e-02, 2.64641959e-02, -3.53941061e-02, -5.05655743e-02, -3.18637267e-02, -1.14524448e-02, 7.41548762e-02, -6.11535534e-02, -6.55739233e-02, 3.84859666e-02, 6.66911826e-02, -7.20713809e-02, 2.92808632e-03, 5.62359951e-02, 9.51226894e-03, 4.56910357e-02, -7.51297697e-02, 6.90657347e-02, 4.21055108e-02, -2.59263385e-02, 9.81741399e-03, 2.94160913e-03, -5.97751215e-02, 5.50655387e-02, 5.55065367e-03, 4.92441691e-02, 5.39373793e-02, 2.63481354e-03, 5.61112165e-02, -1.34488298e-02, 4.05395217e-02, -5.48825506e-03, -6.58439845e-02, -2.58194543e-02, -1.58555657e-02, 2.34762882e-03, -1.25716645e-02, -4.26669009e-02, 1.60640255e-02, 5.15688658e-02, -7.21045434e-02, -1.73300430e-02, -2.93727703e-02, 1.51059516e-02, 3.61871794e-02, -1.12395249e-02, 2.94390563e-02, 2.10165903e-02, -2.14187857e-02, -6.51167110e-02, -1.82202216e-02, 2.65326332e-02, 6.87541533e-03, 7.72116631e-02, -3.04351989e-02, -8.53389502e-03, 5.89442579e-03, 2.55114846e-02, -3.72634716e-02, 1.35741597e-02, -2.87334360e-02, 2.89946962e-02, -1.76235684e-03, 4.42778356e-02, 4.59584370e-02, -2.07500421e-02, -1.66762974e-02, 3.10814641e-02, -4.84270193e-02, 3.69234122e-02, 1.60416390e-03, 1.90665293e-02, -2.38827430e-02, 2.87359916e-02, -7.04021528e-02, 5.11950590e-02, 2.37841941e-02, 2.78002135e-02, 6.73737004e-02, 1.23101231e-02, -2.61966907e-03, 5.78176118e-02, 3.68967131e-02, -3.38435126e-03, 3.38038430e-02, 7.10640773e-02, -4.38991282e-03, -2.46944465e-03, 6.89607188e-02, -2.10304111e-02, 1.74455028e-02, 4.72253896e-02, 7.55667314e-02, 4.17576358e-02, 5.06599247e-02, -4.47778143e-02, -2.82419380e-02, -5.05971462e-02, -2.29050219e-02, -6.34119362e-02, -2.48881299e-02, -2.07891967e-02, -8.11395496e-02, -7.57279713e-03, 1.01827094e-02, 2.36734990e-02, -8.92031100e-03, 
-2.52159638e-03, -4.42645922e-02, 2.72163115e-02, -4.16662544e-02, 6.28380328e-02, -6.46699443e-02, 4.98163030e-02, 2.40474264e-03, -4.35062796e-02, 9.96236573e-04, 3.21699865e-03, -7.41114616e-02, 5.10127423e-03, 5.91132231e-03, -2.09343582e-02, 5.36168702e-02, 6.32562339e-02, -1.18424380e-02, -5.33314049e-02, 8.15369189e-02, -3.56774591e-02, -3.64913158e-02, -2.39417814e-02, -1.68477502e-02, 4.09753472e-02, -5.00184186e-02, -3.02095693e-02, -6.65327683e-02, 6.97435886e-02, 6.97659627e-02, 2.44959928e-02, -7.88502675e-03, -1.70990340e-02, -3.60420384e-02, -1.89642422e-02, -7.21183345e-02, -6.83112964e-02, 5.45631945e-02, 5.56440577e-02, -6.96792156e-02, 5.17817736e-02, -5.04019484e-03, -7.98536614e-02, -6.72034398e-02, 3.57697830e-02, 7.33052269e-02, -6.80490360e-02, -6.17038347e-02, -7.66119808e-02, 5.73239997e-02, -1.82283260e-02, -3.99673171e-02, 5.84224798e-02, -7.66021237e-02, 6.21817261e-02, -2.64632311e-02, -2.70551220e-02, -2.09122039e-02, 4.49912027e-02, 5.27962260e-02, -1.61876865e-02, 2.99768839e-02, -3.36280465e-02, -1.51605671e-03, -2.47574896e-02, 5.60405292e-03, 9.68514197e-03, -1.68982614e-02, -7.41703138e-02, 8.06703139e-03, -1.44717349e-02, -1.71089768e-02, 5.54176830e-02, 5.78010641e-02, -1.78876668e-02, -1.29997097e-02, 5.63468151e-02, 7.27057606e-02, -1.10625252e-02, 7.14442134e-03, 2.05701385e-02, 5.05811423e-02, 8.63553584e-03, 5.28230928e-02, 5.43508772e-03, -4.37706057e-03, -1.69210937e-02, 7.57706687e-02, 1.81356780e-02, 4.76261452e-02, 1.06395511e-02, -7.35997260e-02, 6.64972737e-02, 5.22572510e-02, 5.71820438e-02, 2.35127471e-02, 2.56717186e-02, -3.12523060e-02, 3.06110885e-02, -1.37786288e-03, -2.19957810e-02, -3.60827036e-02, 5.08690346e-03, -1.49257006e-02, 7.51743838e-02, 5.96603751e-03, -2.87195779e-02, -6.46486282e-02, -5.29822260e-02, -1.47496245e-03, -7.67807290e-02, 3.60531062e-02, 6.81242943e-02, 2.16352921e-02, -8.55341647e-03, -2.21430194e-02, -8.83253198e-03, 3.59478197e-03, -7.58751631e-02, 5.91248423e-02, 4.42272760e-02, 
-3.39478180e-02, 4.03287634e-02, -4.57744524e-02, 4.45390902e-02, 6.11837544e-02, -3.16450559e-02, -7.24180341e-02, 2.00887918e-02, -3.19629908e-02, -6.86090300e-03, 3.28799486e-02, -7.06641898e-02, 1.93985011e-02, -3.90757024e-02, -3.66524868e-02, 5.76053411e-02, 6.35175093e-04, 4.37529907e-02, -1.18877320e-02, 6.06463850e-02, -9.88359284e-03, 3.63793671e-02, -7.47962818e-02, -7.39435032e-02, 6.32128567e-02, 6.12870194e-02, -6.58575967e-02, 2.75015971e-03, -4.56172265e-02, 6.30888119e-02, 9.60739609e-03, -5.22800013e-02, -6.43881708e-02, 2.20250804e-02, 1.62106263e-03, -4.56735268e-02, -2.72764824e-02, 1.55690350e-02, -1.04821082e-02, 1.09128319e-02, 4.22615670e-02, 3.59000675e-02, 2.44700797e-02, 1.16510596e-02, -2.50982083e-02, 5.81165403e-02, -2.99764648e-02, 3.09661478e-02, 4.46595661e-02, -5.85869774e-02, -6.54702336e-02, -4.22021002e-02, -1.62350927e-02, -2.27494091e-02, 7.32957125e-02, 7.31576756e-02, -5.63123915e-03, -1.70655418e-02, -1.72184445e-02, 7.96012357e-02, 5.04064001e-02, 1.87336896e-02, 4.66014594e-02, 5.06016947e-02, 2.99742296e-02, 2.36040205e-02, -5.34015521e-02, 1.35052192e-03, 6.80805445e-02, 2.22724825e-02, -3.01939417e-02, -7.31360614e-02, -2.51521859e-02, 4.05842923e-02, 1.60862431e-02, -2.25230474e-02, 2.86010765e-02, 2.29199734e-02, 2.20593587e-02, -5.47554232e-02, 5.78941219e-02, 6.56009391e-02, -7.13857934e-02, -7.48909358e-03, 6.78754002e-02, -2.92496174e-03, -6.95068836e-02, 5.49913049e-02, 3.31136771e-02, -2.13340521e-02, -4.93099019e-02, -1.47924507e-02, -6.91533834e-02, -3.78118120e-02, -6.53230399e-02, -4.87434752e-02, 2.96516586e-02, -7.58804101e-03, -1.24682412e-02, -7.59115368e-02, -3.75635456e-03, 2.93405913e-02, -5.34483194e-02, -1.71132796e-02, 5.20518832e-02, -6.30412847e-02, -5.12841195e-02, -3.42662632e-02, -5.49382344e-02, -6.89640343e-02, 6.04783110e-02, -2.27603670e-02, 6.75819349e-03, 5.91140948e-02, -4.53736931e-02, 3.08122877e-02, -2.23298520e-02, -1.62059460e-02, 4.74171452e-02, -7.03755468e-02, -5.98351210e-02, 
4.70205843e-02, -5.80502860e-03, 2.21272446e-02, -7.57156089e-02, 4.97078523e-02, -3.15653495e-02, 4.92160209e-02, -3.86846699e-02, 2.09846185e-03, -5.62710315e-02, -2.08172686e-02, 7.27586523e-02, 3.23538706e-02, -1.12844165e-02, 4.76871207e-02, 7.68466992e-03, -1.23470398e-02, 2.14785617e-02, 2.89252382e-02, 3.02119087e-02, 2.10444834e-02, 2.13446151e-02, -3.27234976e-02, -6.14904426e-02, -6.49609463e-03, -8.24379921e-02, -2.68000960e-02, 7.73761328e-03, -4.13497761e-02, 1.23230414e-02, -6.00103587e-02, 1.27278063e-02]], dtype=float32)>
Now we need to compute the cosine similarity of this vector against all of the sentence-level vectors in the Global Similarity Graph Television News Sentence Embeddings dataset to find potential references to it. Since this particular vaccine falsehood began trending later in 2020, we'll limit ourselves to examining broadcasts that aired from November 1, 2020 to present.
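As a reference for what the comparison computes, here is a minimal NumPy sketch of brute-force cosine similarity ranking. It uses tiny made-up 3-dimension vectors in place of the real 512-dimension embeddings, and `cossim` and `top_k` are illustrative helper names.

```python
import numpy as np

def cossim(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query, docs, k):
    """Score every row of docs against query and return the k best (index, sim) pairs."""
    docs = np.asarray(docs, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors, made up for illustration.
query = [1.0, 0.0, 0.0]
docs = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

print(top_k(query, docs, 2))
```

Because the score depends only on the angle between vectors, the first toy document scores 1.0 despite having twice the query's length, and the ranking is the same "sort by similarity, keep the top results" pattern applied to the full dataset.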
We simply copy-paste the vector above into a SQL + UDF query in BigQuery:
CREATE TEMPORARY FUNCTION cossim(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS '''
  var sumt=0, suma=0, sumb=0;
  for(var i=0;i<a.length;i++) {
    sumt += (a[i]*b[i]);
    suma += (a[i]*a[i]);
    sumb += (b[i]*b[i]);
  }
  suma = Math.sqrt(suma);
  sumb = Math.sqrt(sumb);
  return sumt/(suma*sumb);
''';
WITH query AS (
select [-6.76356838e-04, 3.35783325e-02, -6.67930841e-02, -7.14375153e-02, -3.08108889e-02, 1.07741086e-02, 2.11005341e-02, 6.49546608e-02, 4.30178866e-02, -6.82501197e-02, 8.06940496e-02, 4.91752811e-02, 6.29389361e-02, 1.66880507e-02, -5.67101641e-03, -3.70305106e-02, -7.97090977e-02, -7.23296637e-03, -7.27019385e-02, 3.61410975e-02, -1.81389842e-02, 3.56576918e-03, -7.38827288e-02, 4.35053706e-02, -3.35104694e-03, 6.37104064e-02, 1.92584172e-02, -5.36565781e-02, 3.81646678e-02, 3.99802737e-02, -5.32769971e-02, 8.15957189e-02, -2.47750692e-02, 4.34365049e-02, 7.18429685e-02, 7.98831061e-02, -8.14278647e-02, 7.30962753e-02, -3.90970185e-02, -2.42321957e-02, 8.85859481e-04, 1.59769729e-02, 1.73619445e-02, -6.09335937e-02, -5.77191003e-02, -5.14351763e-03, 4.95849364e-02, 5.45341708e-02, 3.67251299e-02, 5.23556210e-03, -7.26118013e-02, -1.82263218e-02, 2.83526741e-02, 7.12847263e-02, -7.51046464e-02, -3.59605625e-02, -4.63198870e-02, -5.76271117e-02, 3.94778773e-02, -7.19692186e-02, -1.35769062e-02, 6.82483837e-02, -3.86933982e-02, -4.68094014e-02, 3.57298478e-02, -6.87625632e-02, 3.24299969e-02, 6.28880039e-02, -7.18246251e-02, 3.15887854e-02, -5.05154385e-05, 4.15558480e-02, -5.05858241e-03, -2.23924946e-02, 6.13835901e-02, -3.54572162e-02, 4.43822816e-02, -4.83701378e-02, -5.33702224e-02, 4.93556038e-02, -8.73594370e-04, -7.18877092e-02, -2.08738167e-02, -2.28899792e-02, 1.76748831e-03, -3.10755782e-02, 1.93924904e-02, 5.02826758e-02, 4.10971418e-03, -5.37503473e-02, 5.57333715e-02, 2.84079444e-02, -1.86564196e-02, 2.70265304e-02, -5.42857102e-04, 8.57702456e-03, 6.09128438e-02, -5.45500219e-02, -4.38415771e-03, -4.50687436e-03, -5.93304411e-02, 
5.69677241e-02, -5.57250157e-02, -5.81391938e-02, 6.88386261e-02, 7.11042359e-02, 6.56928271e-02, 1.66122485e-02, 6.30950481e-02, -4.72512022e-02, -6.91004544e-02, 1.08533464e-02, 3.22929490e-03, -9.46690096e-04, 6.15085438e-02, -4.10078615e-02, -1.52225317e-02, 7.38443527e-03, -2.84125120e-03, 3.94875668e-02, -5.78187183e-02, -3.46644572e-03, 1.43317459e-02, -5.20273345e-03, -4.29149270e-02, -5.53261004e-02, -1.87431239e-02, -4.35798094e-02, -1.23266215e-02, 6.26963973e-02, 6.76322281e-02, 8.16316083e-02, 2.64641959e-02, -3.53941061e-02, -5.05655743e-02, -3.18637267e-02, -1.14524448e-02, 7.41548762e-02, -6.11535534e-02, -6.55739233e-02, 3.84859666e-02, 6.66911826e-02, -7.20713809e-02, 2.92808632e-03, 5.62359951e-02, 9.51226894e-03, 4.56910357e-02, -7.51297697e-02, 6.90657347e-02, 4.21055108e-02, -2.59263385e-02, 9.81741399e-03, 2.94160913e-03, -5.97751215e-02, 5.50655387e-02, 5.55065367e-03, 4.92441691e-02, 5.39373793e-02, 2.63481354e-03, 5.61112165e-02, -1.34488298e-02, 4.05395217e-02, -5.48825506e-03, -6.58439845e-02, -2.58194543e-02, -1.58555657e-02, 2.34762882e-03, -1.25716645e-02, -4.26669009e-02, 1.60640255e-02, 5.15688658e-02, -7.21045434e-02, -1.73300430e-02, -2.93727703e-02, 1.51059516e-02, 3.61871794e-02, -1.12395249e-02, 2.94390563e-02, 2.10165903e-02, -2.14187857e-02, -6.51167110e-02, -1.82202216e-02, 2.65326332e-02, 6.87541533e-03, 7.72116631e-02, -3.04351989e-02, -8.53389502e-03, 5.89442579e-03, 2.55114846e-02, -3.72634716e-02, 1.35741597e-02, -2.87334360e-02, 2.89946962e-02, -1.76235684e-03, 4.42778356e-02, 4.59584370e-02, -2.07500421e-02, -1.66762974e-02, 3.10814641e-02, -4.84270193e-02, 3.69234122e-02, 1.60416390e-03, 1.90665293e-02, -2.38827430e-02, 2.87359916e-02, -7.04021528e-02, 5.11950590e-02, 2.37841941e-02, 2.78002135e-02, 6.73737004e-02, 1.23101231e-02, -2.61966907e-03, 5.78176118e-02, 3.68967131e-02, -3.38435126e-03, 3.38038430e-02, 7.10640773e-02, -4.38991282e-03, -2.46944465e-03, 6.89607188e-02, -2.10304111e-02, 1.74455028e-02, 
4.72253896e-02, 7.55667314e-02, 4.17576358e-02, 5.06599247e-02, -4.47778143e-02, -2.82419380e-02, -5.05971462e-02, -2.29050219e-02, -6.34119362e-02, -2.48881299e-02, -2.07891967e-02, -8.11395496e-02, -7.57279713e-03, 1.01827094e-02, 2.36734990e-02, -8.92031100e-03, -2.52159638e-03, -4.42645922e-02, 2.72163115e-02, -4.16662544e-02, 6.28380328e-02, -6.46699443e-02, 4.98163030e-02, 2.40474264e-03, -4.35062796e-02, 9.96236573e-04, 3.21699865e-03, -7.41114616e-02, 5.10127423e-03, 5.91132231e-03, -2.09343582e-02, 5.36168702e-02, 6.32562339e-02, -1.18424380e-02, -5.33314049e-02, 8.15369189e-02, -3.56774591e-02, -3.64913158e-02, -2.39417814e-02, -1.68477502e-02, 4.09753472e-02, -5.00184186e-02, -3.02095693e-02, -6.65327683e-02, 6.97435886e-02, 6.97659627e-02, 2.44959928e-02, -7.88502675e-03, -1.70990340e-02, -3.60420384e-02, -1.89642422e-02, -7.21183345e-02, -6.83112964e-02, 5.45631945e-02, 5.56440577e-02, -6.96792156e-02, 5.17817736e-02, -5.04019484e-03, -7.98536614e-02, -6.72034398e-02, 3.57697830e-02, 7.33052269e-02, -6.80490360e-02, -6.17038347e-02, -7.66119808e-02, 5.73239997e-02, -1.82283260e-02, -3.99673171e-02, 5.84224798e-02, -7.66021237e-02, 6.21817261e-02, -2.64632311e-02, -2.70551220e-02, -2.09122039e-02, 4.49912027e-02, 5.27962260e-02, -1.61876865e-02, 2.99768839e-02, -3.36280465e-02, -1.51605671e-03, -2.47574896e-02, 5.60405292e-03, 9.68514197e-03, -1.68982614e-02, -7.41703138e-02, 8.06703139e-03, -1.44717349e-02, -1.71089768e-02, 5.54176830e-02, 5.78010641e-02, -1.78876668e-02, -1.29997097e-02, 5.63468151e-02, 7.27057606e-02, -1.10625252e-02, 7.14442134e-03, 2.05701385e-02, 5.05811423e-02, 8.63553584e-03, 5.28230928e-02, 5.43508772e-03, -4.37706057e-03, -1.69210937e-02, 7.57706687e-02, 1.81356780e-02, 4.76261452e-02, 1.06395511e-02, -7.35997260e-02, 6.64972737e-02, 5.22572510e-02, 5.71820438e-02, 2.35127471e-02, 2.56717186e-02, -3.12523060e-02, 3.06110885e-02, -1.37786288e-03, -2.19957810e-02, -3.60827036e-02, 5.08690346e-03, -1.49257006e-02, 7.51743838e-02, 
5.96603751e-03, -2.87195779e-02, -6.46486282e-02, -5.29822260e-02, -1.47496245e-03, -7.67807290e-02, 3.60531062e-02, 6.81242943e-02, 2.16352921e-02, -8.55341647e-03, -2.21430194e-02, -8.83253198e-03, 3.59478197e-03, -7.58751631e-02, 5.91248423e-02, 4.42272760e-02, -3.39478180e-02, 4.03287634e-02, -4.57744524e-02, 4.45390902e-02, 6.11837544e-02, -3.16450559e-02, -7.24180341e-02, 2.00887918e-02, -3.19629908e-02, -6.86090300e-03, 3.28799486e-02, -7.06641898e-02, 1.93985011e-02, -3.90757024e-02, -3.66524868e-02, 5.76053411e-02, 6.35175093e-04, 4.37529907e-02, -1.18877320e-02, 6.06463850e-02, -9.88359284e-03, 3.63793671e-02, -7.47962818e-02, -7.39435032e-02, 6.32128567e-02, 6.12870194e-02, -6.58575967e-02, 2.75015971e-03, -4.56172265e-02, 6.30888119e-02, 9.60739609e-03, -5.22800013e-02, -6.43881708e-02, 2.20250804e-02, 1.62106263e-03, -4.56735268e-02, -2.72764824e-02, 1.55690350e-02, -1.04821082e-02, 1.09128319e-02, 4.22615670e-02, 3.59000675e-02, 2.44700797e-02, 1.16510596e-02, -2.50982083e-02, 5.81165403e-02, -2.99764648e-02, 3.09661478e-02, 4.46595661e-02, -5.85869774e-02, -6.54702336e-02, -4.22021002e-02, -1.62350927e-02, -2.27494091e-02, 7.32957125e-02, 7.31576756e-02, -5.63123915e-03, -1.70655418e-02, -1.72184445e-02, 7.96012357e-02, 5.04064001e-02, 1.87336896e-02, 4.66014594e-02, 5.06016947e-02, 2.99742296e-02, 2.36040205e-02, -5.34015521e-02, 1.35052192e-03, 6.80805445e-02, 2.22724825e-02, -3.01939417e-02, -7.31360614e-02, -2.51521859e-02, 4.05842923e-02, 1.60862431e-02, -2.25230474e-02, 2.86010765e-02, 2.29199734e-02, 2.20593587e-02, -5.47554232e-02, 5.78941219e-02, 6.56009391e-02, -7.13857934e-02, -7.48909358e-03, 6.78754002e-02, -2.92496174e-03, -6.95068836e-02, 5.49913049e-02, 3.31136771e-02, -2.13340521e-02, -4.93099019e-02, -1.47924507e-02, -6.91533834e-02, -3.78118120e-02, -6.53230399e-02, -4.87434752e-02, 2.96516586e-02, -7.58804101e-03, -1.24682412e-02, -7.59115368e-02, -3.75635456e-03, 2.93405913e-02, -5.34483194e-02, -1.71132796e-02, 5.20518832e-02, 
-6.30412847e-02, -5.12841195e-02, -3.42662632e-02, -5.49382344e-02, -6.89640343e-02, 6.04783110e-02, -2.27603670e-02, 6.75819349e-03, 5.91140948e-02, -4.53736931e-02, 3.08122877e-02, -2.23298520e-02, -1.62059460e-02, 4.74171452e-02, -7.03755468e-02, -5.98351210e-02, 4.70205843e-02, -5.80502860e-03, 2.21272446e-02, -7.57156089e-02, 4.97078523e-02, -3.15653495e-02, 4.92160209e-02, -3.86846699e-02, 2.09846185e-03, -5.62710315e-02, -2.08172686e-02, 7.27586523e-02, 3.23538706e-02, -1.12844165e-02, 4.76871207e-02, 7.68466992e-03, -1.23470398e-02, 2.14785617e-02, 2.89252382e-02, 3.02119087e-02, 2.10444834e-02, 2.13446151e-02, -3.27234976e-02, -6.14904426e-02, -6.49609463e-03, -8.24379921e-02, -2.68000960e-02, 7.73761328e-03, -4.13497761e-02, 1.23230414e-02, -6.00103587e-02, 1.27278063e-02] as sentEmbed
)
SELECT cossim(doc.sentEmbed, query.sentEmbed) sim, date, lead, previewUrl
FROM `gdelt-bq.gdeltv2.gsg_iatvsentembed` doc, query
WHERE DATE(date) >= "2020-11-01"
ORDER BY sim DESC
LIMIT 10000
This query compares the fact check claim vector against all of the captioning sentences in our dataset since November 1, 2020, totaling more than 13.1 million sentences. BigQuery's massive scale really shines here as it completes all 13.1 million cosine similarity comparisons, sorts the results and outputs the top 10,000 most similar results in just 28 seconds from start to finish.
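Rather than copy-pasting the printed vector by hand, the array literal for the query could also be generated in the notebook. The sketch below assumes the embedding is available as a NumPy-compatible array; `to_sql_array` is a hypothetical helper, and the two-value vector is a stand-in for the real 512-dimension one.

```python
import numpy as np

def to_sql_array(vec):
    """Format a (1, N) or (N,) vector as a bracketed SQL array literal."""
    flat = np.asarray(vec, dtype=np.float64).ravel()
    return "[" + ", ".join(f"{x:.8e}" for x in flat) + "]"

# Stand-in for the (1, 512) embedding printed earlier.
example = np.array([[0.5, -0.25]], dtype=np.float32)

literal = to_sql_array(example)
print(literal)  # [5.00000000e-01, -2.50000000e-01]

# The literal can then be spliced into the WITH clause of the query.
with_clause = f"WITH query AS ( select {literal} as sentEmbed )"
print(with_clause)
```

The `.8e` format mirrors the precision of the values shown in the printed tensor, so the generated literal looks the same as the hand-pasted one.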
You can see the top 20 most similar sentences in the table below. These include examples like the following:
- "many people have a fear that the vaccine will cause a lot of harm or maybe the goal of the vaccine is somehow tracking people with a microchip or some connection to 5g." (Fox News at Night With Shannon Bream December 7, 2020 8:00pm PST)
- "false claims that vaccines will be used to inject microchips, to cause deliberate harm or to alter your dna are resurfacing on social media." (BBC News December 9, 2020 5:00pm GMT)
- "among the misleading notions is the idea that the vaccines are delivered with a microchip or bar code to keep track of people as well as a lie that the vaccines will hurt everyone's health." (Deadline White House MSNBC December 17, 2020 1:00pm PST)
- "anything from the nano chip is in the vaccine and people will be tracked, to where the whole covid-19 pandemic is a hoax and untrue." (Early Start With Christine Romans and Laura Jarrett CNN March 25, 2021 2:00am PDT)
The last sentence is particularly interesting in that it would have been missed by a traditional keyword search since it references a "nano chip" rather than a "microchip." A keyword search would not be able to recognize that a "nano chip" is likely the same as a "microchip" in this context, but the embedding model is able to see that the two are likely one and the same in this particular sentence and thus yields an embedding that is highly similar to our fact check claim.
Row | sim | date | lead | previewUrl
---|---|---|---|---
1 | 0.6979889356384344 | 2020-12-08 04:55:06 UTC | MANY PEOPLE | https://archive.org/details/FOXNEWSW_20201208_040000_Fox_News_at_Night_With_Shannon_Bream/start/3306
2 | 0.6853320074239302 | 2020-12-08 08:55:22 UTC | MANY PEOPLE | https://archive.org/details/FOXNEWSW_20201208_080000_Fox_News_at_Night_With_Shannon_Bream/start/3322
3 | 0.6629430437153419 | 2020-12-17 21:30:17 UTC | AMONG THE | https://archive.org/details/MSNBCW_20201217_210000_Deadline_White_House/start/1817
4 | 0.6557845795818997 | 2021-03-25 09:48:53 UTC | ANYTHING FROM | https://archive.org/details/CNNW_20210325_090000_Early_Start_With_Christine_Romans_and_Laura_Jarrett/start/2933
5 | 0.6542869766187697 | 2020-12-13 17:11:01 UTC | WE DON'T | https://archive.org/details/FOXNEWSW_20201213_170000_Americas_News_Headquarters/start/661
6 | 0.651444391945469 | 2021-07-13 23:52:32 UTC | WHO STILL | https://archive.org/details/CNNW_20210713_230000_Erin_Burnett_OutFront/start/3152
7 | 0.6475100945917306 | 2020-12-17 21:33:55 UTC | MORE REGULATED | https://archive.org/details/MSNBCW_20201217_210000_Deadline_White_House/start/2035
8 | 0.6444304289007284 | 2020-12-09 20:13:17 UTC | Along with | https://archive.org/details/BBCNEWS_20201209_200000_BBC_News/start/797
9 | 0.6444304289007284 | 2020-12-09 16:52:54 UTC | Along with | https://archive.org/details/BBCNEWS_20201209_140000_BBC_News/start/10374
10 | 0.6418026033632469 | 2021-06-09 20:49:30 UTC | THERE IS | https://archive.org/details/CNNW_20210609_200000_The_Lead_With_Jake_Tapper/start/2970
11 | 0.6393235478780447 | 2021-05-04 19:25:44 UTC | THEY'RE GOING | https://archive.org/details/CNNW_20210504_190000_CNN_Newsroom_With_Alisyn_Camerota_and_Victor_Blackwell/start/1544
12 | 0.6384793753449715 | 2021-07-19 22:02:17 UTC | OVER THE | https://archive.org/details/FOXNEWSW_20210719_220000_Special_Report_With_Bret_Baier/start/137
13 | 0.6333324162072822 | 2020-12-09 17:48:46 UTC | False claims | https://archive.org/details/BBCNEWS_20201209_170000_BBC_News/start/2926
14 | 0.6312473942503065 | 2020-12-17 15:31:31 UTC | IT'S ACTUALLY | https://archive.org/details/CNNW_20201217_150000_CNN_Newsroom_With_Poppy_Harlow_and_Jim_Sciutto/start/1891
15 | 0.6285368401936927 | 2020-12-23 22:48:40 UTC | WE'RE DOING | https://archive.org/details/MSNBCW_20201223_210000_Deadline_White_House/start/6520
16 | 0.6282257384988286 | 2020-12-03 15:14:25 UTC | From March | https://archive.org/details/BBCNEWS_20201203_140000_BBC_News/start/4465
17 | 0.6282257384988286 | 2020-12-03 14:51:40 UTC | From March | https://archive.org/details/BBCNEWS_20201203_140000_BBC_News/start/3100
18 | 0.6224919468230589 | 2021-02-03 13:02:52 UTC | unknown – | https://archive.org/details/BBCNEWS_20210203_130000_BBC_News_at_One/start/172
19 | 0.6220374958075189 | 2020-11-17 16:46:11 UTC | will it | https://archive.org/details/BBCNEWS_20201117_140000_BBC_News/start/9971
20 | 0.6219834389078501 | 2020-12-05 04:02:26 UTC | THEY'VE CLUED | https://archive.org/details/MSNBCW_20201205_040000_The_11th_Hour_With_Brian_Williams/start/146

We're tremendously excited to see what kinds of new applications this enables!