Tips For Building A Query Service Using The New Global Similarity Graph Document Embeddings

Given all of the interest and questions we've heard about how to build production-scale query services using the new Global Similarity Graph Document Embeddings dataset, we are releasing this list of tips to help get you started!

Creating Query Embeddings

The first step in working with the embeddings is to efficiently convert your input queries into the same Universal Sentence Encoder embedding space used in the dataset. We use the Universal Sentence Encoder v4 (USEv4) model to create our embeddings, so to query the GSG dataset you must use exactly that model, since every embedding model produces different, incompatible embeddings.

Limitations Of Embeddings

It is important to recognize the strengths and limitations of embeddings:
  • Similarity rather than exact word search. Embeddings are a powerful way of looking past individual word choices towards the general topics they describe. If you want a search for "semiconductors" to return "microchips" and "nano chips," or to connect a "stabbing" and a "knife attack" or "White House" and "US President," they are an ideal choice. On the other hand, if the actual word itself is more important than its meaning (you specifically want coverage mentioning "Joe Biden" or "Donald Trump"), embeddings are not the best option. Similarly, if you need to distinguish between a "microchip" and a "semiconductor," embeddings might not provide sufficient distinguishing power, since they are designed to group related terms together, not differentiate between them.
  • No phrase or proximity matching. The GSG computes document-level embeddings, which capture the overall topical distribution of the entire article. While phrases will be understood to the degree they convey a concept distinct from the unrelated appearances of those words ("White House" vs "house" and "white" appearing separately), the keyword search concept of a "phrase search" or proximity constraints requiring a given set of keywords to appear near one another are not supported by embeddings.
  • What does it mean for words to be "similar"? Embeddings work by essentially grouping all of the words of a language into 512 dimensions based on their statistical cooccurrence in large volumes of training material. Unlike human-created taxonomies, these 512 dimensions are statistically generated entirely by machine, meaning that the specific groupings that result may or may not match our own understanding of those terms. This means some searches will return exactly what you expect, while others may yield more peripheral content.
  • Finding isolated references. When you search using the embedding for "vaccine microchip," will the search return articles that only casually reference your query? Similar to keyword searching, the more times the query terms appear in an article, the higher its cosine similarity will be. The embedding for a 10,000-word article that mentions the words "vaccine" and "microchip" each once will have a far lower cosine similarity to "vaccine microchip" than will a 500-word article that mentions both words repeatedly. At the same time, keyword searches will still return an isolated reference to "vaccine" and "microchip" in that 10,000-word article, whereas embeddings will yield such a low similarity that the article will be indistinguishable from other articles that don't mention the terms at all. This is the point of embedding-based search: it scores articles by overall topical similarity, even if they use entirely different language, but at the same time it is less useful for identifying isolated references.
  • How partial similarity works. When you search using the embedding for "vaccine microchip," how are resulting articles ranked when using cosine similarity? An article that mentions both "vaccine" and "microchip" repeatedly will be ranked the highest, but it is important to note that an article that mentions neither term, but perhaps contains other peripherally related words, may still have a reasonably high similarity score. You can see how these scores typically range in our experiments using the USE family of models. The problem is that you can't weight one term over another in the case of partial matches the way you can with keyword searches. An article that mentions automotive microchips heavily could have an identical similarity score to one discussing vaccines heavily; there is no way to bias the query more towards vaccines than microchips. (You could obviously add additional vaccine-related terms to the query, but that would then penalize articles that mention vaccines and microchips equally.) In contrast, with keyword searching, most indexing platforms allow term weighting. The sketch after this list illustrates how this similarity-based ranking behaves in practice.
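
To make the scoring behavior described above concrete, here is a minimal ranking sketch using random vectors as stand-ins for real USEv4 embeddings (all variable names here are illustrative and not part of the GSG tooling). It shows that for unit-length vectors cosine similarity reduces to a dot product, and that every article receives a score rather than a binary match/no-match:

#illustrative ranking sketch: random vectors stand in for real USEv4 embeddings...
import numpy as np

rng = np.random.default_rng(0)

#pretend query and document embeddings (real GSG embeddings are 512-dim USEv4 vectors)...
query = rng.random(512)
docs = rng.random((1000, 512))

#L2-normalize so cosine similarity reduces to a dot product...
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

#every article receives a score; there is no hard match/no-match cutoff, only a ranking...
similarities = docs @ query
top10 = np.argsort(-similarities)[:10]
for rank, idx in enumerate(top10, 1):
    print(rank, idx, round(float(similarities[idx]), 4))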

In practice, the limitations above mean for many use cases you may want to combine the GSG with the GKG, GEG or other GDELT datasets (using the article URL as the unique join key) to perform additional filtering. For example, you could search the GKG for vaccine-related coverage and then use the GSG to cluster that coverage.
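
As a rough sketch of that kind of join (the column names and sample values here are illustrative assumptions, not the datasets' actual schemas), you could filter a set of article URLs from the GKG and then attach their GSG embeddings by merging on the URL:

#illustrative join sketch: column names are assumptions, not the actual GDELT schemas...
import pandas as pd

#a list of article URLs you already filtered from the GKG (e.g. vaccine-related coverage)...
gkg_urls = pd.DataFrame({"url": ["https://example.com/article1", "https://example.com/article2"]})

#GSG rows, each with an article URL and its 512-dim embedding...
gsg = pd.DataFrame({
    "url": ["https://example.com/article1", "https://example.com/article3"],
    "embedding": [[0.01] * 512, [0.02] * 512],
})

#the article URL acts as the unique join key between the two datasets...
matched = gkg_urls.merge(gsg, on="url", how="inner")
print(matched)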

Hardware Requirements

We are using the DAN-based USE model, which is extremely efficient and does not require hardware acceleration in the form of a GPU or TPU, only a CPU. It benefits strongly from faster processors (in GCP the C2 family provides a 2x speedup over E2 VMs), but requires only CPU resources to run, offering maximal flexibility in running the model anywhere. The particular workflow we use is optimized for CPUs that support AVX512 instruction sets ("optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA"), so that may also factor into your processor selection.

Most importantly, this particular DAN model has effectively linear inference time with input length, meaning you can convert an entire document into an embedding at almost the same speed as a short search engine query. This makes it easy to use more complex inputs for your queries and means that even when computing embeddings for large inputs you do not need accelerators.
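
If you want to verify this on your own hardware, a quick timing sketch (assuming TensorFlow and tensorflow_hub are installed; the synthetic "document" below is just an illustrative stand-in, and exact numbers will vary by CPU) is to embed a short query and a long passage back to back:

#rough timing sketch; exact numbers will vary by CPU...
import time
import tensorflow_hub as hub

embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embed_use(["warm up"])  #the first call includes graph tracing, so warm up before timing

short_query = ["vaccine microchip"]
long_document = ["vaccine microchip tracking claim " * 300]  #synthetic stand-in for a full article

for label, batch in (("short query", short_query), ("long document", long_document)):
    start = time.time()
    embed_use(batch)
    print(label, round(time.time() - start, 4), "seconds")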

Creating One-Off Embeddings

When first experimenting with the dataset, you can use the free Colab service to manually convert a textual passage into a USEv4 embedding to prototype different ideas. Just create a new Colab notebook and run the following code (replacing the sample sentence with any string):

#install and load libraries...
!pip install tensorflow_text
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text as text # Needed for loading universal-sentence-encoder-cmlm/multilingual-preprocess
import numpy as np

#L2-normalize each embedding...
def normalization(embeds):
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds/norms

sent = tf.constant(["A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that “tracks the location of the patient.” The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people."])
embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sente = embed_use(sent)
sente = normalization(sente)
print(repr(sente))

The output will be a 512-dimension array:

<tf.Tensor: shape=(1, 512), dtype=float32, numpy= array([[-6.76356838e-04, 3.35783325e-02, -6.67930841e-02, -7.14375153e-02, -3.08108889e-02, 1.07741086e-02, 2.11005341e-02, 6.49546608e-02, 4.30178866e-02, -6.82501197e-02, 8.06940496e-02, 4.91752811e-02, 6.29389361e-02, 1.66880507e-02, -5.67101641e-03, -3.70305106e-02, -7.97090977e-02, -7.23296637e-03, -7.27019385e-02, 3.61410975e-02, -1.81389842e-02, 3.56576918e-03, -7.38827288e-02, 4.35053706e-02, -3.35104694e-03, 6.37104064e-02, 1.92584172e-02, -5.36565781e-02, 3.81646678e-02, 3.99802737e-02, -5.32769971e-02, 8.15957189e-02, -2.47750692e-02, 4.34365049e-02, 7.18429685e-02, 7.98831061e-02, -8.14278647e-02, 7.30962753e-02, -3.90970185e-02, -2.42321957e-02, 8.85859481e-04, 1.59769729e-02, 1.73619445e-02, -6.09335937e-02, -5.77191003e-02, -5.14351763e-03, 4.95849364e-02, 5.45341708e-02, 3.67251299e-02, 5.23556210e-03, -7.26118013e-02, -1.82263218e-02, 2.83526741e-02, 7.12847263e-02, -7.51046464e-02, -3.59605625e-02, -4.63198870e-02, -5.76271117e-02, 3.94778773e-02, -7.19692186e-02, -1.35769062e-02, 6.82483837e-02, -3.86933982e-02, -4.68094014e-02, 3.57298478e-02, -6.87625632e-02, 3.24299969e-02, 6.28880039e-02, -7.18246251e-02, 3.15887854e-02, -5.05154385e-05, 4.15558480e-02, -5.05858241e-03, -2.23924946e-02, 6.13835901e-02, -3.54572162e-02, 4.43822816e-02, -4.83701378e-02, -5.33702224e-02, 4.93556038e-02, -8.73594370e-04, -7.18877092e-02, -2.08738167e-02, -2.28899792e-02, 1.76748831e-03, -3.10755782e-02, 1.93924904e-02, 5.02826758e-02, 4.10971418e-03, -5.37503473e-02, 5.57333715e-02, 2.84079444e-02, -1.86564196e-02, 2.70265304e-02, -5.42857102e-04, 8.57702456e-03, 6.09128438e-02, -5.45500219e-02, -4.38415771e-03, -4.50687436e-03, -5.93304411e-02, 5.69677241e-02, -5.57250157e-02, -5.81391938e-02, 6.88386261e-02, 7.11042359e-02, 6.56928271e-02, 1.66122485e-02, 6.30950481e-02, -4.72512022e-02, -6.91004544e-02, 1.08533464e-02, 3.22929490e-03, -9.46690096e-04, 6.15085438e-02, -4.10078615e-02, -1.52225317e-02, 7.38443527e-03, -2.84125120e-03, 3.94875668e-02, -5.78187183e-02, -3.46644572e-03, 1.43317459e-02, -5.20273345e-03, -4.29149270e-02, -5.53261004e-02, -1.87431239e-02, -4.35798094e-02, -1.23266215e-02, 6.26963973e-02, 6.76322281e-02, 8.16316083e-02, 2.64641959e-02, -3.53941061e-02, -5.05655743e-02, -3.18637267e-02, -1.14524448e-02, 7.41548762e-02, -6.11535534e-02, -6.55739233e-02, 3.84859666e-02, 6.66911826e-02, -7.20713809e-02, 2.92808632e-03, 5.62359951e-02, 9.51226894e-03, 4.56910357e-02, -7.51297697e-02, 6.90657347e-02, 4.21055108e-02, -2.59263385e-02, 9.81741399e-03, 2.94160913e-03, -5.97751215e-02, 5.50655387e-02, 5.55065367e-03, 4.92441691e-02, 5.39373793e-02, 2.63481354e-03, 5.61112165e-02, -1.34488298e-02, 4.05395217e-02, -5.48825506e-03, -6.58439845e-02, -2.58194543e-02, -1.58555657e-02, 2.34762882e-03, -1.25716645e-02, -4.26669009e-02, 1.60640255e-02, 5.15688658e-02, -7.21045434e-02, -1.73300430e-02, -2.93727703e-02, 1.51059516e-02, 3.61871794e-02, -1.12395249e-02, 2.94390563e-02, 2.10165903e-02, -2.14187857e-02, -6.51167110e-02, -1.82202216e-02, 2.65326332e-02, 6.87541533e-03, 7.72116631e-02, -3.04351989e-02, -8.53389502e-03, 5.89442579e-03, 2.55114846e-02, -3.72634716e-02, 1.35741597e-02, -2.87334360e-02, 2.89946962e-02, -1.76235684e-03, 4.42778356e-02, 4.59584370e-02, -2.07500421e-02, -1.66762974e-02, 3.10814641e-02, -4.84270193e-02, 3.69234122e-02, 1.60416390e-03, 1.90665293e-02, -2.38827430e-02, 2.87359916e-02, -7.04021528e-02, 5.11950590e-02, 2.37841941e-02, 2.78002135e-02, 6.73737004e-02, 1.23101231e-02, -2.61966907e-03, 
5.78176118e-02, 3.68967131e-02, -3.38435126e-03, 3.38038430e-02, 7.10640773e-02, -4.38991282e-03, -2.46944465e-03, 6.89607188e-02, -2.10304111e-02, 1.74455028e-02, 4.72253896e-02, 7.55667314e-02, 4.17576358e-02, 5.06599247e-02, -4.47778143e-02, -2.82419380e-02, -5.05971462e-02, -2.29050219e-02, -6.34119362e-02, -2.48881299e-02, -2.07891967e-02, -8.11395496e-02, -7.57279713e-03, 1.01827094e-02, 2.36734990e-02, -8.92031100e-03, -2.52159638e-03, -4.42645922e-02, 2.72163115e-02, -4.16662544e-02, 6.28380328e-02, -6.46699443e-02, 4.98163030e-02, 2.40474264e-03, -4.35062796e-02, 9.96236573e-04, 3.21699865e-03, -7.41114616e-02, 5.10127423e-03, 5.91132231e-03, -2.09343582e-02, 5.36168702e-02, 6.32562339e-02, -1.18424380e-02, -5.33314049e-02, 8.15369189e-02, -3.56774591e-02, -3.64913158e-02, -2.39417814e-02, -1.68477502e-02, 4.09753472e-02, -5.00184186e-02, -3.02095693e-02, -6.65327683e-02, 6.97435886e-02, 6.97659627e-02, 2.44959928e-02, -7.88502675e-03, -1.70990340e-02, -3.60420384e-02, -1.89642422e-02, -7.21183345e-02, -6.83112964e-02, 5.45631945e-02, 5.56440577e-02, -6.96792156e-02, 5.17817736e-02, -5.04019484e-03, -7.98536614e-02, -6.72034398e-02, 3.57697830e-02, 7.33052269e-02, -6.80490360e-02, -6.17038347e-02, -7.66119808e-02, 5.73239997e-02, -1.82283260e-02, -3.99673171e-02, 5.84224798e-02, -7.66021237e-02, 6.21817261e-02, -2.64632311e-02, -2.70551220e-02, -2.09122039e-02, 4.49912027e-02, 5.27962260e-02, -1.61876865e-02, 2.99768839e-02, -3.36280465e-02, -1.51605671e-03, -2.47574896e-02, 5.60405292e-03, 9.68514197e-03, -1.68982614e-02, -7.41703138e-02, 8.06703139e-03, -1.44717349e-02, -1.71089768e-02, 5.54176830e-02, 5.78010641e-02, -1.78876668e-02, -1.29997097e-02, 5.63468151e-02, 7.27057606e-02, -1.10625252e-02, 7.14442134e-03, 2.05701385e-02, 5.05811423e-02, 8.63553584e-03, 5.28230928e-02, 5.43508772e-03, -4.37706057e-03, -1.69210937e-02, 7.57706687e-02, 1.81356780e-02, 4.76261452e-02, 1.06395511e-02, -7.35997260e-02, 6.64972737e-02, 5.22572510e-02, 5.71820438e-02, 2.35127471e-02, 2.56717186e-02, -3.12523060e-02, 3.06110885e-02, -1.37786288e-03, -2.19957810e-02, -3.60827036e-02, 5.08690346e-03, -1.49257006e-02, 7.51743838e-02, 5.96603751e-03, -2.87195779e-02, -6.46486282e-02, -5.29822260e-02, -1.47496245e-03, -7.67807290e-02, 3.60531062e-02, 6.81242943e-02, 2.16352921e-02, -8.55341647e-03, -2.21430194e-02, -8.83253198e-03, 3.59478197e-03, -7.58751631e-02, 5.91248423e-02, 4.42272760e-02, -3.39478180e-02, 4.03287634e-02, -4.57744524e-02, 4.45390902e-02, 6.11837544e-02, -3.16450559e-02, -7.24180341e-02, 2.00887918e-02, -3.19629908e-02, -6.86090300e-03, 3.28799486e-02, -7.06641898e-02, 1.93985011e-02, -3.90757024e-02, -3.66524868e-02, 5.76053411e-02, 6.35175093e-04, 4.37529907e-02, -1.18877320e-02, 6.06463850e-02, -9.88359284e-03, 3.63793671e-02, -7.47962818e-02, -7.39435032e-02, 6.32128567e-02, 6.12870194e-02, -6.58575967e-02, 2.75015971e-03, -4.56172265e-02, 6.30888119e-02, 9.60739609e-03, -5.22800013e-02, -6.43881708e-02, 2.20250804e-02, 1.62106263e-03, -4.56735268e-02, -2.72764824e-02, 1.55690350e-02, -1.04821082e-02, 1.09128319e-02, 4.22615670e-02, 3.59000675e-02, 2.44700797e-02, 1.16510596e-02, -2.50982083e-02, 5.81165403e-02, -2.99764648e-02, 3.09661478e-02, 4.46595661e-02, -5.85869774e-02, -6.54702336e-02, -4.22021002e-02, -1.62350927e-02, -2.27494091e-02, 7.32957125e-02, 7.31576756e-02, -5.63123915e-03, -1.70655418e-02, -1.72184445e-02, 7.96012357e-02, 5.04064001e-02, 1.87336896e-02, 4.66014594e-02, 5.06016947e-02, 2.99742296e-02, 2.36040205e-02, -5.34015521e-02, 1.35052192e-03, 
6.80805445e-02, 2.22724825e-02, -3.01939417e-02, -7.31360614e-02, -2.51521859e-02, 4.05842923e-02, 1.60862431e-02, -2.25230474e-02, 2.86010765e-02, 2.29199734e-02, 2.20593587e-02, -5.47554232e-02, 5.78941219e-02, 6.56009391e-02, -7.13857934e-02, -7.48909358e-03, 6.78754002e-02, -2.92496174e-03, -6.95068836e-02, 5.49913049e-02, 3.31136771e-02, -2.13340521e-02, -4.93099019e-02, -1.47924507e-02, -6.91533834e-02, -3.78118120e-02, -6.53230399e-02, -4.87434752e-02, 2.96516586e-02, -7.58804101e-03, -1.24682412e-02, -7.59115368e-02, -3.75635456e-03, 2.93405913e-02, -5.34483194e-02, -1.71132796e-02, 5.20518832e-02, -6.30412847e-02, -5.12841195e-02, -3.42662632e-02, -5.49382344e-02, -6.89640343e-02, 6.04783110e-02, -2.27603670e-02, 6.75819349e-03, 5.91140948e-02, -4.53736931e-02, 3.08122877e-02, -2.23298520e-02, -1.62059460e-02, 4.74171452e-02, -7.03755468e-02, -5.98351210e-02, 4.70205843e-02, -5.80502860e-03, 2.21272446e-02, -7.57156089e-02, 4.97078523e-02, -3.15653495e-02, 4.92160209e-02, -3.86846699e-02, 2.09846185e-03, -5.62710315e-02, -2.08172686e-02, 7.27586523e-02, 3.23538706e-02, -1.12844165e-02, 4.76871207e-02, 7.68466992e-03, -1.23470398e-02, 2.14785617e-02, 2.89252382e-02, 3.02119087e-02, 2.10444834e-02, 2.13446151e-02, -3.27234976e-02, -6.14904426e-02, -6.49609463e-03, -8.24379921e-02, -2.68000960e-02, 7.73761328e-03, -4.13497761e-02, 1.23230414e-02, -6.00103587e-02, 1.27278063e-02]], dtype=float32)>

The inner array of float32's is the embedding of that particular string under the USEv4 model.

Production Embedding Generation

In a production application you obviously need a more automated workflow that supports high-volume embedding computation. Thankfully, it is trivial to run USEv4 under TensorFlow Serving in a Docker container, giving you a RESTful server that supports high-volume embedding computation. In fact, this workflow is so efficient that the limiting bottleneck rapidly becomes the Docker networking layer, so it is important to use the "--net=host" parameter, which yields roughly a 20% speedup for high-volume use cases.

To start from scratch on GCP, spin up a new C2 VM (this will provide roughly a 2x speedup over E2 VMs, though you can use E2 VMs as needed). The USEv4 model is so efficient that you won't need many cores: internally we use a 4-core C2 VM to generate all of our embeddings, with spare CPU capacity to run other models on the same machine.

Create a new directory on the VM:

/TENSORFLOW/models/universal-sentence-encoder/4/

Then click the orange "Download" button on the Universal Sentence Encoder v4 model TFHub page to download the roughly 1GB compressed model into that directory and unpack it there. When you are done it should look like:

$find /TENSORFLOW/models/universal-sentence-encoder/4/
/TENSORFLOW/models/universal-sentence-encoder/4/
/TENSORFLOW/models/universal-sentence-encoder/4/assets
/TENSORFLOW/models/universal-sentence-encoder/4/variables
/TENSORFLOW/models/universal-sentence-encoder/4/variables/variables.data-00000-of-00001
/TENSORFLOW/models/universal-sentence-encoder/4/variables/variables.index
/TENSORFLOW/models/universal-sentence-encoder/4/saved_model.pb

Install Docker on your VM and then run the following command to download and start TensorFlow Serving with the USEv4 model:

docker run -t --net=host --restart always --name tf-serve-universal-sentence-encoder -v "/TENSORFLOW/models:/models" -e MODEL_NAME="universal-sentence-encoder" tensorflow/serving --rest_api_port=8501 --enable_model_warmup=true &

This single command will download the TensorFlow Serving image to your VM and run it, and in a matter of seconds you will have a production-capable embedding server running!

Here's an explanation of the parameters we use:

  • -t: Allocates a pseudo-TTY
  • --net=host: By default Docker will manage networking between the container and host VM to provide isolation. The USEv4 model is so efficient that this isolation layer quickly becomes a bottleneck, so this option causes the container to share the host's networking instead, providing roughly a 20% speedup. Note that this also makes the server available to the world on the designated port, so you will need to secure the VM to ensure that only trusted local traffic can reach it. This introduces additional security concerns and is not suitable for shared tenancy environments where the container is running on a VM or network shared with other users. You should carefully consider the security and other ramifications of this option and may need to take additional mitigation steps, such as running it on a VPC-isolated dedicated VM with careful ingress controls. On a closed network, this option has the added benefit that any process running on any other VM in the local network can generate embeddings simply by sending RESTful traffic to the VM running TensorFlow Serving, allowing it to act as a shared embedding service dedicated solely to generating embeddings.
  • --restart always: Restarts the container if it crashes or the system reboots.
  • --name tf-serve-universal-sentence-encoder: This names the container to make it easy to manage using Docker commands. You can select any name you wish.
  • -v "/TENSORFLOW/models:/models": This binds the VM host directory "/TENSORFLOW/models" that we downloaded the model into onto the container's filesystem at mount point "/models/" where the server will look for the model. You can think of this like mounting a USB drive onto your computer – it simply grafts it onto the container's filesystem.
  • -e MODEL_NAME="universal-sentence-encoder": Tells TensorFlow Serving which model to look for; this must match the subdirectory you downloaded the model into earlier. In other words, with "-v "/TENSORFLOW/models:/models"" we've told TensorFlow Serving that all models are stored under "/TENSORFLOW/models" on our VM, and "-e MODEL_NAME="universal-sentence-encoder"" tells it which subdirectory under that directory holds the model we want, in this case "/TENSORFLOW/models/universal-sentence-encoder/". Note the caveat that TensorFlow Serving doesn't look for the model files directly in "/TENSORFLOW/models/universal-sentence-encoder/"; it expects one more subdirectory whose name is a number indicating the model version, under which the model files live (the "4" above). Don't worry about this number: by default TensorFlow Serving will pick the subdirectory with the highest version number and load the model inside it.
  • tensorflow/serving: Tells Docker to pull and run the official TensorFlow Serving image.
  • --rest_api_port=8501: Tells TensorFlow Serving to listen for REST requests on port 8501 (you can change this to whatever port you want).
  • --enable_model_warmup=true: Tells TensorFlow Serving to load and warm up the model immediately on server startup so it can start processing requests as quickly as possible.

You can change many of these parameters to suit your specific needs, including configuring HTTPS, though we strongly recommend securing the VM to accept only local trusted traffic.

To generate an embedding, you can simply connect to the server locally (or from elsewhere within your network) using curl:

time curl -d '{"instances": ["A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that tracks the location of the patient. The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people."]}' -X POST http://localhost:8501/v1/models/universal-sentence-encoder:predict > embedding.json

This will yield the output:

{
"predictions": [[-0.000676348631, 0.0335783474, -0.0667930841, -0.0714375153, -0.0308108795, 0.0107741114, 0.0211005267, 0.0649546608
, 0.0430179, -0.0682501197, 0.080694057, 0.0491752848, 0.0629389435, 0.0166880749, -0.00567102199, -0.0370305069, -0.0797091, -0.00723295
147, -0.0727019385, 0.0361410975, -0.0181389861, 0.0035657594, -0.0738827288, 0.0435053743, -0.0033510346, 0.0637104064, 0.0192584228, -0
.0536565781, 0.0381646752, 0.0399802737, -0.0532769971, 0.0815957189, -0.0247750729, 0.0434365049, 0.0718429685, 0.0798831, -0.0814278647
, 0.0730962753, -0.0390970185, -0.0242322031, 0.000885874149, 0.0159769692, 0.0173619501, -0.0609335974, -0.0577191, -0.00514350412, 0.04
95849475, 0.0545341708, 0.0367251299, 0.00523556164, -0.0726118, -0.0182262957, 0.0283526629, 0.0712847263, -0.0751046464, -0.0359605625,
-0.0463198796, -0.0576271415, 0.0394778885, -0.0719692186, -0.0135769183, 0.0682483837, -0.0386933871, -0.0468093865, 0.0357298516, -0.0
687625632, 0.0324299969, 0.0628880039, -0.0718246102, 0.0315887742, -5.05191238e-05, 0.0415558442, -0.00505858846, -0.022392489, 0.061383
5901, -0.0354572199, 0.0443822891, -0.0483701341, -0.0533702075, 0.0493555963, -0.000873614685, -0.0718877092, -0.020873826, -0.022889973
6, 0.00176751043, -0.0310755726, 0.0193924773, 0.0502826869, 0.00410970859, -0.0537503473, 0.055733379, 0.0284079593, -0.0186564252, 0.02
70265248, -0.000542867, 0.00857701153, 0.0609128438, -0.054550007, -0.00438414281, -0.00450687204, -0.0593304411, 0.0569677204, -0.055725
0082, -0.0581391901, 0.0688386261, 0.0711042285, 0.0656928346, 0.016612241, 0.0630950406, -0.0472512022, -0.0691004619, 0.0108533464, 0.0
0322929746, -0.000946699933, 0.0615085438, -0.0410078727, -0.0152225485, 0.00738444738, -0.00284126587, 0.0394875593, -0.0578187332, -0.0
0346646039, 0.0143317403, -0.00520272, -0.0429149233, -0.0553261, -0.0187431164, -0.0435798056, -0.012326614, 0.0626964, 0.0676322281, 0.
0816315934, 0.0264641885, -0.0353941098, -0.0505655892, -0.0318637304, -0.0114524318, 0.0741548762, -0.0611535534, -0.0655739307, 0.03848
59666, 0.0666911826, -0.0720713735, 0.00292809354, 0.056236, 0.00951228477, 0.0456910245, -0.0751297697, 0.0690657198, 0.0421055295, -0.0
259263497, 0.0098174205, 0.00294160773, -0.0597751215, 0.0550655387, 0.00555063877, 0.0492441654, 0.0539373755, 0.00263481913, 0.05611121
65, -0.0134488028, 0.0405395441, -0.00548824575, -0.0658439845, -0.0258194767, -0.0158555564, 0.00234764232, -0.0125716645, -0.0426669, 0
.0160640404, 0.0515688509, -0.0721045509, -0.0173300859, -0.0293727927, 0.0151059246, 0.0361871794, -0.0112395268, 0.02943906, 0.02101659
03, -0.0214187689, -0.0651167184, -0.0182202179, 0.0265326332, 0.00687541254, 0.0772116631, -0.0304352, -0.00853388291, 0.00589442858, 0.
0255115, -0.0372634716, 0.0135741215, -0.028733423, 0.0289946925, -0.00176235195, 0.0442778431, 0.0459584296, -0.0207500309, -0.016676293
7, 0.0310814567, -0.0484270304, 0.0369233973, 0.00160414423, 0.0190665163, -0.0238827504, 0.0287359972, -0.0704021454, 0.0511950664, 0.02
37841923, 0.0278002098, 0.0673737079, 0.0123101231, -0.00261967396, 0.057817623, 0.0368967168, -0.0033843487, 0.0338038355, 0.0710640848,
-0.00438991282, -0.0024694344, 0.0689607263, -0.0210304018, 0.0174454954, 0.0472253859, 0.0755667239, 0.0417576358, 0.0506599173, -0.044
7778143, -0.0282419603, -0.0505971201, -0.0229050331, -0.0634119362, -0.0248881336, -0.0207892023, -0.0811395496, -0.00757278549, 0.01018
2715, 0.0236734971, -0.00892029889, -0.00252160383, -0.0442645811, 0.0272163153, -0.0416662693, 0.0628380328, -0.0646699443, 0.0498163, 0
.00240475242, -0.0435062647, 0.000996223069, 0.00321698212, -0.0741114616, 0.00510128867, 0.00591130089, -0.0209343582, 0.0536168665, 0.0
632562339, -0.011842425, -0.05333139, 0.0815369189, -0.035677474, -0.0364913084, -0.0239417814, -0.0168477558, 0.0409753472, -0.050018418
6, -0.0302095748, -0.0665327534, 0.0697435886, 0.0697659701, 0.0244960021, -0.0078850314, -0.0170990378, -0.0360420272, -0.0189642347, -0
.0721183345, -0.068311289, 0.0545632094, 0.0556440577, -0.0696792156, 0.0517817736, -0.00504019717, -0.0798536614, -0.0672034398, 0.03576
97979, 0.0733052194, -0.068049036, -0.0617038272, -0.0766119882, 0.057323996, -0.0182283167, -0.0399673171, 0.0584224872, -0.0766021237,
0.0621817335, -0.0264632106, -0.027055122, -0.0209121983, 0.0449912138, 0.0527962334, -0.0161876772, 0.0299768839, -0.0336280614, -0.0015
1602726, -0.0247574784, 0.00560406083, 0.00968513265, -0.0168982428, -0.0741703063, 0.00806702394, -0.0144717386, -0.0171089917, 0.055417
6942, 0.0578010604, -0.0178876836, -0.0129997116, 0.0563468151, 0.0727057606, -0.0110625494, 0.00714441901, 0.0205701515, 0.0505811572, 0
.00863553118, 0.0528230816, 0.00543510355, -0.00437705033, -0.01692109, 0.0757706687, 0.0181356873, 0.0476261675, 0.0106395548, -0.073599
726, 0.0664972886, 0.052257251, 0.0571820363, 0.0235127714, 0.0256717242, -0.0312523283, 0.0306110904, -0.00137786532, -0.0219957978, -0.
0360826813, 0.00508688949, -0.0149257137, 0.0751743838, 0.00596603705, -0.0287195724, -0.0646486282, -0.0529822074, -0.00147497049, -0.07
6780729, 0.0360531174, 0.0681242868, 0.0216353126, -0.00855341926, -0.0221430045, -0.00883254223, 0.00359477359, -0.0758751631, 0.0591248
386, 0.0442272723, -0.0339478217, 0.0403287634, -0.0457744375, 0.0445390902, 0.0611837655, -0.0316450596, -0.0724180341, 0.0200887751, -0
.0319629833, -0.00686091324, 0.0328799523, -0.0706642, 0.0193984825, -0.0390757062, -0.0366525, 0.0576053374, 0.000635154138, 0.043752998
1, -0.0118877217, 0.0606463887, -0.00988362264, 0.0363793708, -0.0747962818, -0.0739434883, 0.0632128641, 0.0612870194, -0.0658576, 0.002
7501646, -0.0456172116, 0.0630888119, 0.0096074, -0.05228002, -0.0643881708, 0.0220250934, 0.00162104797, -0.0456735343, -0.027276488, 0.
0155690443, -0.010482097, 0.0109128309, 0.042261567, 0.0359000601, 0.0244700778, 0.0116510782, -0.0250982307, 0.0581165403, -0.0299764667
, 0.030966159, 0.044659581, -0.0585869811, -0.065470241, -0.0422021113, -0.016235102, -0.0227494035, 0.0732957125, 0.0731576756, -0.00563
122751, -0.0170655493, -0.017218437, 0.0796012431, 0.0504064076, 0.018733697, 0.0466014594, 0.0506017171, 0.0299742259, 0.0236040205, -0.
0534015559, 0.00135049981, 0.0680805594, 0.0222724862, -0.0301939398, -0.0731360614, -0.0251521897, 0.0405843034, 0.0160862375, -0.022523
053, 0.028601069, 0.0229199827, 0.0220593438, -0.0547554269, 0.0578941293, 0.0656009391, -0.0713857859, -0.00748907635, 0.0678754076, -0.
00292494963, -0.0695068836, 0.0549912974, 0.0331136882, -0.0213340595, -0.0493099019, -0.0147924507, -0.0691533834, -0.0378118046, -0.065
3230473, -0.0487434715, 0.029651681, -0.00758803869, -0.0124682365, -0.0759115443, -0.00375634828, 0.0293405913, -0.0534482934, -0.017113
287, 0.0520518832, -0.0630412698, -0.0512841232, -0.0342662744, -0.0549382158, -0.0689640343, 0.0604782961, -0.0227603614, 0.00675819907,
0.0591141097, -0.0453736931, 0.0308122877, -0.0223298352, -0.016205946, 0.0474171452, -0.0703755394, -0.0598351248, 0.0470205843, -0.005
80503233, 0.0221272446, -0.0757156163, 0.0497078598, -0.0315653384, 0.049216032, -0.0386846811, 0.00209847093, -0.0562710315, -0.02081728
35, 0.0727586448, 0.0323538706, -0.0112844286, 0.0476871133, 0.00768467, -0.0123470463, 0.0214785431, 0.02892524, 0.0302119087, 0.0210444
964, 0.0213446226, -0.032723505, -0.0614904463, -0.00649608253, -0.0824379921, -0.0268001184, 0.0077376035, -0.0413497649, 0.0123230452,
-0.0600103587, 0.0127278212]
]
}

Just extract the inner array of 512 floating point numbers as the embedding!
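
If you prefer to call the server from Python rather than shelling out to curl, a minimal client sketch might look like the following (it assumes the server is listening on localhost:8501 as configured above; the embed() helper name is purely illustrative):

#minimal Python client sketch for the TensorFlow Serving REST endpoint configured above...
import json
import urllib.request
import numpy as np

def embed(texts):
    #POST the texts to the :predict endpoint, using the same payload shape as the curl example...
    payload = json.dumps({"instances": texts}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:8501/v1/models/universal-sentence-encoder:predict",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        predictions = json.loads(resp.read())["predictions"]
    #L2-normalize each embedding before storing or comparing it...
    vecs = np.array(predictions)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

print(embed(["vaccine microchip"]).shape)  #(1, 512)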

To compute multiple embeddings at once just provide an array of strings:

time curl -d '{"instances": ["query 1", "query 2", "query 3", "query 4"]}' -X POST http://localhost:8501/v1/models/universal-sentence-encoder:predict > embedding.json

This will yield the same output as above, but with an array of arrays.

On a GCP C2 VM we find that batching queries into arrays as above, such that each individual POST body is roughly 250K in size, achieves the highest throughput.
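
One simple way to approximate that guideline is a greedy packer like the sketch below (the 250K threshold and the batch_texts() helper are just one reasonable interpretation of the advice above, not part of any GDELT tooling):

#greedy batching sketch: pack texts until a single JSON payload would exceed roughly 250KB...
import json

def batch_texts(texts, max_bytes=250_000):
    batches, current = [], []
    for text in texts:
        candidate = current + [text]
        if current and len(json.dumps({"instances": candidate}).encode("utf-8")) > max_bytes:
            batches.append(current)
            current = [text]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

#each resulting batch becomes one POST body for the :predict endpoint shown above...
for batch in batch_texts(["article text one", "article text two", "article text three"]):
    print(len(batch), "texts in this batch")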

Note that USEv4 embeddings are "approximately normalized," but we have observed embeddings that are not unitized, so we recommend that production applications verify that each vector is unitized and L2 normalize it if not. You can see a simple BigQuery JavaScript UDF implementation below that computes the cosine similarity while normalizing each vector at the same time. Obviously, in a production application you wouldn't want to normalize at query time, so you would instead normalize when recording the embedding.

CREATE TEMPORARY FUNCTION cossim(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS '''
  // compute the dot product and squared magnitudes in a single pass...
  var sumt = 0, suma = 0, sumb = 0;
  for (var i = 0; i < a.length; i++) {
    sumt += a[i] * b[i];
    suma += a[i] * a[i];
    sumb += b[i] * b[i];
  }
  // normalize both vectors as part of the similarity computation...
  suma = Math.sqrt(suma);
  sumb = Math.sqrt(sumb);
  return sumt / (suma * sumb);
''';
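
For the normalize-when-recording approach, here is a minimal Python-side sketch (the l2_normalize() helper is illustrative; it assumes embeddings arrive as plain lists of floats from the REST endpoint):

#normalize each embedding once, when recording it, rather than at query time...
import numpy as np

def l2_normalize(embedding, eps=1e-12):
    vec = np.asarray(embedding, dtype=np.float64)
    #guard against a zero vector, then rescale to unit length...
    return (vec / max(float(np.linalg.norm(vec)), eps)).tolist()

#with unit-length vectors stored, cosine similarity reduces to a plain dot product at query time...
stored = l2_normalize([0.03] * 512)
print(round(float(np.dot(stored, stored)), 6))  #1.0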

Computing Similarities At Scale

Once you've converted your query to a USEv4 embedding, how do you efficiently query the GSG dataset? Even querying just a few days' worth of news coverage requires computing millions of 512-dimension cosine similarities, which is extremely slow and computationally demanding.

We will devote a future blog post to various approaches to this problem, but suffice it to say that efficiently searching large embedding datasets is an active area of research, with an array of approaches to what is known as Approximate Nearest Neighbor (ANN) search.

Some solutions:

  • Shard embeddings yourself using Locality Sensitive Hashing (LSH). Implementations exist that approximate most similarity metrics, including cosine similarity, and you can perform additional ranking within each bucket at query time.
  • Use an off-the-shelf ANN library like Annoy (see the sketch after this list).
  • Use a cloud-based solution like Vertex Matching Engine.
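
As a concrete illustration of the Annoy option above, here is a sketch that indexes random unit vectors in place of real GSG embeddings (the parameters shown are starting points rather than tuned values; Annoy's "angular" metric ranks equivalently to cosine similarity for unit-length vectors):

#minimal Annoy sketch (pip install annoy): index unit-length 512-dim vectors and query nearest neighbors...
import numpy as np
from annoy import AnnoyIndex

dims = 512
index = AnnoyIndex(dims, "angular")  #angular distance ranks the same as cosine similarity for unit vectors

#random stand-ins for GSG document embeddings; in practice, load the real vectors from the dataset...
rng = np.random.default_rng(0)
vectors = rng.random((10_000, dims)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
for i, vec in enumerate(vectors):
    index.add_item(i, vec.tolist())

index.build(10)  #more trees = better recall at the cost of a larger index

#query with a (normalized) USEv4 query embedding; here we simply reuse one of the indexed vectors...
query = vectors[0].tolist()
ids, distances = index.get_nns_by_vector(query, 10, include_distances=True)
print(list(zip(ids, distances)))

In a real deployment you would index the actual GSG embeddings and, if needed, re-rank the top candidates returned by the index using exact cosine similarity.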

The optimal solution will depend a lot on your query volume, responsiveness requirements and update speed.