The GDELT Project

Tips For Building A Query Service Using The New Global Similarity Graph Document Embeddings

Given all of the interest and questions we've heard about how to build production-scale query services using the new Global Similarity Graph Document Embeddings dataset, we are releasing this list of tips to help get you started!

Creating Query Embeddings

The first step in working with the embeddings is to efficiently convert your input queries into the same Universal Sentence Encoder embedding space used in the dataset. We use the Universal Sentence Encoder v4 (USEv4) model to create our embeddings, and to query the GSG dataset you must use exactly that same model, since every embedding model produces its own incompatible embeddings.

Limitations Of Embeddings

It is important to recognize the strengths and limitations of embeddings: they capture the overall semantic similarity of documents rather than exact keyword or entity matches, making them far better suited to "more like this" clustering and discovery than to precise filtering.

In practice, the limitations above mean that for many use cases you may want to combine the GSG with the GKG, GEG or other GDELT datasets (using the article URL as the unique join key) to perform additional filtering. For example, you could search the GKG for vaccine-related coverage and then use the GSG to cluster that coverage, as in the sketch below.
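As a purely illustrative sketch, such a join might look like the following in Python with pandas. The file names and column names used here (DocumentIdentifier, url, embedding) are assumptions for illustration only; check the actual GKG and GSG schemas before adapting it.

#hypothetical GKG/GSG join on article URL...
import pandas as pd

gkg = pd.read_csv("gkg_vaccine_articles.csv")           #hypothetical GKG keyword-filtered extract
gsg = pd.read_json("gsg_embeddings.json", lines=True)   #hypothetical slice of the GSG dataset

#the article URL is the unique join key shared across GDELT datasets
matched = gkg.merge(gsg, left_on="DocumentIdentifier", right_on="url", how="inner")
embeddings = matched["embedding"].tolist()  #embeddings for the filtered coverage, ready to cluster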

Hardware Requirements

We are using the DAN-based USE model, which is extremely efficient and does not require hardware acceleration in the form of a GPU or TPU: it runs on CPU alone, offering maximal flexibility in running the model anywhere. It does benefit strongly from faster processors (in GCP the C2 family provides a 2x speedup over E2 VMs). The particular build we use is optimized for CPUs that support AVX-512 instructions ("optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA"), so that may also factor into your processor selection.

Most importantly, this particular DAN model has effectively linear inference time with respect to input length, meaning you can convert an entire document into an embedding at almost the same speed as a short search engine query. This makes it easy to use more complex inputs for your queries and means that you do not need accelerators even when computing embeddings for large inputs.
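You can verify this scaling behavior on your own hardware with a quick timing sketch like the one below (it reuses the Colab setup shown in the next section; the inputs are arbitrary examples):

#rough timing comparison of a short query vs. a long document...
import time
import tensorflow_hub as hub

embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embed_use(["warmup"])  #the first call includes one-time graph setup, so exclude it

short_query = ["vaccine microchip claims"]
long_document = [" ".join(["vaccine microchip claims"] * 200)]  #roughly an 800-word document

for name, batch in [("short query", short_query), ("long document", long_document)]:
    start = time.time()
    embed_use(batch)
    print(name, time.time() - start, "seconds")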

Creating One-Off Embeddings

When first experimenting with the dataset, you can use the free Colab service to manually convert a textual passage into a USEv4 embedding to prototype different ideas. Just create a free new Colab notebook and run the following code (replacing the sample sentence with any string):

#install and load libraries...
!pip install tensorflow_text
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text as text # needed for some USE variants, such as the multilingual models
import numpy as np

#L2-normalize each embedding to unit length...
def normalization(embeds):
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds / norms

sent = tf.constant(["A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that “tracks the location of the patient.” The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people."])
embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sente = embed_use(sent)
sente = normalization(sente)
print(repr(sente))

The output will be a 512-dimensional array:

<tf.Tensor: shape=(1, 512), dtype=float32, numpy= array([[-6.76356838e-04, 3.35783325e-02, -6.67930841e-02, -7.14375153e-02, -3.08108889e-02, 1.07741086e-02, 2.11005341e-02, 6.49546608e-02, 4.30178866e-02, -6.82501197e-02, 8.06940496e-02, 4.91752811e-02, 6.29389361e-02, 1.66880507e-02, -5.67101641e-03, -3.70305106e-02, -7.97090977e-02, -7.23296637e-03, -7.27019385e-02, 3.61410975e-02, -1.81389842e-02, 3.56576918e-03, -7.38827288e-02, 4.35053706e-02, -3.35104694e-03, 6.37104064e-02, 1.92584172e-02, -5.36565781e-02, 3.81646678e-02, 3.99802737e-02, -5.32769971e-02, 8.15957189e-02, -2.47750692e-02, 4.34365049e-02, 7.18429685e-02, 7.98831061e-02, -8.14278647e-02, 7.30962753e-02, -3.90970185e-02, -2.42321957e-02, 8.85859481e-04, 1.59769729e-02, 1.73619445e-02, -6.09335937e-02, -5.77191003e-02, -5.14351763e-03, 4.95849364e-02, 5.45341708e-02, 3.67251299e-02, 5.23556210e-03, -7.26118013e-02, -1.82263218e-02, 2.83526741e-02, 7.12847263e-02, -7.51046464e-02, -3.59605625e-02, -4.63198870e-02, -5.76271117e-02, 3.94778773e-02, -7.19692186e-02, -1.35769062e-02, 6.82483837e-02, -3.86933982e-02, -4.68094014e-02, 3.57298478e-02, -6.87625632e-02, 3.24299969e-02, 6.28880039e-02, -7.18246251e-02, 3.15887854e-02, -5.05154385e-05, 4.15558480e-02, -5.05858241e-03, -2.23924946e-02, 6.13835901e-02, -3.54572162e-02, 4.43822816e-02, -4.83701378e-02, -5.33702224e-02, 4.93556038e-02, -8.73594370e-04, -7.18877092e-02, -2.08738167e-02, -2.28899792e-02, 1.76748831e-03, -3.10755782e-02, 1.93924904e-02, 5.02826758e-02, 4.10971418e-03, -5.37503473e-02, 5.57333715e-02, 2.84079444e-02, -1.86564196e-02, 2.70265304e-02, -5.42857102e-04, 8.57702456e-03, 6.09128438e-02, -5.45500219e-02, -4.38415771e-03, -4.50687436e-03, -5.93304411e-02, 5.69677241e-02, -5.57250157e-02, -5.81391938e-02, 6.88386261e-02, 7.11042359e-02, 6.56928271e-02, 1.66122485e-02, 6.30950481e-02, -4.72512022e-02, -6.91004544e-02, 1.08533464e-02, 3.22929490e-03, -9.46690096e-04, 6.15085438e-02, -4.10078615e-02, -1.52225317e-02, 7.38443527e-03, -2.84125120e-03, 3.94875668e-02, -5.78187183e-02, -3.46644572e-03, 1.43317459e-02, -5.20273345e-03, -4.29149270e-02, -5.53261004e-02, -1.87431239e-02, -4.35798094e-02, -1.23266215e-02, 6.26963973e-02, 6.76322281e-02, 8.16316083e-02, 2.64641959e-02, -3.53941061e-02, -5.05655743e-02, -3.18637267e-02, -1.14524448e-02, 7.41548762e-02, -6.11535534e-02, -6.55739233e-02, 3.84859666e-02, 6.66911826e-02, -7.20713809e-02, 2.92808632e-03, 5.62359951e-02, 9.51226894e-03, 4.56910357e-02, -7.51297697e-02, 6.90657347e-02, 4.21055108e-02, -2.59263385e-02, 9.81741399e-03, 2.94160913e-03, -5.97751215e-02, 5.50655387e-02, 5.55065367e-03, 4.92441691e-02, 5.39373793e-02, 2.63481354e-03, 5.61112165e-02, -1.34488298e-02, 4.05395217e-02, -5.48825506e-03, -6.58439845e-02, -2.58194543e-02, -1.58555657e-02, 2.34762882e-03, -1.25716645e-02, -4.26669009e-02, 1.60640255e-02, 5.15688658e-02, -7.21045434e-02, -1.73300430e-02, -2.93727703e-02, 1.51059516e-02, 3.61871794e-02, -1.12395249e-02, 2.94390563e-02, 2.10165903e-02, -2.14187857e-02, -6.51167110e-02, -1.82202216e-02, 2.65326332e-02, 6.87541533e-03, 7.72116631e-02, -3.04351989e-02, -8.53389502e-03, 5.89442579e-03, 2.55114846e-02, -3.72634716e-02, 1.35741597e-02, -2.87334360e-02, 2.89946962e-02, -1.76235684e-03, 4.42778356e-02, 4.59584370e-02, -2.07500421e-02, -1.66762974e-02, 3.10814641e-02, -4.84270193e-02, 3.69234122e-02, 1.60416390e-03, 1.90665293e-02, -2.38827430e-02, 2.87359916e-02, -7.04021528e-02, 5.11950590e-02, 2.37841941e-02, 2.78002135e-02, 6.73737004e-02, 1.23101231e-02, -2.61966907e-03, 
5.78176118e-02, 3.68967131e-02, -3.38435126e-03, 3.38038430e-02, 7.10640773e-02, -4.38991282e-03, -2.46944465e-03, 6.89607188e-02, -2.10304111e-02, 1.74455028e-02, 4.72253896e-02, 7.55667314e-02, 4.17576358e-02, 5.06599247e-02, -4.47778143e-02, -2.82419380e-02, -5.05971462e-02, -2.29050219e-02, -6.34119362e-02, -2.48881299e-02, -2.07891967e-02, -8.11395496e-02, -7.57279713e-03, 1.01827094e-02, 2.36734990e-02, -8.92031100e-03, -2.52159638e-03, -4.42645922e-02, 2.72163115e-02, -4.16662544e-02, 6.28380328e-02, -6.46699443e-02, 4.98163030e-02, 2.40474264e-03, -4.35062796e-02, 9.96236573e-04, 3.21699865e-03, -7.41114616e-02, 5.10127423e-03, 5.91132231e-03, -2.09343582e-02, 5.36168702e-02, 6.32562339e-02, -1.18424380e-02, -5.33314049e-02, 8.15369189e-02, -3.56774591e-02, -3.64913158e-02, -2.39417814e-02, -1.68477502e-02, 4.09753472e-02, -5.00184186e-02, -3.02095693e-02, -6.65327683e-02, 6.97435886e-02, 6.97659627e-02, 2.44959928e-02, -7.88502675e-03, -1.70990340e-02, -3.60420384e-02, -1.89642422e-02, -7.21183345e-02, -6.83112964e-02, 5.45631945e-02, 5.56440577e-02, -6.96792156e-02, 5.17817736e-02, -5.04019484e-03, -7.98536614e-02, -6.72034398e-02, 3.57697830e-02, 7.33052269e-02, -6.80490360e-02, -6.17038347e-02, -7.66119808e-02, 5.73239997e-02, -1.82283260e-02, -3.99673171e-02, 5.84224798e-02, -7.66021237e-02, 6.21817261e-02, -2.64632311e-02, -2.70551220e-02, -2.09122039e-02, 4.49912027e-02, 5.27962260e-02, -1.61876865e-02, 2.99768839e-02, -3.36280465e-02, -1.51605671e-03, -2.47574896e-02, 5.60405292e-03, 9.68514197e-03, -1.68982614e-02, -7.41703138e-02, 8.06703139e-03, -1.44717349e-02, -1.71089768e-02, 5.54176830e-02, 5.78010641e-02, -1.78876668e-02, -1.29997097e-02, 5.63468151e-02, 7.27057606e-02, -1.10625252e-02, 7.14442134e-03, 2.05701385e-02, 5.05811423e-02, 8.63553584e-03, 5.28230928e-02, 5.43508772e-03, -4.37706057e-03, -1.69210937e-02, 7.57706687e-02, 1.81356780e-02, 4.76261452e-02, 1.06395511e-02, -7.35997260e-02, 6.64972737e-02, 5.22572510e-02, 5.71820438e-02, 2.35127471e-02, 2.56717186e-02, -3.12523060e-02, 3.06110885e-02, -1.37786288e-03, -2.19957810e-02, -3.60827036e-02, 5.08690346e-03, -1.49257006e-02, 7.51743838e-02, 5.96603751e-03, -2.87195779e-02, -6.46486282e-02, -5.29822260e-02, -1.47496245e-03, -7.67807290e-02, 3.60531062e-02, 6.81242943e-02, 2.16352921e-02, -8.55341647e-03, -2.21430194e-02, -8.83253198e-03, 3.59478197e-03, -7.58751631e-02, 5.91248423e-02, 4.42272760e-02, -3.39478180e-02, 4.03287634e-02, -4.57744524e-02, 4.45390902e-02, 6.11837544e-02, -3.16450559e-02, -7.24180341e-02, 2.00887918e-02, -3.19629908e-02, -6.86090300e-03, 3.28799486e-02, -7.06641898e-02, 1.93985011e-02, -3.90757024e-02, -3.66524868e-02, 5.76053411e-02, 6.35175093e-04, 4.37529907e-02, -1.18877320e-02, 6.06463850e-02, -9.88359284e-03, 3.63793671e-02, -7.47962818e-02, -7.39435032e-02, 6.32128567e-02, 6.12870194e-02, -6.58575967e-02, 2.75015971e-03, -4.56172265e-02, 6.30888119e-02, 9.60739609e-03, -5.22800013e-02, -6.43881708e-02, 2.20250804e-02, 1.62106263e-03, -4.56735268e-02, -2.72764824e-02, 1.55690350e-02, -1.04821082e-02, 1.09128319e-02, 4.22615670e-02, 3.59000675e-02, 2.44700797e-02, 1.16510596e-02, -2.50982083e-02, 5.81165403e-02, -2.99764648e-02, 3.09661478e-02, 4.46595661e-02, -5.85869774e-02, -6.54702336e-02, -4.22021002e-02, -1.62350927e-02, -2.27494091e-02, 7.32957125e-02, 7.31576756e-02, -5.63123915e-03, -1.70655418e-02, -1.72184445e-02, 7.96012357e-02, 5.04064001e-02, 1.87336896e-02, 4.66014594e-02, 5.06016947e-02, 2.99742296e-02, 2.36040205e-02, -5.34015521e-02, 1.35052192e-03, 
6.80805445e-02, 2.22724825e-02, -3.01939417e-02, -7.31360614e-02, -2.51521859e-02, 4.05842923e-02, 1.60862431e-02, -2.25230474e-02, 2.86010765e-02, 2.29199734e-02, 2.20593587e-02, -5.47554232e-02, 5.78941219e-02, 6.56009391e-02, -7.13857934e-02, -7.48909358e-03, 6.78754002e-02, -2.92496174e-03, -6.95068836e-02, 5.49913049e-02, 3.31136771e-02, -2.13340521e-02, -4.93099019e-02, -1.47924507e-02, -6.91533834e-02, -3.78118120e-02, -6.53230399e-02, -4.87434752e-02, 2.96516586e-02, -7.58804101e-03, -1.24682412e-02, -7.59115368e-02, -3.75635456e-03, 2.93405913e-02, -5.34483194e-02, -1.71132796e-02, 5.20518832e-02, -6.30412847e-02, -5.12841195e-02, -3.42662632e-02, -5.49382344e-02, -6.89640343e-02, 6.04783110e-02, -2.27603670e-02, 6.75819349e-03, 5.91140948e-02, -4.53736931e-02, 3.08122877e-02, -2.23298520e-02, -1.62059460e-02, 4.74171452e-02, -7.03755468e-02, -5.98351210e-02, 4.70205843e-02, -5.80502860e-03, 2.21272446e-02, -7.57156089e-02, 4.97078523e-02, -3.15653495e-02, 4.92160209e-02, -3.86846699e-02, 2.09846185e-03, -5.62710315e-02, -2.08172686e-02, 7.27586523e-02, 3.23538706e-02, -1.12844165e-02, 4.76871207e-02, 7.68466992e-03, -1.23470398e-02, 2.14785617e-02, 2.89252382e-02, 3.02119087e-02, 2.10444834e-02, 2.13446151e-02, -3.27234976e-02, -6.14904426e-02, -6.49609463e-03, -8.24379921e-02, -2.68000960e-02, 7.73761328e-03, -4.13497761e-02, 1.23230414e-02, -6.00103587e-02, 1.27278063e-02]], dtype=float32)>

The inner array of float32 values is the embedding of that particular string under the USEv4 model.
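To use that embedding, you compute its cosine similarity against the document embeddings in the GSG. A minimal sketch, assuming the GSG vectors have already been loaded into a hypothetical (N, 512) numpy array and normalized to unit length:

#rank documents by cosine similarity to the query...
import numpy as np

query = sente.numpy()[0]                 #the 512-dim unit vector computed above
doc_embeds = np.random.rand(1000, 512)   #placeholder standing in for real GSG vectors
doc_embeds /= np.linalg.norm(doc_embeds, axis=1, keepdims=True)

sims = doc_embeds @ query                #dot product of unit vectors = cosine similarity
print(np.argsort(-sims)[:10])            #indices of the 10 most similar articles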

Production Embedding Generation

In a production application you obviously need a more automated workflow that supports high-volume embedding computation. Thankfully, it is trivial to deploy USEv4 inside TensorFlow Serving in a Docker container, yielding a RESTful server that supports high-volume embedding computation. In fact, this workflow is so efficient that the limiting bottleneck rapidly becomes the Docker networking layer, so it is important to use the "--net=host" parameter, which achieves a 20% speedup for high-volume use cases.

To start from scratch on GCP, spin up a new C2 VM (this will provide a 2x speedup over E2 VMs, but you can also use E2 VMs as needed). The USEv4 model is so efficient that you won't need many cores: internally we use a 4-core C2 VM to generate all of our embeddings, with spare CPU capacity to run other models on the same machine.

Create a new directory on the VM:

/TENSORFLOW/models/universal-sentence-encoder/4/

Then click the orange "Download" button on the Universal Sentence Encoder v4 model TFHub page to download the 1GB compressed model, and unpack it into that directory. When you are done it should look like:

$find /TENSORFLOW/models/universal-sentence-encoder/4/
/TENSORFLOW/models/universal-sentence-encoder/4/
/TENSORFLOW/models/universal-sentence-encoder/4/assets
/TENSORFLOW/models/universal-sentence-encoder/4/variables
/TENSORFLOW/models/universal-sentence-encoder/4/variables/variables.data-00000-of-00001
/TENSORFLOW/models/universal-sentence-encoder/4/variables/variables.index
/TENSORFLOW/models/universal-sentence-encoder/4/saved_model.pb

Install Docker on your VM and then run the following command to install and run TensorFlow Server running the USEv4 model:

docker run -t --net=host --restart always --name tf-serve-universal-sentence-encoder -v "/TENSORFLOW/models:/models" -e MODEL_NAME="universal-sentence-encoder" tensorflow/serving --rest_api_port=8501 --enable_model_warmup=true &

This single command will download the TensorFlow Serving image to your VM and run it, and in a matter of seconds you will have a production-capable embedding server running!

Here's an explanation of the parameters we use:

--net=host: bypasses Docker's bridge networking layer, which otherwise becomes the bottleneck at high request volumes.
--restart always: automatically restarts the container if it exits or the VM reboots.
--name tf-serve-universal-sentence-encoder: a human-readable name for managing the container.
-v "/TENSORFLOW/models:/models": mounts the host directory containing the model into the container.
-e MODEL_NAME="universal-sentence-encoder": tells TensorFlow Serving which model under /models to serve.
--rest_api_port=8501: the port on which the REST API listens.
--enable_model_warmup=true: warms the model up at startup so the first real request is not slowed by initialization.

You can change many of these parameters to suit your specific needs, including configuring HTTPS, though we strongly recommend securing the VM to accept only local trusted traffic.
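Once the container is up, you can confirm the model loaded correctly by querying TensorFlow Serving's model-status endpoint, for example from Python:

#sanity-check that the model is loaded and serving...
import requests

status = requests.get("http://localhost:8501/v1/models/universal-sentence-encoder")
print(status.json())  #should report the model version with state "AVAILABLE"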

To generate an embedding, you can simply connect to the server locally (or from elsewhere within your network) using curl:

time curl -d '{"instances": ["A video circulating on social media falsely claims that vaccines for COVID-19 have a microchip that tracks the location of the patient. The chip, which is not currently in use, would be attached to the end of a plastic vial and provide information only about the vaccine dose. It cannot track people."]}' -X POST http://localhost:8501/v1/models/universal-sentence-encoder:predict > embedding.json

This will yield the output (truncated here for readability):

{
    "predictions": [[-0.000676348631, 0.0335783474, -0.0667930841, -0.0714375153, -0.0308108795, 0.0107741114, 0.0211005267, 0.0649546608, 0.0430179, -0.0682501197, 0.080694057, 0.0491752848, ..., -0.0413497649, 0.0123230452, -0.0600103587, 0.0127278212]]
}

Just extract the inner array of 512 floating point numbers as the embedding!
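In an application you will typically issue the same request programmatically rather than shelling out to curl. A minimal sketch in Python using the requests library:

#request an embedding from the local TensorFlow Serving instance...
import requests

resp = requests.post(
    "http://localhost:8501/v1/models/universal-sentence-encoder:predict",
    json={"instances": ["your query text here"]},
)
embedding = resp.json()["predictions"][0]  #the 512-float embedding array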

To compute multiple embeddings at once, just provide an array of strings:

time curl -d '{"instances": ["query 1", "query 2", "query 3", "query 4"]}' -X POST http://localhost:8501/v1/models/universal-sentence-encoder:predict > embedding.json

This will yield the same output as above, but with an array of arrays.

On a GCP C2 VM we find that batching queries into arrays as above, such that each individual POST body is roughly 250K in size, achieves the highest throughput; a batching sketch follows below.
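Here is one hedged way to implement that batching in Python; the 250,000-byte target is our rule of thumb from above, not a hard API limit:

#batch texts so each POST body is roughly 250KB...
import requests

URL = "http://localhost:8501/v1/models/universal-sentence-encoder:predict"

def embed_in_batches(texts, target_bytes=250000):
    embeddings, batch, size = [], [], 0
    for doc in texts:
        batch.append(doc)
        size += len(doc.encode("utf-8"))
        if size >= target_bytes:  #flush once the payload reaches ~250KB
            resp = requests.post(URL, json={"instances": batch})
            embeddings.extend(resp.json()["predictions"])
            batch, size = [], 0
    if batch:  #flush any final partial batch
        resp = requests.post(URL, json={"instances": batch})
        embeddings.extend(resp.json()["predictions"])
    return embeddings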

Note that USEv4 embeddings are "approximately normalized," but we have observed embeddings that are not unit length, so we recommend that production applications verify that each vector is unit length and L2 normalize it if not. You can see a simple BigQuery UDF JavaScript implementation below that computes the cosine similarity while normalizing each vector at the same time. Obviously, in a production application you wouldn't want to normalize at query time, so you would instead normalize when recording the embedding.

CREATE TEMPORARY FUNCTION cossim(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS '''
var sumt = 0, suma = 0, sumb = 0;
for (var i = 0; i < a.length; i++) {
    sumt += a[i] * b[i];
    suma += a[i] * a[i];
    sumb += b[i] * b[i];
}
suma = Math.sqrt(suma);
sumb = Math.sqrt(sumb);
return sumt / (suma * sumb);
''';
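The equivalent write-time check is straightforward; a minimal sketch in Python that unitizes a vector only when it deviates from unit length:

#verify an embedding is unit length and L2 normalize it if not...
import numpy as np

def unitize(embedding, tol=1e-3):
    vec = np.asarray(embedding, dtype=np.float64)
    norm = np.linalg.norm(vec)
    if abs(norm - 1.0) > tol:  #"approximately normalized" vectors can drift from unit length
        vec = vec / norm
    return vec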

Computing Similarities At Scale

Once you've converted your query to a USEv4 embedding, how do you efficiently query the GSG dataset? Even just querying a few days' worth of news coverage requires computing millions of 512-dimensional cosine similarities, which is extremely slow and computationally demanding.

We will devote a future blog post to various approaches to this problem, but suffice it to say that efficiently searching large embedding datasets is an active area of research, with an array of approaches to what is known as Approximate Nearest Neighbor (ANN) search.
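While prototyping, exact brute-force search is often a serviceable baseline at moderate scale. A sketch, assuming the document vectors are pre-normalized rows of a numpy array:

#exact top-k search over unit-length document vectors...
import numpy as np

def top_k(query, doc_embeds, k=25):
    sims = doc_embeds @ query            #one matrix-vector product yields all cosine similarities
    idx = np.argpartition(-sims, k)[:k]  #unordered top-k candidates without a full sort
    return idx[np.argsort(-sims[idx])]   #sorted best-to-worst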

Some solutions include ANN libraries such as Annoy, FAISS, and ScaNN, as well as managed vector search services.

The optimal solution will depend a lot on your query volume, responsiveness requirements and update speed.