Today we are tremendously excited to announce the debut of the Global Similarity Graph Document Embeddings, a realtime database of document-level embeddings covering worldwide online news coverage in 65 languages using the Universal Sentence Encoder. This new dataset will enable everything from rich semantic search across the world's news to deep semantic narrative analysis of topical spaces, along with myriad other use cases from "more like this" queries to automated fact check scanning. Imagine taking a breaking news story and searching back over time to contextualize its events or compiling coverage of a given topic and visualizing it according to rich semantic clustering. In place of carefully selected keyword queries, this dataset can even be used to search the news using natural language queries and short article snippets through embedding similarity search!
Based on extensive experimentation, we've selected the Universal Sentence Encoder v4 as the embedding model for this dataset. While this member of the USE family is DAN-based rather than Transformer-based, for our particular use case of semantic similarity comparison of global news coverage we have not observed a substantial accuracy benefit from the far more computationally expensive Transformer models. In fact, the DAN-based model appears to perform on par with, or even outperform, its Transformer equivalent in many of our tests when judged against human intuition. It also appears more robust to machine translation errors.
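To illustrate why the DAN architecture is so much cheaper than self-attention, here is a toy sketch of its core idea: token vectors are mean-pooled in a single linear pass over the document. This is not the actual USE implementation; the vocabulary, dimension, and random weights below are purely illustrative stand-ins.

```python
import random

random.seed(0)
DIM = 8  # toy dimension; the real USEv4 output is 512-dimensional

# Toy token-embedding table with random stand-ins for learned vectors.
vocab = {w: [random.uniform(-1.0, 1.0) for _ in range(DIM)]
         for w in "cdc masking guidance vaccine efficacy delta".split()}

def dan_embed(tokens):
    """Deep-averaging sketch: mean-pool the token vectors.

    A real DAN then feeds this average through a few dense layers, but
    the cost is dominated by this single pass over the tokens, which is
    why it is so much cheaper than Transformer self-attention.
    """
    vecs = [vocab[t] for t in tokens if t in vocab]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Because the pooling cost grows only linearly with document length, embedding the full text of every article in a realtime firehose becomes tractable.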
Multilingual embedding models do not yet offer truly resource-level-independent performance: comparing nearly identical articles across a high- and a low-resource language will typically yield a much lower similarity score than comparing even moderately similar articles within one language or across high-resource languages. Such models are also limited to the languages they were originally trained on. To address these limitations, rather than use a natively multilingual embedding model, we apply the English-only base USE model to the machine translations of our 65 live-translated languages. Thus, embeddings for English-language articles reflect the raw USE embedding of the original text, while for all other languages the embeddings are computed from the English machine translation of the article, with accuracy varying according to translation quality.
Unlike most current embedding workflows, we leverage the extreme speed of the DAN-based USE model to generate whole-document embeddings that reflect the entire text of each article. This allows them to move beyond the lead-plus-last-paragraph approaches often used to work around the token limits or computational demands of many embedding models, and thus to capture the full range of an article. For example, an article that opens and closes with a discussion of recent CDC masking guidance, but whose middle 80% discusses vaccine efficacy against the Delta variant, will correctly encode that central vaccine discussion, which would be missed by a model that looked only at the lead and last paragraphs. Note that document-level embeddings are an active area of research, and these embeddings reflect just one specific approach to the problem.
While most similarity tasks require normalized embedding vectors, we output the raw embedding as-is from the USEv4 model to enable use cases that need to access the raw magnitude information.
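For most similarity tasks, each raw vector should therefore be scaled to unit length before comparison. A minimal stdlib-only sketch of that normalization step and the resulting cosine similarity (the function names are ours, not part of the dataset):

```python
import math

def normalize(vec):
    """Scale a raw USEv4 embedding to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two raw (unnormalized) embeddings:
    the dot product of their unit-length versions."""
    ua, ub = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(ua, ub))
```

Once both vectors are normalized, cosine similarity reduces to a plain dot product, which is what most nearest-neighbor search libraries expect.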
Each file is a UTF-8 encoded JSONNL file, with one article per row, each row containing the following fields:
- date. Date/time the article was first seen.
- url. The URL of the article.
- lang. The CLD2 human name for the article's primary language.
- title. The title of the article.
- model. The identifier of the model used to construct the embedding. At this time it is set to "USEv4" for all embeddings.
- docembed. A JSON array of 512 floating point numbers representing the embedding. This is the unnormalized raw embedding provided as-is from the USEv4 model. Most similarity tasks will need to normalize this into a unit vector, such as to perform cosine similarity.
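Since each row is a standalone JSON object, records can be read with any JSON parser. A brief sketch using a hypothetical row (the real docembed array holds 512 floats; only four are shown here for brevity):

```python
import json

# Hypothetical example row from a Global Similarity Graph file.
row = ('{"date":"2021-07-26T16:00:00Z","url":"https://example.com/story",'
       '"lang":"English","title":"Example headline","model":"USEv4",'
       '"docembed":[0.1,-0.2,0.3,0.4]}')

record = json.loads(row)
embedding = record["docembed"]  # raw, unnormalized USEv4 vector
```

In practice you would iterate over the lines of a downloaded file, calling `json.loads` on each line in turn.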
At this time the dataset updates every 15 minutes, with the latest file typically published 30 minutes after its timestamp. Thus, at 2PM EST you would download the file from 1:30PM EST. Over the coming months the dataset will move to one-minute updates, so scheduled downloading workflows should be designed so they can easily switch to downloading every 60 seconds once that transition happens.
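Under the current cadence, the most recent available file can be found by flooring the current time to a 15-minute boundary and stepping back 30 minutes. A sketch of that schedule logic (the function name is ours):

```python
from datetime import datetime, timedelta, timezone

def latest_file_timestamp(now=None):
    """UTC timestamp of the most recently published file, assuming the
    current 15-minute cadence with files appearing ~30 minutes after
    their nominal time."""
    now = now or datetime.now(timezone.utc)
    floored = now.replace(minute=now.minute - now.minute % 15,
                          second=0, microsecond=0)
    return floored - timedelta(minutes=30)
```

For example, at 19:07 UTC (2:07PM EST) this yields 18:30 UTC, matching the 1:30PM EST file described above.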
The dataset can be downloaded directly every 15 minutes as a JSON file with the following URL structure, with the date represented as "YYYYMMDDHHMMSS" in the UTC timezone. The first available file is shown below.
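Given that URL structure, download URLs can be generated by formatting a UTC datetime as "YYYYMMDDHHMMSS". In this sketch, `BASE_URL` and the file suffix are placeholders to be replaced with the actual values from the URL structure above:

```python
from datetime import datetime

# Placeholder prefix and suffix; substitute the real URL structure.
BASE_URL = "http://data.example.org/gsg_docembed/"

def file_url(ts):
    """Build a download URL for a UTC datetime, per the
    YYYYMMDDHHMMSS naming convention."""
    return BASE_URL + ts.strftime("%Y%m%d%H%M%S") + ".json"
```

A scheduled fetcher would combine this with the publication-delay logic above to poll for each new file as it appears.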
You can also query the dataset in BigQuery:
We are tremendously excited to see what kinds of powerful new applications this dataset enables!