Announcing The Global Similarity Graph Television News Sentence Embeddings Using The Universal Sentence Encoder

Today we are tremendously excited to announce the debut of the Global Similarity Graph Television News Sentence Embeddings, a realtime database of sentence-level embeddings over television news closed captioning covering CNN, MSNBC, Fox News and BBC News London, along with the ABC, CBS and NBC evening news broadcasts, using data from the Internet Archive's Television News Archive. This dataset takes the raw station-provided closed captioning stream, collapses it into sentences and then computes an "embedding" for each sentence: a 512-dimensional vector that essentially represents the topical focus of that sentence, collapsing similar-meaning words together. In this way, a sentence about "semiconductors" and one about "microchips" would have highly similar representations even though they use different words, because the embedding recognizes that those words have highly similar meanings. In essence, sentences that talk about similar things have similar vectors: the more similar the vectors, the more similar their topical focus, even if they use entirely different wording. This allows us to abstract beyond the precise words used in a sentence to its actual topical focus.
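To make that comparison concrete, here is a minimal sketch of the cosine similarity measure commonly used to compare such vectors. The tiny 4-dimensional vectors are purely illustrative stand-ins for the real 512-dimensional embeddings:

    # A minimal sketch of the vector comparison described above. The toy
    # 4-dimensional vectors are illustrative stand-ins for the real
    # 512-dimensional sentence embeddings.
    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between two vectors: 1.0 = identical direction."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    semiconductors = np.array([0.9, 0.1, 0.2, 0.0])  # stand-in embedding
    microchips     = np.array([0.8, 0.2, 0.1, 0.1])  # near-synonym: similar vector
    weather        = np.array([0.0, 0.1, 0.9, 0.3])  # unrelated topic

    print(cosine_similarity(semiconductors, microchips))  # close to 1.0
    print(cosine_similarity(semiconductors, weather))     # much lower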

Based on extensive experimentation, we've selected the Universal Sentence Encoder V4 as the embedding model for this dataset. While this particular member of the USE family is DAN-based rather than Transformer-based, in our particular use case of semantic similarity comparison over global news coverage we have not observed a substantial accuracy benefit from the far more computationally expensive Transformer models; in fact, the DAN-based model appears to perform on par with, or even outperform, its Transformer equivalent in many of our tests when compared against human intuition. In randomized spot comparisons, it also appears more robust to captioning errors.
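For those who want to generate comparable embeddings themselves, a minimal sketch using the publicly released Universal Sentence Encoder V4 from TensorFlow Hub (requires the tensorflow and tensorflow_hub packages) might look like the following; the two example sentences are our own illustrations:

    # A sketch of embedding sentences with the Universal Sentence Encoder V4.
    import numpy as np
    import tensorflow_hub as hub

    # Load the publicly released USE V4 model from TensorFlow Hub.
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    sentences = [
        "Demand for semiconductors continues to surge worldwide.",
        "Automakers are struggling with a shortage of microchips.",
    ]
    vectors = embed(sentences).numpy()  # shape (2, 512)

    # Normalize and take the inner product, i.e. cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    print(vectors[0] @ vectors[1])  # higher = more similar topical focus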

As of today we are releasing more than a decade of television news sentence-level Universal Sentence Encoder embeddings totaling more than 189 million records covering seven stations: CNN, MSNBC, Fox News and BBC News London, along with the ABC, CBS and NBC evening news broadcasts.

The final dataset consists of one UTF8-encoded JSON-NL (newline-delimited JSON) file per broadcast, where each row is a captioning sentence and has the following fields:
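As a sketch of how one might consume these files, the snippet below reads one broadcast's file line by line; the filename and the "sent" and "embed" field names are hypothetical placeholders rather than the dataset's documented schema:

    # A minimal sketch of reading one per-broadcast JSON-NL file. The field
    # names "sent" and "embed" are hypothetical placeholders; substitute the
    # actual field names from the dataset documentation.
    import json
    import numpy as np

    records = []
    with open("SHOWFILE.json", encoding="utf-8") as f:  # hypothetical filename
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    # Stack the per-sentence embeddings into one (num_sentences, 512) matrix.
    embeddings = np.array([rec["embed"] for rec in records])
    print(len(records), embeddings.shape)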

To determine which shows are available for any given day, consult the daily inventory files, named in the format YYYYMMDD.txt and available from July 2, 2009 through the present. The first available file is:

Each available show from that day is listed on its own line like:

Files for the most recent few days are updated every 20 minutes with the latest shows as they complete processing at the Archive, so refetch the inventory files for the last 4 days every hour or so to pick up newly completed shows; a minimal polling sketch appears below.
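In the sketch, the base URL is a placeholder rather than the dataset's actual download location:

    # A sketch of the refresh pattern described above: refetch the daily
    # inventory files for the most recent 4 days. The base URL is a
    # placeholder; substitute the dataset's actual download location.
    from datetime import date, timedelta

    import requests

    INVENTORY_BASE_URL = "https://example.com/gsg-tv-sentembed/"  # placeholder

    def fetch_recent_inventories(days=4):
        """Return {YYYYMMDD: [show lines]} for the last `days` days."""
        shows = {}
        today = date.today()
        for offset in range(days):
            day = (today - timedelta(days=offset)).strftime("%Y%m%d")
            resp = requests.get(INVENTORY_BASE_URL + day + ".txt", timeout=30)
            if resp.ok:
                shows[day] = resp.text.splitlines()
        return shows

    # Run this every hour or so and process any shows not seen before.
    print(fetch_recent_inventories())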

You can also query the dataset in BigQuery:
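As a sketch, assuming the google-cloud-bigquery Python client and a placeholder table name (substitute the actual table for this dataset), a query might look like:

    # A hedged sketch of querying the dataset with the google-cloud-bigquery
    # client. The table name below is a placeholder; substitute the actual
    # BigQuery table for this dataset.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT *
    FROM `your-project.your_dataset.tv_sentence_embeddings`
    LIMIT 10
    """
    for row in client.query(sql).result():
        print(dict(row))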

We are tremendously excited to see what kinds of powerful new applications this dataset enables!