Today we are tremendously excited to announce the debut of the Global Similarity Graph Television News Sentence Embeddings, a realtime database of sentence-level embeddings over television news closed captioning covering CNN, MSNBC, Fox News and BBC News London, along with the ABC, CBS and NBC evening news broadcasts, using data from the Internet Archive's Television News Archive. This dataset takes the raw station-provided closed captioning stream, groups it into sentences and then computes an "embedding" for each sentence: a 512-dimension vector that essentially represents the topical focus of that sentence, collapsing similar-meaning words together. In this way, a sentence about "semiconductors" and a sentence about "microchips" would have highly similar representations even though they use different words, because the embedding recognizes that those words have highly similar meanings. In essence, sentences that talk about similar things have similar vectors: the more similar the vectors, the more similar their topical focus, even if they use entirely different wording. This allows us to abstract beyond the precise words used in a sentence towards its actual topical focus.
Based on extensive experimentation, we've selected the Universal Sentence Encoder V4 as our embedding model for this dataset. This particular member of the USE family is DAN-based (Deep Averaging Network) rather than Transformer-based. In our particular use case of semantic similarity comparison of global news coverage, we have not observed a substantial accuracy benefit from the far more computationally expensive Transformer models, and in fact the DAN-based model appears to perform on par with or even outperform its Transformer equivalent in many of our tests when compared against human intuition. In randomized spot comparisons, it also appears more robust against captioning errors.
As of today we are releasing more than a decade of television news sentence-level Universal Sentence Encoder embeddings totaling more than 189 million records covering the following seven stations:
- BBC News London (2017 – present)
- CNN (July 2009 – present)
- MSNBC (July 2009 – present)
- Fox News (July 2009 – present)
- ABC Evening News (July 2010 – present)
- CBS Evening News (July 2010 – present)
- NBC Evening News (July 2010 – present)
The final dataset consists of one UTF-8 encoded JSONNL (newline-delimited JSON) file per broadcast, where each row is a captioning sentence with the following fields:
- date. The UTC timestamp the captioning sentence was uttered on air. Note that closed captioning tends to lag several seconds behind the actual words being spoken since captioning is typically typed in realtime by humans watching the live broadcast. Thus, when combining this with other television news datasets, make sure to use the "captioning time" version, such as the Captioning Time Advertising Inventory Files.
- iaIdentifier. The unique Internet Archive Television News Archive identifier for this broadcast.
- station. The television channel the broadcast aired on.
- show. The name of the show being broadcast (such as "CNN Headline News"). Note that long names may be truncated and followed with "…".
- lead. The first two words of the sentence. Embeddings look past the specific words used in a sentence towards its underlying topical focus, and captioning typically runs a few seconds behind the video signal, so it can be hard to identify which sentence a particular embedding refers to. For example, when performing a similarity search for "microchips," a highly similar captioning line might discuss "semiconductors," making it difficult to spot in the Internet Archive's Television News Archive preview which captioning sentence is the match. Including the first two words of the sentence lets you more readily see in the Archive's preview which captioning sentence is the highly similar one.
- offset. The offset in seconds from the start of the broadcast. This can be used with the Internet Archive's Television News Archive website in order to jump to that point in the broadcast, as the previewUrl field does.
- previewUrl. A link to the Internet Archive's Television News Archive where you can watch a one minute clip of the broadcast beginning around the time the sentence was spoken in order to see the complete sentence and understand its context.
- model. The identifier of the model used to construct the embedding. At this time it is set to "USEv4" for all embeddings.
- sentEmbed. A JSON array of 512 floating point numbers representing the embedding. This is the raw, unnormalized embedding provided as-is by the USEv4 model. Most similarity tasks will need to normalize it to a unit vector first, for example before computing cosine similarity.
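To make the normalization step concrete, here is a minimal Python sketch of how one might read the rows of a broadcast file and rank them by cosine similarity against a query embedding. It relies only on the field names listed above (sentEmbed and lead); the function names are our own illustrative choices rather than part of the dataset, and the query vector is assumed to come from the same USEv4 model.

```python
import json
import math


def unit_vector(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity is the dot product of the two unit vectors."""
    ua, ub = unit_vector(a), unit_vector(b)
    return sum(x * y for x, y in zip(ua, ub))


def load_records(lines):
    """Parse newline-delimited JSON rows, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def rank_by_similarity(query_vec, records, top_k=5):
    """Return the top_k (score, record) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, rec["sentEmbed"]), rec) for rec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With the rows of one broadcast file loaded via load_records(open(path, encoding="utf-8")), where path is a local copy of that file, rank_by_similarity returns the closest sentences, and each record's lead and previewUrl fields can then be used to locate the matching sentence in the Archive's preview.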
To determine which shows are available for any given day, there is a daily inventory file in the format YYYYMMDD.txt from July 2, 2009 through present. The first available file is:
Each available show from that day is listed on its own line like:
Files for the most recent few days are updated every 20 minutes with the latest shows as they complete processing at the Archive, so refetch the inventory files for the last 4 days every hour or so to find the latest shows.
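As a rough sketch of that polling routine, the Python snippet below builds the YYYYMMDD.txt inventory filenames for the most recent few days. It only generates filenames: the download location and the fetch itself are left out, and the function name and the use of the current UTC date are our own assumptions for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# The first daily inventory file covers July 2, 2009.
FIRST_INVENTORY_DATE = date(2009, 7, 2)


def recent_inventory_files(today, days=4):
    """Return inventory filenames for the last `days` days, newest first."""
    names = []
    for back in range(days):
        day = today - timedelta(days=back)
        if day >= FIRST_INVENTORY_DATE:  # nothing exists before the first file
            names.append(day.strftime("%Y%m%d") + ".txt")
    return names


if __name__ == "__main__":
    # Assumption: "today" is taken in UTC, matching the UTC timestamps in the data.
    for name in recent_inventory_files(datetime.now(timezone.utc).date()):
        print(name)
```

Running this every hour or so and refetching each listed file is one simple way to pick up newly processed shows.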
You can also query the dataset in BigQuery:
We are tremendously excited to see what kinds of powerful new applications this dataset enables!