Last March we explored how the Google AI Workshop "Semantic Similarity For Natural Language" API could be used to identify semantic discontinuities in television news broadcast transcripts as a first step towards segmenting continuous broadcasts into discrete "stories." As we continue these explorations, our latest experiment applies the pretrained Universal Sentence Encoder "Large" model version 5 to television news transcripts. We use the pretrained downloadable model as-is, with no additional tuning or domain adaptation. The model accepts a short text passage as input and outputs a 512-dimension sentence-level embedding.
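To make the workflow concrete, here is a minimal sketch, assuming a standard TensorFlow Hub environment, of loading the pretrained model and embedding a couple of transcript lines (the example lines are made up for illustration and are not from our pipeline):

```python
# Minimal sketch: load the pretrained Universal Sentence Encoder "Large" v5
# model from TensorFlow Hub and embed a few short text inputs into
# 512-dimension vectors. The example lines below are purely illustrative.
import tensorflow_hub as hub

# The model is used as-is, with no additional tuning or domain adaptation.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

lines = [
    "Tonight, breaking news as we come on the air.",
    "A powerful storm system is moving across the country.",
]

embeddings = embed(lines)   # TensorFlow tensor of shape (2, 512)
print(embeddings.shape)
```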
We applied two different approaches to the 3:30PM PST broadcast of ABC World News Tonight With David Muir on January 15, 2020.
- Single Line. Each line of the transcript was converted to its sentence embedding vector and the cosine similarity was computed between that vector and the vector of the preceding line.
- Dual Line. Every two lines of the transcript were concatenated and converted to a sentence embedding vector, and the cosine similarity was computed between that vector and the vector of the previous two lines. The pairs do not overlap, so the comparisons look like Line1+Line2 <-> Line3+Line4, then Line5+Line6 <-> Line7+Line8, and so on (see the sketch after this list).
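The sketch below illustrates both approaches, assuming the transcript is already available as a simple list of lines; the helper names are ours and not part of any published pipeline:

```python
# Rough sketch of the Single Line and Dual Line approaches. Assumes a list of
# transcript lines; function and variable names are illustrative only.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_line_scores(lines):
    """Similarity between each line and the line immediately before it."""
    vecs = np.asarray(embed(lines))
    return [cosine(vecs[i - 1], vecs[i]) for i in range(1, len(vecs))]

def dual_line_scores(lines):
    """Similarity between consecutive non-overlapping two-line blocks:
    Line1+Line2 <-> Line3+Line4, then Line5+Line6 <-> Line7+Line8, etc.
    (A trailing unpaired line is dropped for simplicity.)"""
    pairs = [" ".join(lines[i:i + 2]) for i in range(0, len(lines) - 1, 2)]
    vecs = np.asarray(embed(pairs))
    return [cosine(vecs[i - 1], vecs[i]) for i in range(1, len(vecs))]
```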
You can download the final similarity scores (each row is a single line or two-line combination, with several rolling averages in the columns). The first tab includes the points at which the major stories occurred. Note that we did not exclude advertisements here, but in a production application the AIF dataset would be used to mask them.
While not perfect, the results are quite promising, suggesting significant potential in this approach. From further experimentation, it appears that a critical decision point will be the size of the input text and whether a better aggregation level might be seconds of airtime rather than transcript lines, such as dividing the broadcast into 15-second segments (a sketch of such time-based bucketing follows below).
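As a rough illustration of that alternative aggregation level, the following hypothetical sketch groups transcript lines into 15-second airtime buckets; it assumes each line carries a start offset in seconds from the beginning of the broadcast, which is an assumption made purely for this example:

```python
# Hypothetical sketch: aggregate by seconds of airtime rather than by line.
# Assumes (offset_seconds, text) tuples; the field layout is illustrative.
from collections import defaultdict

def bucket_by_airtime(lines_with_offsets, window_seconds=15):
    """Concatenate all transcript lines that start within each 15-second window."""
    buckets = defaultdict(list)
    for offset_seconds, text in lines_with_offsets:
        buckets[int(offset_seconds // window_seconds)].append(text)
    return [" ".join(buckets[k]) for k in sorted(buckets)]

# Each resulting segment could then be embedded and compared to its neighbor
# exactly as in the single-line and dual-line approaches above.
```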
The stream-of-consciousness nature of broadcast speech suggests that even greater accuracy would likely be achieved through domain adaptation, and potentially by incorporating visual cues such as camera changes as a weighting signal (camera changes on their own often occur multiple times within a single story segment, but combined with the transcript they might lend additional information).
Once we determine the ideal segment size for the input text, another possibility would be to publish the raw output vectors for each segment, which would permit global-scale similarity computation and clustering. For example, if the past decade of CNN coverage were divided into 15-second segments and the raw 512-dimension Universal Sentence Encoder vector published for each segment, you could select any 15-second segment and identify all semantically similar segments from the past decade, looking across years to group not by words, but by topical discussion. Similarly, you could easily cluster and visualize CNN coverage, identifying major narrative clusters and even comparing them across shows, topics and channels!
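As a purely speculative sketch of what such published vectors could enable, the following assumes a NumPy matrix with one 512-dimension vector per 15-second segment and shows a simple cosine-based nearest-neighbor lookup plus a basic k-means clustering; none of this reflects a released dataset or tool:

```python
# Speculative sketch: given a (num_segments, 512) matrix of Universal
# Sentence Encoder vectors, find the segments most semantically similar to
# any chosen segment and group the collection into broad clusters.
import numpy as np
from sklearn.cluster import KMeans

def most_similar(vectors, query_index, top_k=10):
    """Indices of the top_k segments most cosine-similar to the query segment."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ normed[query_index]
    ranked = np.argsort(-scores)
    return [i for i in ranked if i != query_index][:top_k]

def cluster_segments(vectors, num_clusters=50):
    """Assign each segment to one of num_clusters narrative clusters."""
    return KMeans(n_clusters=num_clusters, random_state=0).fit_predict(vectors)
```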
We're excited to see where this work takes us!