Google's AI Workshop offers a rich collection of experiments spanning the frontiers of AI research. Unlike Google's production AI services and APIs, the AI Workshop is where Google previews active areas of frontier research, allowing its research teams to engage with advanced external users to test new ideas, gather feedback and evolve pioneering tools as they advance from the research lab to real-world applications.
One of these experiments is Google's "Semantic Similarity For Natural Language" API, which takes a pair of sentences and calculates the probability that the former entails the latter. A pair of sentences like "What's the weather like outside?" and "Golden retrievers are dogs" will be scored as highly unrelated, whereas "What's the weather like outside?" and "It looks like it might rain in a bit" will be scored as very similar, indicating that the first sentence entails the second (order matters).
Unlike traditional similarity metrics, the Similarity API abstracts away from the literal words used in each sentence and instead focuses on the high-level semantic topics and activities they describe. This means that sentences need not share any common words – it is the concepts they describe upon which they will be evaluated.
Could such a tool be used to split television news broadcasts into their component stories?
A typical television evening news broadcast recounts multiple stories over its 30-minute duration, running from story to story to story. Unlike with commercial breaks, there are no black screens, audio silences or other clear cues to delineate stories, and stories may even be interwoven. Camera change detection is also of little use, since a single story may be told through a sequence of shots alternating between the studio and multiple field locations. This means segmenting a broadcast into stories requires semantically evaluating its contents.
In theory, the closed captioning of such a broadcast should exhibit high similarity during a given story and low similarity during the transition period between stories. Thus, taking each sentence of closed captioning during a broadcast and comparing it to the previous sentence should yield a similarity score from the API that is high during a given story segment and drops sharply in between segments. Traditional clustering approaches struggle with the fast-paced stream-of-consciousness narrative of broadcast television, but because the API abstracts away from the literal wording of each sentence, that narrative style should pose less of an obstacle to it.
To test this hypothesis, the closed captioning of January 15, 2020's ABC World News Tonight With David Muir from the Internet Archive's Television News Archive was broken into sentences. Using the Similarity API, each sentence was compared with the sentence immediately before it (the "One" column), the sentence two before it (the "Two" column) and the sentence three before it (the "Three" column). To test slightly longer segments, each pair of two sentences was then compared with the two preceding sentences (the "Two-Two" column).
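The comparison scheme described above can be sketched roughly as follows. This is only an illustration under several assumptions: `score_fn` is a hypothetical stand-in for a call to the Similarity API (whose actual client interface is not shown here), the earlier sentence is passed first because the API is order-sensitive, and the "Two-Two" column is interpreted as the current and previous sentences joined together and compared against the two sentences before them. The `toy_overlap` function is a simple word-overlap placeholder used only so the sketch runs; it is not how the API scores sentences.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: (earlier_text, later_text) -> similarity score.
ScoreFn = Callable[[str, str], float]


def build_rows(sentences: List[str], score_fn: ScoreFn) -> List[Dict[str, Optional[float]]]:
    """Score every caption sentence against the sentences that precede it."""
    rows = []
    for i, current in enumerate(sentences):
        row: Dict[str, Optional[float]] = {
            "One": None, "Two": None, "Three": None, "Two-Two": None,
        }
        # Compare with the sentence 1, 2 and 3 positions earlier, when one exists.
        for name, lag in (("One", 1), ("Two", 2), ("Three", 3)):
            if i - lag >= 0:
                row[name] = score_fn(sentences[i - lag], current)
        # Assumed reading of "Two-Two": sentences (i-3, i-2) versus (i-1, i).
        if i >= 3:
            earlier = " ".join(sentences[i - 3:i - 1])
            later = " ".join(sentences[i - 1:i + 1])
            row["Two-Two"] = score_fn(earlier, later)
        rows.append(row)
    return rows


def toy_overlap(a: str, b: str) -> float:
    """Placeholder scorer (Jaccard word overlap); NOT the Similarity API."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0
```

In practice `toy_overlap` would be replaced by a function that sends the two strings to the API and returns its score, leaving the rest of the loop unchanged.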
The final spreadsheet with all of the similarity scores can be downloaded here. Each row represents a single line of closed captioning, with the "Timecode" column recording its offset in seconds from the start of the broadcast and the URL column containing the URL of the clip on the Archive's website. The columns reflect the raw scores as output by the API, and particularly sharp drops were manually highlighted, with darker colors indicating a larger change. Note that due to a recording glitch in the original source video file, the timecodes reset on row 43, so the URLs to the video clips may be slightly off.
As predicted, similarity scores are very high during a story and drop sharply during the transition periods. Interestingly, drops in similarity scores within several of the stories reflect changes in the narration and framing of the story, such as pivoting from Congress' impeachment actions to the backstory behind the impeachment.
The extremely promising results here suggest that with additional fine-tuning of the sentence inputs (such as using a larger rolling window) and adaptive thresholding, it should be possible to fully automatically segment arbitrary television news broadcasts into their component stories and even go a step further towards breaking stories into specific frames and narratives, all without any human intervention.
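As one possible illustration of what adaptive thresholding over a rolling window might look like, the sketch below flags a sentence as a candidate story boundary when its score falls well below the median of the scores just before it. This is an assumption about how such a step could work, not the method used to produce the spreadsheet (where drops were highlighted manually). The function name, the window size, the multiplier `k` and the `min_drop` floor are all hypothetical choices, and the median and median absolute deviation are used rather than the mean and standard deviation so that one earlier dip does not loosen the threshold for the next.

```python
from statistics import median
from typing import List, Optional


def find_boundaries(
    scores: List[Optional[float]],
    window: int = 5,        # how many preceding scores to look at (assumed)
    k: float = 3.0,         # multiplier on the median absolute deviation (assumed)
    min_drop: float = 0.1,  # smallest drop that can count as a boundary (assumed)
    min_history: int = 3,   # need at least this many prior scores to judge
) -> List[int]:
    """Return indexes whose score dips sharply below the recent rolling median."""
    boundaries = []
    for i, score in enumerate(scores):
        if score is None:
            continue
        # Preceding scores inside the rolling window, skipping missing values.
        history = [s for s in scores[max(0, i - window):i] if s is not None]
        if len(history) < min_history:
            continue
        center = median(history)
        mad = median(abs(s - center) for s in history)
        # The threshold adapts to how noisy the recent scores have been.
        threshold = center - max(k * mad, min_drop)
        if score < threshold:
            boundaries.append(i)
    return boundaries
```

Run on a single column of scores such as "One", the returned indexes would mark the rows where a new story may begin; any real deployment would need the parameters tuned against broadcasts with known story boundaries.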
We're tremendously excited to explore this further!