The GDELT Project

Video AI: The Trade-offs Of Translating Frame-Level Annotations Into Per-Second Aggregations

Video understanding systems like Google's Cloud Video API typically annotate video content at the level of the individual frame, reporting start and end timestamps at nanosecond resolution. This enables a wealth of advanced use cases, such as frame-level video recall and asset management for professional video applications. When it comes to content analysis research and enabling user-friendly visual search of archives like the Internet Archive's Television News Archive, however, most users think in terms of wallclock airtime seconds, not nanosecond-precise frame boundaries. Telling a climate change researcher that there was a depiction of a tropical rainforest on the evening news from nanoseconds 673483396000 to 673647951000 of a video, for a total of 164555000 nanoseconds of footage, is typically less meaningful than telling them it ran for 0.16 seconds straddling time offsets 673 to 674 of the video. Most importantly, many consumer video interfaces, such as the Television News Archive's, only robustly support whole-second time offsets, requiring all start/stop offsets to be rounded to the nearest full second.
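The downgrade itself is simple arithmetic. As a rough sketch (working from bare nanosecond offsets rather than the API's actual response structure, which we simplify here), rounding each boundary to the nearest whole second reproduces the rainforest example above:

```python
# A minimal sketch (not the API's actual response schema): downgrade
# nanosecond-precision annotation offsets to whole-second airtime offsets
# by rounding each boundary to the nearest second.

NANOS_PER_SECOND = 1_000_000_000

def to_second_offsets(start_ns: int, end_ns: int) -> tuple[int, int]:
    """Round nanosecond start/end offsets to the nearest whole second of airtime."""
    return round(start_ns / NANOS_PER_SECOND), round(end_ns / NANOS_PER_SECOND)

# The rainforest example above: a 0.16-second shot becomes offsets 673 to 674.
print(to_second_offsets(673_483_396_000, 673_647_951_000))  # (673, 674)
```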

Downgrading from nanoseconds to seconds can create unique complications when presenting the machine's precise frame-centric annotations to airtime-centric humans, especially during fast-paced commercials and news show preview reels that depict a rapid-fire summary of the evening's stories.

Take the advertisement for "The Last Full Measure" that appeared at 6:52PM PST on CBS Evening News With Norah O'Donnell on January 23rd, 2020. In the space of a single second during one portion of the ad, there is a Medal of Honor in its display box, a closeup of a military veteran sitting amidst a forested backdrop, a closeup of a helicopter, a wide shot of a helicopter flying through the jungle, and a closeup of a memorial wall. Google's Cloud Video API correctly annotates each of these depictions to its precise frame boundaries during this single second of airtime. In the course of downgrading those frame-level annotations to second-level annotations for researchers, however, and attempting to generate a single thumbnail image that captures this second of airtime, the resulting downgraded annotation set muddies the analytic waters, making it appear as if all five visual narratives appeared at once in a single instant. On the other hand, given that most content analysis research of video emphasizes second-level annotation rather than precise frame-level annotation, this tradeoff helps make such annotations more amenable to research in fields like communications studies.
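To see how that flattening happens, the sketch below uses invented labels and offsets loosely modeled on the ad described above (they are illustrative, not the API's actual output for this broadcast) and buckets each frame-level segment into every whole second of airtime it overlaps; all five shots end up attached to the same single second:

```python
from collections import defaultdict

NANOS_PER_SECOND = 1_000_000_000

# Hypothetical frame-level segments (label, start_ns, end_ns); the labels and
# offsets are invented for illustration, not taken from the actual API output.
segments = [
    ("Medal of Honor in display box", 412_000_000_000, 412_180_000_000),
    ("veteran, forested backdrop",    412_180_000_000, 412_390_000_000),
    ("helicopter closeup",            412_390_000_000, 412_600_000_000),
    ("helicopter over jungle",        412_600_000_000, 412_810_000_000),
    ("memorial wall",                 412_810_000_000, 413_000_000_000),
]

# Bucket each segment into every whole second of airtime it overlaps.
per_second = defaultdict(set)
for label, start_ns, end_ns in segments:
    first_sec = start_ns // NANOS_PER_SECOND
    last_sec = max(first_sec, (end_ns - 1) // NANOS_PER_SECOND)
    for sec in range(first_sec, last_sec + 1):
        per_second[sec].add(label)

for sec in sorted(per_second):
    print(sec, sorted(per_second[sec]))
# Second 412 now carries all five labels at once, erasing their sub-second ordering.
```

A single representative thumbnail chosen for that second likewise has to favor just one of the five shots, which is precisely the muddying effect described above.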

Thankfully, such scenarios account for only a small portion of airtime, though the same situation plays out throughout news broadcasts during the transition seconds when one story or scene gives way to the next.

The question of how best to balance the nearly unimaginable precision of frame-level machine annotations used by today's video AI systems with the airtime-centric second-level resolution typically used in traditional human-centric video research is one that we are enormously interested in and hope to explore with all of you as we move forward with these pioneering efforts in rendering television news "computable" in support of non-consumptive research.