LMMs & Google's Gemini 1.5 Pro Watching Television News: The Instability Of Summarizing Video

Yesterday we explored the video analysis capabilities of Google's Gemini 1.5 Pro model by asking it to analyze a 6-minute clip of Russian television news and describe what it saw through a series of prompts that tested its performance across a range of analytic dimensions. One prominent observation from those experiments was the high degree of instability in Gemini's responses: LMMs, like LLMs, can produce very different results from run to run. Let's explore this in more detail by asking Gemini to "describe this video in detail" six times, each time in a brand-new session from scratch, and comparing the results.

The end result is that the first four of the six runs contain just a handful of stories and exclude the majority of the stories in the clip, and one of those four contains almost no stories from the clip at all. The fifth run is the only one that contains a reasonable summary of the full mix of stories in the clip, while the sixth run returns to the same limited summary. In other words, across six separate brand-new sessions, five responses listed only the same few stories (one of them almost none) and just one offered a reasonable mix of the full set of stories. From a consistency and stability standpoint, only one run in six produced a reasonable-quality response. Yet even that response was far lower quality than yesterday's responses. It is unclear whether this results from continual model tuning and updates that changed the model between yesterday and today or whether we simply got lucky in yesterday's runs. Either way, it suggests that multiple runs over a video clip may be required to get a comprehensive result, though it is unclear what metric might be used to determine which run is the most accurate.
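One rough way to put a number on this run-to-run instability is to measure how much vocabulary each pair of responses shares. The sketch below is only an illustration and not the method used for the comparisons in this post: it assumes the six responses have been saved as plain strings, and the function names (tokens, jaccard, pairwise_jaccard, mean_agreement) are our own. Note that it measures agreement between runs, not accuracy, so it does not answer the question of which run is most correct.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a response and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two responses' word sets (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty responses are trivially identical
    return len(ta & tb) / len(ta | tb)


def pairwise_jaccard(responses):
    """Map each pair of run indexes (i, j), i < j, to its similarity."""
    return {
        (i, j): jaccard(responses[i], responses[j])
        for i, j in combinations(range(len(responses)), 2)
    }


def mean_agreement(responses):
    """Average pairwise similarity; low values suggest unstable output."""
    scores = pairwise_jaccard(responses)
    if not scores:
        return 1.0  # fewer than two runs: nothing to disagree with
    return sum(scores.values()) / len(scores)
```

Calling mean_agreement on the list of six saved responses would yield a single stability score, while pairwise_jaccard shows which runs resemble one another. Word overlap is a crude proxy: two responses can share most of their vocabulary and still describe different stories.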

Describe this video in detail.

When we ran this prompt yesterday, we got relatively detailed descriptions of the video, while today we get the following:

The video is a compilation of clips from various sources, including news broadcasts, podcasts, and social media. The clips are all related to the war in Ukraine and the international response to it. The video begins with a clip of a news broadcast showing the destruction of a building in Ukraine. This is followed by a clip of President Putin giving a speech in which he praises the Russian military and vows to continue the war. The video then shows clips of various people around the world reacting to the war, including protests, speeches, and acts of kindness. The video ends with a clip of President Putin giving another speech in which he praises Russian women for their strength and resilience.

And again:

Describe this video in detail.

We get similar results:

The video is a news report from Russia. It begins with a montage of different podcasts available on the Smotrim platform. Then, it cuts to a news anchor who introduces the main story: President Putin's address to the nation on International Women's Day. Putin praises the strength and resilience of Russian women, particularly those serving in the military and those who are raising families. He also announces new measures to support families with children, including an increase in child benefits and the creation of new childcare facilities. The video then shows footage of various celebrations taking place across Russia, including concerts, laser shows, and sporting events. It ends with Putin wishing all Russian women a happy International Women's Day.

And a third time:

Describe this video in detail.

Highly similar results:

The video is a news report from Russia. It begins with a montage of different podcasts available on the Smotrim platform. Then, it shows the clock ticking down to the start of the news report. The news report begins with the anchorman welcoming viewers and wishing them a happy International Women's Day. He then introduces a video message from President Putin, who also wishes women a happy holiday and thanks them for their contributions to society. The video then shows footage of various celebrations taking place across the country, including concerts, laser shows, and sporting events. The report ends with the anchorman thanking viewers for watching and wishing them a good day.

And a fourth time:

Describe this video in detail.

Even worse:

The video begins with a montage of podcast covers. It then cuts to a news anchor who introduces the news program "Vesti." The news program covers a variety of topics, including the war in Ukraine, the Polish farmers' protests, and the arrival of spring in Russia. The video ends with a message from Vladimir Putin congratulating women on International Women's Day.

And a fifth time:

Describe this video in detail.

This time we get something vastly better than our previous attempts:

The video is a compilation of clips from various news stories and events. It begins with a man speaking on a podcast, followed by a woman speaking on a podcast. Then, a man in a yellow sweatshirt speaks on a podcast. Next, there is a montage of podcast covers. After that, there is a clock ticking, followed by a news anchor speaking in a studio. Then, there is footage of Vladimir Putin speaking, followed by two astronauts speaking from the International Space Station. Next, there is footage of a man in a military uniform walking down a hallway and then entering a room with two other men in military uniforms. They are all carrying flowers. The men give the flowers to a woman in a white coat. Then, there is footage of a man in a military uniform giving flowers to a woman in a white coat. Next, there is footage of a factory, followed by footage of a fire. Then, there is footage of a drone flying over a field, followed by footage of a destroyed building. Next, there is footage of Joe Biden speaking at the State of the Union address. Then, there is footage of Donald Trump speaking at a rally. Next, there is footage of a protest in Poland, followed by footage of a military convoy in Nigeria. Then, there is footage of a man working in a factory, followed by footage of a ship being loaded. Next, there is footage of a deer in the snow, followed by footage of a panda cub playing with its mother. Then, there is footage of a choir singing, followed by footage of a woman in a traditional Ukrainian dress dancing. Next, there is footage of a game show, followed by footage of a news studio. Then, there is footage of a news anchor speaking, followed by footage of Vladimir Putin speaking. Next, there is footage of a man in a military uniform giving flowers to women on the street. Then, there is footage of mimes giving flowers to women on a train and in an airport.

And a sixth time:

Describe this video in detail.

And we're back to the same limited summary that excludes many of the stories:

The video begins with a man and a woman talking about podcasts. Then, it shows a montage of different podcasts available on the Smotrim platform. After that, it shows a clock ticking down to the start of the news. The news begins with a report on President Putin's address to the nation on International Women's Day. He praises women for their strength, beauty, and wisdom, and says that they are the ones who make the world a better place. He also expresses his gratitude to the women who are serving in the military and to those who are working in hospitals and other volunteer organizations. The news then shows footage of soldiers giving flowers to women on the streets of Yekaterinburg. The video ends with a report on the various events that are taking place across the country to celebrate International Women's Day, including concerts, laser shows, and sporting events.
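To compare runs by how many of the clip's stories they mention, one option is a simple keyword checklist. The sketch below is a hedged illustration: the STORY_KEYWORDS table is a hypothetical, hand-built list drawn from a few of the stories named in the fifth response above, not a verified inventory of the clip, and the function names are our own.

```python
# Hypothetical checklist: story label -> keywords that signal a mention.
# Built by hand from terms in the fifth response; not a ground-truth list.
STORY_KEYWORDS = {
    "Putin address": ["putin"],
    "space station": ["astronaut", "space station"],
    "Biden speech": ["biden"],
    "Trump rally": ["trump"],
    "Poland protest": ["poland", "polish"],
}


def stories_mentioned(response, story_keywords=STORY_KEYWORDS):
    """Return the set of story labels whose keywords appear in the response.

    Uses plain substring matching, so a keyword can also match inside a
    longer word (a known limitation of this sketch).
    """
    text = response.lower()
    return {
        story
        for story, keywords in story_keywords.items()
        if any(keyword in text for keyword in keywords)
    }


def coverage(response, story_keywords=STORY_KEYWORDS):
    """Fraction of checklist stories that the response mentions."""
    return len(stories_mentioned(response, story_keywords)) / len(story_keywords)


# Excerpt of the fourth response quoted above.
run_four = (
    "The news program covers a variety of topics, including the war in "
    "Ukraine, the Polish farmers' protests, and the arrival of spring in "
    "Russia. The video ends with a message from Vladimir Putin "
    "congratulating women on International Women's Day."
)
print(sorted(stories_mentioned(run_four)))
print(coverage(run_four))
```

A checklist like this only rewards mentioning a story, not describing it correctly, and it depends entirely on who compiles the list, so it is at best a partial answer to the question of which run is most accurate.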