Gemini 1.5 Pro's 1 Million Token Model: Can Prompt Engineering Improve Its "Needle In A Haystack" Performance?

Yesterday we were surprised to find that Gemini 1.5 Pro's performance on our RW-NIAH (Real World Needle In A Haystack) tests had dropped sharply from our original experiments on the exact same content just two weeks prior. Two major factors led to this collapse: changes in Gemini's guardrails over that two-week period that prevented it from generating output, and underlying model changes. After we published those results, the product team and several outside experts made suggestions that we explore below. Reordering the prompt does appear to mitigate the guardrail false positives, but despite extensive testing and discussions with other experts, we were unable to identify a set of prompt or data modifications that yielded consistently strong NIAH results on our real-world content.

What did we learn? Changing the order of the prompt and uploaded transcript file eliminated the guardrail false positive in these tests, whereas two weeks ago Gemini's outputs were not order-dependent, showing how quickly model behavior can change in fundamental ways. Per the recommendations we received, we tried a wide range of prompt modifications, including the version used here, in which we describe the file in additional detail, add a "read carefully" instruction and isolate the file from the prompt with dedicated formatting. We also reformatted the transcript from a single blob of machine-generated text into a cleanly formatted paragraph structure, to see whether aligning it better with the kind of transcripts Gemini has likely encountered before would help, but this appeared to have no impact. Finally, we tested one-shot prompting, to no avail. Even with the prompt modifications and file reformatting, we were only able to achieve 0, 1 or 2 matches out of the 26 in the transcript, for 0%, 3.8% or 7.7% accuracy, respectively, compared with 19% accuracy two weeks ago.
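The accuracy figures above are simple ratios of correctly returned mentions to the 26 mentions in the transcript. A minimal sketch of that arithmetic follows; the function name `accuracy_pct` is ours and purely illustrative:

```python
TOTAL_MENTIONS = 26  # mentions of Biden in the transcript, per the counts above


def accuracy_pct(matches: int, total: int = TOTAL_MENTIONS) -> float:
    """Percentage of the transcript's mentions that the model returned."""
    return round(100.0 * matches / total, 1)


# The three outcomes observed across our runs: 0, 1 or 2 matches.
for matches in (0, 1, 2):
    print(f"{matches} of {TOTAL_MENTIONS}: {accuracy_pct(matches)}%")
```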

None of the experts we asked were able to formulate a prompt or transcript format that consistently yielded anywhere near the NIAH accuracy seen in popular social media posts. Why might that be? The answer: overfitting and the limits of domain adaptation, even for foundational models. Look more closely and you will notice that the successful demos going viral on social media tend to involve content that is strongly similar to, or directly represented in, web-scale GenAI training datasets. This means that far from testing foundation model performance on novel content, these demos are merely testing the degree to which models have captured the landscape of their training data. Catching an inserted phrase in a public domain book that is widely republished across the web is not a true test of NIAH performance: it is a byproduct of overfitting. It is also a warning that pundit claims that foundational models are so large they do not require domain adaptation are false. Even the largest and most powerful SOTA foundational models exhibit markedly weaker performance the more the content domain (in subject matter, format or task) deviates from their training data. Thus, it is not the case that a sufficiently large foundational model can be applied to all use cases without any tuning. Rather, even foundational models will require tuning in proportion to how far the use case deviates from the typical consumer use patterns for which many models are optimized.

Let's start with our original Russia 1 transcript extract, transcribed into Russian by Chirp and translated into English by Google Translate:

We'll try our original prompt:

[FILE]
Find every single mention of Biden in the transcript and provide a brief snippet of the original transcript verbatim for each mention without changing in any way from the source text.

We get the same result as before – the model fails to produce output and we get a fatal error, despite setting all adjustable content safety filters to their lowest possible settings:

Content Not Permitted.

Because we have historically used mainly the Bison, Unicorn and Gemini programmatic APIs, which separate the prompt and data fields, we had assumed that the ordering of the prompt and file upload in AI Studio would not matter. It turns out that this is not the case: in AI Studio the prompt should always be placed first, with the file following, as they are concatenated together for model processing. Two weeks ago this ordering did not matter, as Gemini produced results regardless of whether the file or the prompt came first. However, guardrail changes in the last two weeks made the model vastly more sensitive to prompt/file ordering.
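To make the ordering issue concrete, here is a minimal sketch of the two concatenation orders. The helper `build_input` is hypothetical: it only mimics the concatenation described above and does not call any Gemini or AI Studio API, whose internal assembly we have not seen.

```python
PROMPT = (
    "Find every single mention of Biden in the transcript and provide a brief "
    "snippet of the original transcript verbatim for each mention without "
    "changing in any way from the source text."
)


def build_input(prompt: str, transcript: str, prompt_first: bool = True) -> str:
    """Concatenate prompt and file text in the chosen order (illustrative only)."""
    parts = [prompt, transcript] if prompt_first else [transcript, prompt]
    return "\n".join(part.strip() for part in parts)


# "[FILE]" stands in for the uploaded transcript, as in the prompts shown here.
file_first = build_input(PROMPT, "[FILE]", prompt_first=False)  # our original order
prompt_first = build_input(PROMPT, "[FILE]")                    # the recommended order
print(prompt_first)
```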

Let's try reversing the order so that the prompt is encountered before the file:

Find every single mention of Biden in the transcript and provide a brief snippet of the original transcript verbatim for each mention without changing in any way from the source text.   
[FILE]

This time we consistently got results without the guardrail false positive. However, the results are even worse than before, with the model correctly identifying just 1 of the 26 mentions of Biden in the text:

Mentions of Biden in the Transcript:

There is only one mention of Biden in the provided transcript:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This snippet discusses the potential transfer of long-range missiles from the Biden administration to Ukraine, which would allow them to strike deeper into Russian-controlled territory.

Running a second time in a fresh session yields nearly identical results – again it recognizes the same solitary mention, but adds an extra sentence to it:

Mentions of Biden in the Transcript:

There is only one mention of Biden in the provided transcript:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula. Previously , a small batch of cluster missile attacks with a range of up to 165 km had already been transferred to Ukraine.”

This snippet discusses the potential transfer of long-range missiles from the Biden administration to Ukraine, which would allow them to strike deeper into Russian-controlled territory, including Crimea.
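For comparison, the task itself is one that a plain string search handles deterministically. The sketch below is a hypothetical baseline, assuming a mention means a case-insensitive whole-word occurrence of "Biden"; it is not a description of how the count of 26 was produced. The toy text is the example transcript from our one-shot prompt, not the broadcast itself.

```python
import re


def find_mentions(transcript: str, name: str = "Biden", context: int = 60):
    """Return (offset, snippet) pairs for each whole-word occurrence of name."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    mentions = []
    for match in pattern.finditer(transcript):
        # Slice a window around the match so the snippet is verbatim source text.
        start = max(0, match.start() - context)
        end = min(len(transcript), match.end() + context)
        mentions.append((match.start(), transcript[start:end]))
    return mentions


toy = (
    "President Biden announced new sanctions today. He made the announcement "
    "while visiting the country. They held a press conference together. "
    "The Prime Minister criticized Biden's decision. However, it made no difference."
)
for offset, snippet in find_mentions(toy):
    print(offset, repr(snippet))
```

Because every snippet is sliced directly from the input, such a baseline cannot paraphrase or omit a mention, which makes it a useful yardstick for the model outputs.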

In discussions with the product team, they suggested reworking the prompt to clearly isolate the uploaded file and provide additional detail about it, leading to this new prompt:

Find every single mention of Biden in the transcript below and provide a brief snippet of the original transcript verbatim for each mention without changing in any way from the source text.

This is a Russian television news broadcast which has been transcribed by ASR and machine translated into English and may contain errors. Read it carefully:
####START FILE####
[FILE]
####END FILE####

Unfortunately, this failed to produce any results at all:

The provided transcript does not contain any mentions of Biden.

Running again in a fresh session yielded the same result:

The transcript provided does not mention Biden.

Running a third time yields the same single mention that the earlier prompt returned:

Mentions of Biden in the Transcript:

There is only one mention of Biden in the provided transcript:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This snippet reports on an NBC News claim that Biden is planning to provide Ukraine with long-range missiles capable of striking Crimea.
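For anyone repeating this test over many files, the delimited prompt above can be assembled programmatically. This is a minimal sketch: the function name `build_delimited_prompt` is ours, and the strings are copied from the prompt shown above.

```python
INSTRUCTION = (
    "Find every single mention of Biden in the transcript below and provide a "
    "brief snippet of the original transcript verbatim for each mention without "
    "changing in any way from the source text."
)
DESCRIPTION = (
    "This is a Russian television news broadcast which has been transcribed by "
    "ASR and machine translated into English and may contain errors. "
    "Read it carefully:"
)


def build_delimited_prompt(transcript: str) -> str:
    """Wrap the transcript in explicit start/end markers after the instructions."""
    return (
        f"{INSTRUCTION}\n\n{DESCRIPTION}\n"
        f"####START FILE####\n{transcript}\n####END FILE####"
    )


print(build_delimited_prompt("[FILE]"))
```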

The product team also suggested reformatting the transcript to make it look more like a human-crafted transcript than a single blob of endless text. To this end, we wrote a trivial Perl script that adds a blank line at the first sentence boundary after every 150 words, restructuring the mechanized stream transcript into a paragraph-based transcript that more closely matches the look and structure of the kind of transcripts that Gemini is likely most familiar with:

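#!/usr/bin/perl
# Insert a blank line at the first sentence boundary after every 150 words.
use strict; use warnings;

@ARGV == 1 or die "Usage: $0 FILE\n";
my $in = $ARGV[0];
open(my $fh, '<:encoding(UTF-8)', $in) or die "Cannot read $in: $!\n";
my $buffer = do { local $/; <$fh> };  # slurp the whole transcript
close($fh);
open(my $out, '>:encoding(UTF-8)', "$in.clean.txt") or die "Cannot write $in.clean.txt: $!\n";
my $wrote = 0;  # words written since the last paragraph break
foreach my $word (split /\s+/, $buffer) {
    print $out "$word "; $wrote++;
    # Break only on a sentence-ending word once at least 150 words have accumulated.
    if ($word =~ /[.?!]$/ && $wrote >= 150) { print $out "\n\n"; $wrote = 0; }
}
close($out);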

You can download the reformatted version of the transcript extract below. This is exactly the same file as we've been using up until now, simply with blank lines added at the first sentence boundary after every 150 words to structure it into paragraph form:

Let's try the same prompt as above on this reformatted file. Unfortunately, we get almost exactly the same response as before:

Mentions of Biden in the Transcript:

There is only one mention of Biden in the provided transcript:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This snippet reports on an NBC News claim that Biden is planning to provide Ukraine with long-range missiles capable of striking Crimea.

And again:

There is only one mention of Biden in the transcript:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This snippet reports on an NBC News claim that Biden is planning to provide Ukraine with long-range missiles capable of striking Crimea.

And a third time – this time we get two mentions:

Mentions of Biden in the Transcript:

There are two mentions of Biden in the transcript:

1.

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This mentions a news report claiming that Biden will soon be sending long-range missiles to Ukraine, which would allow them to strike deeper into Russian-controlled territory.

2.

“So, Biden just announced that he is ready to discuss with the Speaker of the House Representatives Johnson assistance to Ukraine to persuade him, a strong man, to transfer money Zelensky.”

This mentions Biden's intention to meet with the Speaker of the House to discuss and persuade him to approve aid for Ukraine.

And a fourth time – this time no results:

I have reviewed the transcript and there are no mentions of Biden in the provided text.

A fifth time yields the same set of two results:

Mentions of Biden in the Transcript:

There are two mentions of Biden in the transcript:

1.

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This mentions a news report claiming that Biden will soon be sending long-range missiles to Ukraine, which would allow them to strike deeper into Russian-controlled territory.

2.

“So, Biden just announced that he is ready to discuss with the Speaker of the House Representatives Johnson assistance to Ukraine to persuade him, a strong man, to transfer money Zelensky.”

This mentions Biden's intention to meet with the Speaker of the House to discuss and persuade him to approve aid for Ukraine.

A sixth time yields just the original single result:

I have carefully reviewed the transcript again and confirm that there is only one mention of Biden:

Snippet:

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This snippet reports on an NBC News claim that Biden is planning to provide Ukraine with long-range missiles capable of striking Crimea.

There are no other mentions of Biden in the provided transcript.

A seventh time yields again no results:

There are zero mentions of Biden in the provided text.

An eighth time returns us to the two results:

Mentions of Biden in the Transcript:

There are two mentions of Biden in the transcript:

1.

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This mentions a news report claiming that Biden will soon be sending long-range missiles to Ukraine, which would allow them to strike deeper into Russian-controlled territory.

2.

“So, Biden just announced that he is ready to discuss with the Speaker of the House Representatives Johnson assistance to Ukraine to persuade him, a strong man, to transfer money Zelensky.”

This mentions Biden's intention to meet with the Speaker of the House to discuss and persuade him to approve aid for Ukraine.

And a ninth time yields the same:

The transcript is quite long, but there are only two mentions of Biden:

1.

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

This mentions a news report claiming that Biden will soon be sending long-range missiles to Ukraine, which would allow them to strike deeper into Russian-controlled territory.

2.

“So, Biden just announced that he is ready to discuss with the Speaker of the House Representatives Johnson assistance to Ukraine to persuade him, a strong man, to transfer money Zelensky.”

This mentions Biden's intention to meet with the Speaker of the House to discuss and persuade him to approve aid for Ukraine.

And a tenth time:

I have reviewed the entire transcript and there are no mentions of Biden in the provided text.
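The ten runs above on the reformatted file can be tallied to show how unstable the output is from run to run. A minimal sketch, with the per-run counts read off the outputs above in order:

```python
from collections import Counter

# Mentions returned in each of the ten runs on the reformatted file, in order.
RUN_MATCHES = [1, 1, 2, 0, 2, 1, 0, 2, 2, 0]
TOTAL_MENTIONS = 26  # mentions of Biden in the transcript

distribution = Counter(RUN_MATCHES)
mean_matches = sum(RUN_MATCHES) / len(RUN_MATCHES)
mean_accuracy = 100.0 * mean_matches / TOTAL_MENTIONS

print(dict(sorted(distribution.items())))          # {0: 3, 1: 3, 2: 4}
print(f"mean matches per run: {mean_matches:.1f}")  # 1.1
print(f"mean accuracy: {mean_accuracy:.1f}%")       # 4.2%
```

Three runs found nothing, three found one mention and four found two, so identical input produced three different answers across ten fresh sessions.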

Finally, let's try a one-shot example:

Find every single mention of Biden in the transcript below and provide a brief snippet of the original transcript verbatim for each mention without changing in any way from the source text.

Here is an example to guide you:
TRANSCRIPT: President Biden announced new sanctions today. He made the announcement while visiting the country. They held a press conference together. The Prime Minister criticized Biden's decision. However, it made no difference.
OUTPUT:
1) President Biden announced new sanctions today.
2) The Prime Minister criticized Biden's decision.

This is the Russian television news broadcast which has been transcribed by ASR and machine translated into English and may contain errors. Read it carefully: 
####START FILE####
[FILE]
####END FILE####
Find every mention of Biden in the transcript above and provide a brief snippet of the original transcript verbatim for each mention without changing in any way from the source text.

Unfortunately, the results do not change:

Following your example, here are the mentions of Biden in the transcript, with brief snippets provided verbatim:

1.

“NBC news, allegedly very soon, Biden will transfer missile attacks to Kiev with increased accuracy, with an increased range of up to 300 km, which will allow the Ukrainian Armed Forces to strike deep into the Crimean peninsula.”

2.

“So, Biden just announced that he is ready to discuss with the Speaker of the House Representatives Johnson assistance to Ukraine to persuade him, a strong man, to transfer money Zelensky.”

These are the only two instances where Biden is mentioned in the provided transcript.
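Since every prompt above asks for snippets "verbatim", each returned snippet can also be checked mechanically against the source text. The sketch below is a hypothetical check, assuming that whitespace differences and curly versus straight quotation marks should not count as changes; the helper names are ours, and the toy transcript is only a stand-in for the real file.

```python
import re


def normalize(text: str) -> str:
    """Unify curly double quotes and collapse whitespace runs."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(snippet: str, transcript: str) -> bool:
    """True if the snippet, minus its enclosing quotes, occurs in the transcript."""
    return normalize(snippet).strip('"') in normalize(transcript)


toy_transcript = "So, Biden just announced   that he is ready to discuss assistance."
print(is_verbatim("\u201cSo, Biden just announced that he is ready\u201d", toy_transcript))
print(is_verbatim("Biden announced he is ready", toy_transcript))
```

The first call matches because only spacing and quote style differ; the second fails because words were dropped, which is exactly the kind of silent paraphrase the "verbatim" instruction is meant to prevent.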