The GDELT Project

Generative AI Experiments: Creativity As Prompt Vs Temperature In Google's Bison, Unicorn & Gemini Pro Models

Historically, the concept of "creativity" was largely controlled in LLMs through a combination of prompt-level instructions and the API-level temperature setting, with temperature playing an outsized role in how deterministic or random the output was. Only a temperature setting of 0.0 could guarantee deterministic output, while only higher settings would typically make the output vary dramatically from run to run irrespective of prompt. Earlier today we discovered that GCP's Bison model appears to no longer be as sensitive to temperature, while Unicorn and Gemini Pro similarly show far less sensitivity to it than earlier-generation LLMs. Let's explore this further, testing the influence of a prompt-level request for creativity versus the API-level temperature setting, as well as the influence of the output token window.

The end result is that expanding the output token window yields slightly higher creativity by itself, even if all other settings remain the same, suggesting the larger window gives the model more room to vary its output. Increasing temperature by itself, without changing the prompt, yields nearly identical results across runs, suggesting that in current-generation models temperature is no longer a useful setting for encouraging output creativity and that models have been tuned away from the influence of that API parameter. Instead, the most divergent (and thus "creative") outputs appear to require both a maximal temperature setting AND a prompt explicitly requesting highly creative output.

Thus, to replicate the previous LLM behavior of high temperature settings resulting in wildly different outputs each time the model is run, the latest-generation LLMs require the combination of a high temperature setting AND a prompt explicitly requesting high creativity – neither by itself is sufficient anymore.

BISON

Let's start with maxOutputTokens of 256 and a prompt of "Summarize this news article." with temperature left at its default of 0.0. The deterministic nature of the 0.0 default temperature can be seen in the unchanged results across runs:
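For readers wanting to reproduce this setup, a minimal sketch of what such a call might look like through the Vertex AI Python SDK is shown below. The project ID, article text, and exact model version are placeholders, and this is not necessarily the exact code used for these tests:

```python
# Minimal sketch of a text-bison call via the Vertex AI Python SDK.
# PROJECT_ID and article_text are placeholders; exact model versions may vary.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="PROJECT_ID", location="us-central1")

article_text = "..."  # full text of the news article to summarize

model = TextGenerationModel.from_pretrained("text-bison")
response = model.predict(
    "Summarize this news article. " + article_text,
    temperature=0.0,        # default: fully deterministic output
    max_output_tokens=256,  # the REST API's maxOutputTokens parameter
)
print(response.text)
```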

We'll up maxOutputTokens to 1000 and keep the prompt of "Summarize this news article." The output length does not change, but the content does change considerably, suggesting the larger output window affords the model greater flexibility even though the total number of output tokens does not change:

Keeping maxOutputTokens at 1000 and raising temperature to 0.99, we see a wholesale change in the output compared with the 0.0 runs, but all three runs are identical to one another, suggesting that adjusting temperature has limited impact on Bison:

Finally, keeping maxOutputTokens at 1000 and temperature at 0.99, let's try incorporating a creativity demand into the prompt itself to test whether the models have been tuned to move creativity out of the API parameter and into the prompt:

Creatively summarize this news article. Try to use very different language than found in the news article and be as creative as possible in summarizing it.
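As a sketch (assuming the same model and article_text as in the earlier snippet), the only changes from the baseline call are the prompt text and the two sampling parameters:

```python
# Same setup as the earlier sketch; only the prompt and sampling parameters change.
creative_prompt = (
    "Creatively summarize this news article. Try to use very different language "
    "than found in the news article and be as creative as possible in summarizing it. "
)

for run in range(3):  # repeat to compare run-to-run divergence
    response = model.predict(
        creative_prompt + article_text,
        temperature=0.99,        # near-maximal randomness
        max_output_tokens=1000,  # larger output window
    )
    print(f"--- Run {run + 1} ---\n{response.text}\n")
```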

Despite the maximal temperature setting and purpose-crafted creativity prompt, the first two summaries differ by just two words out of 315. Only in the third run do we get dramatically different results. Yet running a fourth time yields the exact same results as the second run, with zero differences, suggesting considerable creativity limits:
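For those quantifying this kind of run-to-run divergence themselves, a simple word-level diff is one possible approach (not necessarily how the counts above were computed):

```python
# Illustrative word-level diff between two runs (one possible approach only).
import difflib

def words_changed(a: str, b: str) -> int:
    """Count the words that differ between two summaries."""
    matcher = difflib.SequenceMatcher(None, a.split(), b.split())
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")

# Usage: words_changed(run1_text, run2_text)
```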

What if we use the same creative prompt, but reset the temperature back to the API default (i.e., we don't specify it)? Here, too, we get a highly creative output, but due to the API default of 0.0 temperature, we'll get this same output on each run:

If we increase temperature to 0.2 and repeat the same creative prompt, we get a shorter response that is highly similar to the above, suggesting a higher temperature is needed for more randomness:

UNICORN

Let's start with maxOutputTokens of 256 and a prompt of "Summarize this news article." with temperature left at its default of 0.0:
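Unicorn is served through the same PaLM text interface, so in sketch form only the model name changes from the Bison example (the exact version suffix, e.g. "@001", may vary by environment):

```python
# Sketch: same PaLM text interface as Bison, with only the model name swapped.
unicorn = TextGenerationModel.from_pretrained("text-unicorn@001")
response = unicorn.predict(
    "Summarize this news article. " + article_text,
    temperature=0.0,
    max_output_tokens=256,
)
print(response.text)
```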

We'll up maxOutputTokens to 1000 and keep the prompt of "Summarize this news article." Unlike Bison, the output remains entirely unchanged:

Keeping maxOutputTokens at 1000 and raising temperature to 0.99, we see a wholesale change in the output. Unlike with Bison, this time we actually see a change between the first and second outputs, suggesting the key is giving the model greater token room in which to work. However, the third output is identical to the first, suggesting output is still highly constrained:

Finally, keeping maxOutputTokens at 1000 and temperature at 0.99 and using the prompt "Creatively summarize this news article. Try to use very different language than found in the news article and be as creative as possible in summarizing it." we get the following. This time the results differ on every run:

GEMINI PRO

Let's start with maxOutputTokens of 256 and a prompt of "Summarize this news article." with temperature left at its default of 0.9:
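Gemini Pro is called through the SDK's GenerativeModel interface rather than TextGenerationModel; a rough sketch follows (the module path differs across SDK versions, e.g. vertexai.preview.generative_models vs. vertexai.generative_models):

```python
# Sketch of a Gemini Pro call; module path may differ across SDK versions.
from vertexai.preview.generative_models import GenerativeModel

gemini = GenerativeModel("gemini-pro")
response = gemini.generate_content(
    "Summarize this news article. " + article_text,
    generation_config={
        "max_output_tokens": 256,
        # temperature omitted to use Gemini Pro's default of 0.9
    },
)
print(response.text)
```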

We'll up maxOutputTokens to 1000 and keep the prompt of "Summarize this news article." The output length does not grow and the overall content remains nearly identical:

Keeping maxOutputTokens at 1000 and raising temperature to 0.99, we get different results from the default temperature, but results are identical across runs:

Finally, keeping maxOutputTokens at 1000 and temperature at 0.99 and using the prompt "Creatively summarize this news article. Try to use very different language than found in the news article and be as creative as possible in summarizing it." we get the following. Strangely, all three runs share the identical first 153 words (1,088 characters), suggesting that even with maximal temperature and a prompt designed for maximal creativity, there are strong limits on even Gemini Pro's ability to rephrase:
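One illustrative way to measure how long that shared opening run is (again, not necessarily the method used for the figures above) is a simple common-prefix check:

```python
# Illustrative common-prefix measurement between two runs (word and character counts).
import os

def shared_prefix(a: str, b: str) -> tuple[int, int]:
    """Return (shared leading words, shared leading characters) of two outputs."""
    words = len(os.path.commonprefix([a.split(), b.split()]))
    chars = len(os.path.commonprefix([a, b]))
    return words, chars

# Usage: shared_prefix(run1_text, run2_text)
```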