Earlier today we unveiled a glimpse of Prompt Evolver, our internal workflow for fully autonomous, unsupervised LLM prompt engineering and optimization. One of the more intriguing findings from the trivial scaffolding experiment we used to showcase the workflow (and from our own internal work this year) is just how similar LLM-generated summaries are, no matter how divergent the prompts that instruct them.
Given the same text and a state-of-the-art model like GPT-4, the difference between the summaries generated by "Summarize in 2-3 sentences" and "As an analytical thinker, summarize the following text into 2-3 sentences, capturing all key details. Use reasoning to ensure the summary is concise, accurate, and does not include external information" is nearly imperceptible. Run again and again, even with high temperature and other settings meant to encourage randomness and creativity, summarization output is highly predictable, appearing almost templated.
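If you want to see this for yourself, here is a minimal sketch of the comparison, assuming the official OpenAI Python SDK; the model name, file path, and lexical-overlap metric are illustrative placeholders, not our evaluation harness.

```python
import difflib

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_SHORT = "Summarize in 2-3 sentences:\n\n{text}"
PROMPT_LONG = (
    "As an analytical thinker, summarize the following text into 2-3 sentences, "
    "capturing all key details. Use reasoning to ensure the summary is concise, "
    "accurate, and does not include external information:\n\n{text}"
)


def summarize(prompt_template: str, text: str, temperature: float = 1.0) -> str:
    """Generate one summary of `text` using the given prompt template."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; substitute whichever model you are evaluating
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        temperature=temperature,
    )
    return response.choices[0].message.content.strip()


def similarity(a: str, b: str) -> float:
    """Rough lexical overlap between two summaries (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    source_text = open("article.txt").read()  # any source document
    short_summary = summarize(PROMPT_SHORT, source_text)
    long_summary = summarize(PROMPT_LONG, source_text)
    print(f"lexical similarity: {similarity(short_summary, long_summary):.2f}")
    print("--- minimal prompt ---\n" + short_summary)
    print("--- elaborate prompt ---\n" + long_summary)
```

Swapping the sequence-matcher ratio for an embedding-based or ROUGE-style metric tells the same story: the two prompts converge on nearly the same sentences.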
Far from the kind of creative paraphrasing and interpretive reworking employed by human summarizers, LLM summarization with the most advanced models and all the recommended prompting strategies still tends to produce Mad Libs-style summaries that would likely be treated as plagiarism if submitted by a human writer.
While useful for condensing large texts into smaller chunks, advanced LLM summaries raise important unresolved questions about whether they are substantially better than classical SIP (statistically improbable phrase) and TF-IDF-based clause selectors paired with statistical rewriters. The latter approach has the added benefit of never hallucinating or changing meaning, since it only selects and lightly rewrites text that already appears in the source.
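For concreteness, here is a minimal sketch of the kind of extractive baseline we have in mind, not the exact selector we compared against: score sentences by mean TF-IDF weight with scikit-learn's TfidfVectorizer and return the top few in document order. The sentence splitter and file path are illustrative assumptions.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_extractive_summary(text: str, n_sentences: int = 3) -> str:
    """Return the n highest-scoring sentences by mean TF-IDF weight, in document order."""
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= n_sentences:
        return " ".join(sentences)

    # Treat each sentence as its own "document" so rare-but-salient terms score highly.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()

    # Keep the top-scoring sentences, then restore their original order.
    top = sorted(np.argsort(scores)[-n_sentences:])
    return " ".join(sentences[i] for i in top)


if __name__ == "__main__":
    print(tfidf_extractive_summary(open("article.txt").read()))
```

Because this baseline only ever emits sentences verbatim from the source, anything it gets wrong is a selection error rather than an invention, which is exactly the property that makes it a useful yardstick for LLM summaries.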