Generative search is increasingly portrayed as the future of web search: embeddings and other mechanisms retrieve search results, and Large Language Models (LLMs) then summarize those results for human consumption, blending and synthesizing their findings into a single cohesive narrative. In reality, these efforts often struggle to discern intent and to distinguish between the same word used in different contexts, such as "comet" as an astronomical body or a bathroom cleaner. Hallucination has received a fair amount of attention, but an equally concerning area for enterprise use is plagiarism: just how much of the text created by these summarization engines is novel, never-before-written narrative, and how much is simply copy-pasted verbatim from across the web, uncredited? A recent generative search for plant care illustrates that in some cases almost the entire generative result may simply be plagiarized. This raises the question of whether generative search LLMs are tuned for theft over creation or whether, in a web composed of the collective output of much of the (at least Western) world, every possible sentence that could ever be written has already been published.
Needing to refresh my memory on caring for peace lilies recently, I ran a search on a major generative search engine for "how to care for peace lilies?" This yielded the reasonable-sounding summary:
Peace lilies need consistently moist soil, but they don't like to sit in standing water. When the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container. Wait until it stops draining, then set the pot back in its saucer. Peace lilies do well in low-light situations, even flowering with as little as two to four hours of sunlight per day. For best flowering, give them bright, indirect light (not direct sunlight). Peace lilies love humidity, so mist yours every few days or keep her in a humid room, like a kitchen or bathroom. Peace lilies do best in temperatures upwards of 21 degrees. Keep them away from cold drafts and windows during cold weather. Peace lilies don't need a ton of fertilizer. Feed with a balanced houseplant fertilizer, such as 20-20-20, every six to eight weeks during the spring and summer growing season. Deadhead spent blooms and dust or wipe the leaves regularly to ensure they photosynthesise well.
As I quickly skimmed the summary to assess its accuracy, I paused: the text seemed extremely familiar. I instantly recognized nearly verbatim passages from articles I had already read from sites like Gardener's World and House Beautiful, piquing my curiosity as to just how much of a generative search summary may simply be effectively copy-pasted from across the web, with only the most minimal of changes.
Let's take the summary above and break it into sentences (in a few cases we keep multiple sentences together when they appear as a verbatim passage across the web):
- Peace lilies need consistently moist soil, but they don't like to sit in standing water. When the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container. Wait until it stops draining, then set the pot back in its saucer.
- Peace lilies do well in low-light situations, even flowering with as little as two to four hours of sunlight per day.
- For best flowering, give them bright, indirect light (not direct sunlight).
- Peace lilies love humidity, so mist yours every few days or keep her in a humid room, like a kitchen or bathroom.
- Peace lilies do best in temperatures upwards of 21 degrees. Keep them away from cold drafts and windows during cold weather.
- Peace lilies don't need a ton of fertilizer. Feed with a balanced houseplant fertilizer, such as 20-20-20, every six to eight weeks during the spring and summer growing season.
- Deadhead spent blooms and dust or wipe the leaves regularly to ensure they photosynthesise well.
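The splitting step can be approximated in a few lines of Python. This is a minimal sketch, not the procedure actually used for the list above (which was assembled by hand, keeping some sentences together): the helper name `split_sentences` is hypothetical, and the rule it applies (break after sentence-ending punctuation followed by a capital letter) is a naive assumption that would mis-split abbreviations.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'
    and precedes a capital letter. Assumption: no abbreviations."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

# Excerpt of the generative summary quoted earlier in the article.
summary_excerpt = (
    "Peace lilies do well in low-light situations, even flowering with as "
    "little as two to four hours of sunlight per day. For best flowering, "
    "give them bright, indirect light (not direct sunlight). Peace lilies "
    "love humidity, so mist yours every few days or keep her in a humid "
    "room, like a kitchen or bathroom."
)

for sentence in split_sentences(summary_excerpt):
    print("-", sentence)
```

Run on this excerpt, the sketch yields three sentences; merging adjacent sentences that appear together on the web would still be a manual step.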
Now we'll search for each passage using the same search engine that produced the generative summary, to see whether that passage previously appeared anywhere on the web.
We use the same search engine for each search to demonstrate that the underlying search index has seen that specific passage before, and thus that it is likely to have been provided to the summarization LLM or to have formed part of its training data. A passage might exist somewhere on the web but not have been seen by this specific search engine, so by demonstrating that it exists in the engine's index, we can argue that it likely formed part of the summarized search results, the training data, or both.
Given the rapid spread of LLM-generated content across the web, and the fact that the LLMs that power generative search often share heritage, training data or architectural decisions with the LLMs used to generate web content, for each match we'll use the Internet Archive's Wayback Machine to verify that the text appeared on the web in at least one place prior to the widespread public availability of ChatGPT on November 30, 2022. Simply verifying that a passage appears elsewhere on the web is inconclusive: a page created after November 30, 2022 could be partially LLM-generated, and if the LLM used to create that page's text overlaps in any way with the LLM used to construct the generative search summary, the matching text could simply reflect the LLM producing the same output twice rather than plagiarizing human-written text from the web.
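The date check can be partly automated. The sketch below is one possible approach rather than the method used for this analysis: it assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and its documented JSON shape, in which `archived_snapshots.closest` carries a `timestamp` in `YYYYMMDDhhmmss` form. The function names and the `example.com` URL are hypothetical placeholders.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

CUTOFF = datetime(2022, 11, 30)  # public availability of ChatGPT, per the text

def availability_url(page_url: str, cutoff: datetime = CUTOFF) -> str:
    """Build a query asking for the snapshot closest to the cutoff date."""
    query = urlencode({"url": page_url, "timestamp": cutoff.strftime("%Y%m%d")})
    return f"https://archive.org/wayback/available?{query}"

def fetch_availability(page_url: str) -> dict:
    """Network call (not executed here): fetch and decode the JSON reply."""
    with urlopen(availability_url(page_url), timeout=30) as reply:
        return json.load(reply)

def snapshot_before_cutoff(response: dict, cutoff: datetime = CUTOFF):
    """Return (snapshot_url, capture_time) if the closest snapshot predates
    the cutoff, otherwise None. The 'closest' snapshot may fall after the
    cutoff even when earlier captures exist, so None is not proof of absence."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return (closest["url"], taken) if taken < cutoff else None

# Usage (requires network access; the URL is a placeholder):
#   hit = snapshot_before_cutoff(fetch_availability("https://example.com/peace-lily"))
```

A snapshot that predates the cutoff only shows that the page existed then; the archived copy still has to be opened to confirm that the passage itself was present at that date.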
Here is the final analysis, showing that all ten sentences were effectively copy-pasted from across the web, existing somewhere on the web prior to the general availability of ChatGPT and indexed into the search index of the underlying search engine, meaning they were likely available to the LLM either at training or inference time:
- Two-word difference from this page confirmed to July 2022: "Peace lilies need consistently moist soil, but they definitely don’t like to sit in standing water. Whenever the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container. Wait until it stops draining, then set the pot back in its saucer." In fact, this text appears largely verbatim across the web. Single-word difference in the first sentence from MiracleGro: "Peace lilies need consistently moist soil, but they definitely don’t like to sit in standing water." Single-word difference in the second sentence, confirmed to October 2020: "Whenever the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container."
- Single-word difference from this page confirmed to July 2022: "Peace lilies do very well in low-light situations, even flowering with as little as two to four hours of sunlight per day."
- Verbatim confirmed to February 2022: "For best flowering, give them bright, indirect light (not direct sunlight)."
- Verbatim confirmed to May 2022: "Peace lilies love humidity, so mist yours every few days or keep her in a humid room, like a kitchen or bathroom." Note specifically that the generative search result preserves the female pronoun "her" rather than referring to the plant with a neutral pronoun, increasing the likelihood that the text was copied.
- One-word difference confirmed to August 2020: "They do best in temperatures upwards of 21°C." Nearly identical to text confirmed to July 2022: "Peace lilies prefer temperatures upwards of 21C so keep them away from cold drafts and windows, especially in the cooler months." Note specifically how the generative search used the same framing of "upwards of 21 degrees": most text on the web gives a temperature range, so the reference to temperatures above 21 degrees is highly distinctive of this specific passage. In addition, the generative LLM did not attempt to convert 21 Celsius to the Fahrenheit scale in use in the user's identified country of the US, adding further weight to the conclusion that this was plagiarized.
- Verbatim confirmed to February 2022: "don't need a ton of fertilizer. Feed with a balanced houseplant fertilizer, such as 20-20-20, every six to eight weeks during the spring and summer growing season." One-word difference confirmed to July 2022: "While peace lilies don’t need a ton of fertilizer,"
- Verbatim confirmed to June 2021: "Deadhead spent blooms and dust or wipe the leaves regularly to ensure they photosynthesise well."
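The "single-word" and "two-word" differences reported above can be reproduced with a word-level comparison. The following is a rough sketch using Python's standard `difflib` module; the helper names `words` and `changed_words` and the counting rule (each inserted, deleted or replaced word counts once) are assumptions of this sketch, not a description of how the figures above were produced.

```python
import difflib

def words(text: str) -> list[str]:
    """Split on whitespace after normalizing curly apostrophes to straight ones."""
    return text.replace("\u2019", "'").split()

def changed_words(source: str, generated: str) -> int:
    """Count words inserted, deleted or replaced between two passages."""
    a, b = words(source), words(generated)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced run counts as the longer of its two sides.
            changed += max(i2 - i1, j2 - j1)
    return changed

source = ("Peace lilies do very well in low-light situations, even flowering "
          "with as little as two to four hours of sunlight per day.")
generated = ("Peace lilies do well in low-light situations, even flowering "
             "with as little as two to four hours of sunlight per day.")

print(changed_words(source, generated))  # prints 1 (the dropped word "very")
```

Normalizing apostrophes first matters here, since the web sources use the curly form ("don’t") while the generative summary uses the straight one ("don't"); without it, purely typographic differences would be counted as changed words.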
Let's compare the generated and source text.
Here is the original text of each sentence as found on the web:
Peace lilies need consistently moist soil, but they definitely don’t like to sit in standing water. Whenever the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container. Wait until it stops draining, then set the pot back in its saucer. Peace lilies do very well in low-light situations, even flowering with as little as two to four hours of sunlight per day. For best flowering, give them bright, indirect light (not direct sunlight). Peace lilies love humidity, so mist yours every few days or keep her in a humid room, like a kitchen or bathroom. They do best in temperatures upwards of 21°C. Peace lilies prefer temperatures upwards of 21C so keep them away from cold drafts and windows, especially in the cooler months. They don't need a ton of fertilizer. Feed with a balanced houseplant fertilizer, such as 20-20-20, every six to eight weeks during the spring and summer growing season. While peace lilies don’t need a ton of fertilizer. Deadhead spent blooms and dust or wipe the leaves regularly to ensure they photosynthesise well.
And here is the generative summary:
Peace lilies need consistently moist soil, but they don't like to sit in standing water. When the top inch of soil is dry, water the plant until the overflow starts to come out of the bottom of the container. Wait until it stops draining, then set the pot back in its saucer. Peace lilies do well in low-light situations, even flowering with as little as two to four hours of sunlight per day. For best flowering, give them bright, indirect light (not direct sunlight). Peace lilies love humidity, so mist yours every few days or keep her in a humid room, like a kitchen or bathroom. Peace lilies do best in temperatures upwards of 21 degrees. Keep them away from cold drafts and windows during cold weather. Peace lilies don't need a ton of fertilizer. Feed with a balanced houseplant fertilizer, such as 20-20-20, every six to eight weeks during the spring and summer growing season. Deadhead spent blooms and dust or wipe the leaves regularly to ensure they photosynthesise well.
And here is a diff, with the original text on the left and the generative text on the right and the differences highlighted:
In essence, instead of writing its own advice in its own words on caring for peace lilies, the generative search tool merely copy-pasted ten sentences in seven passages from across the web that humans had written, changed a word here or there and presented them as its own work. Of this 272-word passage, only 12 words were changed. If this had been submitted as an assignment by a human student, it would meet every criterion for willful plagiarism.
The near-total plagiarism of this generative search summary raises an existential question about the future of generative search and of LLMs for summarization as a whole: are LLMs plagiarizing from the web far more than was previously known, or is it simply the case that every possible sentence that could ever be authored in English has already been written and published somewhere on the web, such that any sentence written today would overlap a sentence written by someone, somewhere, sometime on earth? The nearly verbatim copying of entire multi-sentence passages and the preservation of idiosyncrasies like gendering a plant and referring to a Celsius cutoff suggest the former, with enormous implications for how companies deploy LLMs for summarization tasks.
One of the underlying assumptions of LLM-generated text is that it is sufficiently novel that a company can republish an LLM summary without fear of the legal risks of copyright infringement or the legal, regulatory and reputational risks of plagiarism. The results here demonstrate that those assumptions may not always be met and suggest that companies should conduct in-depth sentence-by-sentence plagiarism examinations of their LLM summarization solutions over a range of applications, to assess just how novel the LLM's created text truly is and whether they need to adjust temperature and other settings to create more randomness in the output. Indeed, the most likely culprit in this case is that generative search applications are typically configured for exceptionally low creativity to minimize hallucination and misinterpretation. As companies adopt similar strategies for mission-critical applications, plagiarism will be a key metric they will need to assess.
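As a starting point for such an examination, a company could measure how many of a summary's word sequences also appear in the documents supplied to the LLM. The sketch below is one simple way to do that, using overlapping five-word "shingles"; the function names, the choice of five words, and the sample texts are all assumptions made for illustration, and a production check would also need to search the wider web rather than a fixed list of sources.

```python
import re

def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences in the text (lowercased)."""
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", normalized)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(candidate: str, sources: list[str], n: int = 5) -> float:
    """Fraction of the candidate's n-word sequences found in any source."""
    cand = shingles(candidate, n)
    if not cand:
        return 0.0
    seen = set().union(*(shingles(s, n) for s in sources))
    return len(cand & seen) / len(cand)

sources = [
    "Deadhead spent blooms and dust or wipe the leaves regularly to ensure "
    "they photosynthesise well.",
]
copied = ("Deadhead spent blooms and dust or wipe the leaves regularly to "
          "ensure they photosynthesise well.")
reworded = ("Remove faded flowers and clean the foliage often so the plant "
            "can absorb plenty of light.")

print(overlap_fraction(copied, sources))    # prints 1.0
print(overlap_fraction(reworded, sources))  # prints 0.0
```

A score near 1.0 for a sentence indicates it was lifted almost intact, while a score near 0.0 indicates different wording; where to set the threshold between those extremes is a judgment call that depends on the application.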