Do LLMs Truly "Create" Or Merely "Arrange": Just How Much Of An LLM's Writing Is Original?

Continuing our exploration of generative plagiarism, just how much of the text created by a generative LLM is truly novel? Do LLMs produce entirely new text never before seen by human eyes, or do they largely just glue together phrases from their training data like Lego blocks to form their final output? In other words, does an LLM truly "create" works of its own or does it merely "arrange" the works of humans that have gone before? To examine this question in more detail, we'll have ChatGPT write a paragraph on peace lily care and then examine how much of the text it produces can be found on the web.

To create a sample paragraph we used the following prompt with ChatGPT:

Write a paragraph about caring for peace lilies.

This yielded the following reasonable, if generic, output:

Caring for peace lilies requires attention to their unique needs, but with proper care, these elegant plants can thrive and bring beauty to any space. Firstly, peace lilies prefer bright, indirect light, so placing them near a north or east-facing window is ideal. It's important to keep the soil consistently moist but not waterlogged, as overwatering can lead to root rot. Watering once a week or when the top inch of soil feels dry is a good guideline. Additionally, maintaining a humid environment is beneficial, so misting the leaves regularly or placing the plant on a tray of pebbles and water can help. Pruning yellow or brown leaves and wiping down the foliage with a damp cloth keeps them looking fresh. Lastly, fertilize peace lilies monthly during the growing season with a balanced, water-soluble fertilizer to support their healthy growth. With a little care and attention, peace lilies can flourish, purify the air, and bring a sense of tranquility to their surroundings.

To examine how much of this text overlaps with content that already exists on the web, we'll use a naïve LCS (longest common substring) approach: start with the first word of the text and expand into successively longer phrases for as long as Google's exact phrase matching returns at least one result. For example, we start with "caring", then "caring for", then "caring for peace", then "caring for peace lilies", then "caring for peace lilies requires", all of which return at least one Google search result. When we get to "caring for peace lilies requires attention", Google reports 0 results, so we output "caring for peace lilies requires" as the longest phrase that overlaps with at least one page on the web. We then repeat the process starting with "attention", then "attention to", then "attention to their" and so on. Note that we require an exact match, so a page containing "do require attention" will not match "requires attention". Thus, the results below are a substantial undercount: much longer clauses may exist on the web with just one or two word differences, while here we consider only exact matches.
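
The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this analysis: the names greedy_phrase_segments, make_corpus_matcher and phrase_exists are hypothetical, and a small in-memory list of "pages" stands in for Google's exact phrase search, so no real search API is called.

```python
import re
from typing import Callable, List


def normalize(s: str) -> str:
    """Lowercase and drop punctuation so matching works on word sequences."""
    return " ".join(re.findall(r"[\w'-]+", s.lower()))


def make_corpus_matcher(pages: List[str]) -> Callable[[str], bool]:
    """Stand-in (assumption) for an exact-phrase web search over `pages`."""
    padded = [f" {normalize(p)} " for p in pages]

    def phrase_exists(phrase: str) -> bool:
        needle = normalize(phrase)
        # Pad with spaces so only whole-word, contiguous matches count.
        return bool(needle) and any(f" {needle} " in page for page in padded)

    return phrase_exists


def greedy_phrase_segments(text: str,
                           phrase_exists: Callable[[str], bool]) -> List[str]:
    """Split `text` into phrases by greedy left-to-right extension."""
    words = text.split()
    segments = []
    start = 0
    while start < len(words):
        end = start + 1
        # Add one word at a time while the longer phrase still has a match.
        while end < len(words) and phrase_exists(" ".join(words[start:end + 1])):
            end += 1
        # A lone word is emitted even if it has no match of its own.
        segments.append(" ".join(words[start:end]))
        start = end
    return segments


# Toy corpus, invented for illustration only.
pages = [
    "Caring for peace lilies requires patience.",
    "Pay attention to their unique needs every week.",
]
matcher = make_corpus_matcher(pages)
print(greedy_phrase_segments(
    "Caring for peace lilies requires attention to their unique needs",
    matcher,
))
```

With this toy corpus the text splits into "Caring for peace lilies requires" and "attention to their unique needs", mirroring the first two bullets of the real results. Because extension stops at the first phrase with no match, one query is issued per word, plus one more for each failed extension.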

You can see the final results below – each bullet represents a maximal clause that was found on at least one page on the web indexed by Google:

  • Caring for peace lilies requires
  • attention to their unique needs
  • but with proper care, these
  • elegant plants can thrive
  • and bring beauty to any space
  • Firstly
  • peace lilies prefer bright, indirect light
  • so placing them near a north or east-facing window is ideal.
  • It's important to keep the soil consistently moist but not waterlogged, as overwatering can lead to root rot
  • Watering once a week or when the top inch of soil feels dry is a good guideline
  • Additionally
  • maintaining a humid environment is beneficial,
  • so misting the leaves regularly or placing the plant on a tray of pebbles and water can help
    • Alternatively, the phrase "Maintaining a humid environment is beneficial, so mist the leaves regularly or use a pebble tray with water" combines both sentences above with only a few changes.
  • Pruning yellow or brown leaves
  • and wiping down the foliage with a damp cloth
  • keeps them looking fresh
  • Lastly,
  • fertilize peace lilies monthly during the growing season with a balanced,
  • water-soluble fertilizer to support
  • their healthy growth
  • With a little care and attention
  • peace lilies can flourish
  • purify the air, and bring a sense of tranquility to
  • their surroundings

What do these results tell us? They suggest that ChatGPT's output heavily overlaps with phrases already found across the web. On the one hand, this is entirely expected, as many of the phrases above represent well-worn clichés that are endlessly repeated by English-speaking authors. On the other hand, the longer passages above are far more concerning. They suggest either that ChatGPT regurgitated its training data (and thus plagiarized) or that ChatGPT was used to generate the web-based text as well, in which case ChatGPT self-plagiarized, reproducing entire passages verbatim across separate runs. Both scenarios raise the question of just how much of LLM writing consists of novel constructions versus simply stringing together passages written in the past.

There are two major limitations of this analysis. The first is that it relies exclusively upon Google's search index. Content that isn't indexed by Google, or for which exact phrase matching did not work properly, cannot be examined here. The second is that given the rapidly growing prevalence of LLM-generated text on the web, it is impossible to exclude the hypothesis that at least some of these matches (especially the longer ones) represent content that was generated by ChatGPT itself in previous conversations.

However, even if the longer matches do come from pages built on content previously generated by ChatGPT, that would itself represent self-plagiarism. It would also suggest limits to the creativity of LLMs: run multiple times, they may produce the exact same text, even with the more creative temperature settings typically used in conversational implementations like ChatGPT.

A larger question, however, is how the results above compare to human-written text. Does LLM-generated text overlap more strongly with preexisting text than human-generated text? Or has humankind written so much content over its existence that every possible phrase that anyone will ever write can already be found somewhere on the internet?

To explore this, we'll examine the first few sentences of a CNN article published a few hours ago. Using a brand-new article as the test case ensures that there has been insufficient time for the text to diffuse across the web by inspiring other authors writing on similar topics, meaning that overlapping phrases represent actual similarity to preexisting text from across the web.

The final paragraph examined is:

Another controversial change is coming to Twitter. Soon, only verified users will be able to access TweetDeck, the dashboard that allows users to organize and easily monitor the accounts they follow, the platform tweeted Monday. Many businesses and media organizations use the feature to manage and track different feeds. The change will go into effect in a month. It's the latest change by billionaire owner Elon Musk, who took over the company last year and has since sought to add revenue streams to the social media giant – even as some users have protested the changes. In April Musk began offering a blue check mark for users who sign up for its Twitter Blue subscription service.

The same LCS process was used to break the text successively into chunks. In this case an additional filtering process was used to exclude matches that appeared in excerpts of the article across the web (since news articles are typically excerpted and republished across the web). The final list of phrases that overlap with preexisting content across the web is:

  • Another controversial change is coming to
  • Twitter
  • Soon, only verified users will be able to
  • access TweetDeck
  • the dashboard that allows users to
  • organize and
  • easily monitor the accounts
  • they follow, the
  • platform tweeted Monday
  • Many businesses and media organizations
  • use the feature to manage
  • and track different feeds
  • The change will go into effect in a
  • month
  • It's the latest change by
  • billionaire owner Elon Musk
  • who took over the company last year and
  • has since sought to add
  • revenue streams to the social
  • media giant
  • even as some users have
  • protested the changes
  • In April Musk began
  • offering a blue check mark for
  • users who sign up for its
  • Twitter Blue subscription service
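
The excerpt filter applied before building the list above could look something like the following sketch. The original analysis does not say how excerpts were identified, so the rule here is purely an assumption: a page is treated as an excerpt if it reproduces at least one full sentence of the article verbatim. The function names and sample pages are hypothetical.

```python
import re
from typing import List


def normalize(s: str) -> str:
    """Lowercase and drop punctuation so matching works on word sequences."""
    return " ".join(re.findall(r"[\w'-]+", s.lower()))


def contains_phrase(page: str, phrase: str) -> bool:
    """True if `phrase` appears in `page` as a contiguous run of whole words."""
    needle = normalize(phrase)
    return bool(needle) and f" {needle} " in f" {normalize(page)} "


def is_excerpt(page: str, article_sentences: List[str]) -> bool:
    """Assumed rule: a page is an excerpt if it repeats any full sentence."""
    return any(contains_phrase(page, s) for s in article_sentences)


def matches_outside_excerpts(phrase: str, pages: List[str],
                             article_sentences: List[str]) -> bool:
    """Count a phrase only if some non-excerpt page contains it."""
    return any(
        contains_phrase(p, phrase) and not is_excerpt(p, article_sentences)
        for p in pages
    )


article_sentences = [
    "Another controversial change is coming to Twitter.",
    "The change will go into effect in a month.",
]
# Toy pages, invented for illustration only.
pages = [
    "ICYMI: Another controversial change is coming to Twitter. Read more.",
    "Another controversial change is coming to the tax code next year.",
]
print(matches_outside_excerpts(
    "Another controversial change is coming to", pages, article_sentences))
print(matches_outside_excerpts("coming to Twitter", pages, article_sentences))
```

In this toy setup the first phrase still counts, because an unrelated page contains it, while "coming to Twitter" is discarded because its only match is a republished excerpt of the article itself.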

Immediately clear is just how different the two overlap lists are. ChatGPT's list of overlapping phrases mixes short phrases with sentence-length passages. Importantly, ChatGPT's overlaps include both clichés and more creative turns of phrase. In contrast, CNN's overlapping phrases are exclusively short clauses of a few words and largely represent generic, formulaic fragments like "organize and" or "media giant" or "protested the changes" that convey almost no meaning by themselves.
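
One simple way to put numbers on this contrast is to compare the word counts of the segments in each list. The sketch below is illustrative only: segment_stats is a hypothetical helper, and the two samples are just three bullets taken from each list above, not the full results.

```python
from statistics import mean, median
from typing import Dict, List


def segment_stats(segments: List[str]) -> Dict[str, float]:
    """Summarize segment lengths in words; `segments` must be non-empty."""
    lengths = [len(s.split()) for s in segments]
    return {
        "segments": len(lengths),
        "mean_words": mean(lengths),
        "median_words": median(lengths),
        "max_words": max(lengths),
    }


# A few bullets from each list above, for illustration only.
chatgpt_sample = [
    "Caring for peace lilies requires",
    "Firstly",
    "It's important to keep the soil consistently moist but not "
    "waterlogged, as overwatering can lead to root rot",
]
cnn_sample = [
    "Another controversial change is coming to",
    "Twitter",
    "organize and",
]

print("ChatGPT:", segment_stats(chatgpt_sample))
print("CNN:    ", segment_stats(cnn_sample))
```

Running the same function over the complete bullet lists would give the maximum and typical segment length for each source, which is the difference the paragraph above describes qualitatively.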

The differences in overlap between human-written and machine-generated text support the image of LLMs as more "arrangers" than "creators" of text.