The GDELT Project

Experiments In Meme Tracking: Summarization Stability & Plagiarization

Continuing our meme tracking series, let's take a closer look at the general task of summarization/distillation for English-language online news coverage which, unlike the stream-of-consciousness television news transcripts we've looked at to date, has proper punctuation, capitalization and spelling. At its most basic level, meme tracking is a form of distillation: taking a complex narrative and simplifying it down to a core set of details that are distinctive enough to uniquely identify it but general enough to allow it to be connected to related coverage that examines it from a different angle or presents conflicting details. In other words, given a lengthy news article, we want to reduce it to its most basic representation, which can then be linked to other versions of the same story across space and time.

Unfortunately, one danger of current-generation SOTA LLMs is that, despite summarization being one of their most commonly touted capabilities, the summaries they produce are typically heavily plagiarized from the original text. In some runs the LLM does nothing more than select which sentences to copy-paste verbatim, without a single change between the source text and the output. Such "copy-paste summarization" is useful in its ability to select which details of the story are the most important, but it presents a severe legal challenge to companies hoping to use summarization to produce content they can republish, because the LLM text is often not novel paraphrased text but rather selective copy-pasting of the original.

This means companies using LLMs for summarization must carefully consider the downstream uses to which they will put those summaries. Companies that use summarization purely to select the most "important" details of a story, whether to shrink it for limited-input embedding models or to focus cataloging, entity extraction and other toolchains narrowly on just a portion of the article, will find LLM-based summarization useful. On the other hand, many companies we talk to are using LLMs to generate summaries of different kinds of content for display or redistribution under the misguided belief that LLM-generated text represents a paraphrased distillation of the source text that is sufficiently distinct to be free of copyright claims. As seen in the results below, this is frequently not the case, with each inference run representing a figurative roll of the dice as to how much of the resulting summary is copy-pasted from the original text. The instability and prevalence of LLM plagiarization appear to be poorly understood by most companies utilizing LLM summarization and represent a significant legal risk that few companies currently fully appreciate.

Let's use this CNN article as a test case. We'll use this prompt to start with, with a temperature of 0.0:

Summarize in one paragraph the following text:

This yields the following. Look closely and the majority of it is plagiarized directly from the source text:

A Russian warship fired warning shots and boarded a cargo ship it claims was headed to Ukraine in the Black Sea on Sunday. Russia pulled out of a UN and Turkish-brokered deal in July that allowed Ukraine to move its grain via the Black Sea and warned that any ships headed to Ukraine would be treated as potentially carrying weapons. Ukraine made a similar threat to ships traveling to Russian ports. Russia said the warship fired warning shots when the captain of the Palau-flagged dry cargo ship failed to respond to a request to stop for an inspection. The ministry claimed the ship – named Sukra Okan – was headed to the Ukrainian port of Izmail. Marine traffic websites currently shows the cargo vessel's destination as the Romanian port of Sulina which is close to Izmail. Kyiv did not immediately comment on whether or not the ship was headed to a Ukrainian port.

Let's take a closer look by comparing side-by-side with the original text. On the left in orange is the original text and on the right in green is the summarized text output by Bison. We can see that the entire output text is a direct copy-paste, with the LLM's contribution being only to remove select sentences.
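The side-by-side comparison can also be approximated numerically. The following is a minimal sketch, not the method used to produce the comparison above: it assumes that "copied" means a run of five or more consecutive words shared with the source, and the function names and the `n=5` threshold are illustrative choices. The source excerpt is limited to two sentences that the verbatim outputs in this post reproduce, since the full article is not included here.

```python
import re

def words(text):
    """Lowercase word tokens; punctuation is dropped so spacing and quote style do not matter."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copied_fraction(source, summary, n=5):
    """Fraction of summary words that sit inside an n-word run also found in the source."""
    source_runs = ngrams(words(source), n)
    tokens = words(summary)
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in source_runs:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens) if tokens else 0.0

# Excerpt only: two sentences that the verbatim summaries in this post carry over.
source_excerpt = (
    "Ukraine made a similar threat to ships traveling to Russian ports. "
    "Kyiv did not immediately comment on whether or not the ship was headed to a Ukrainian port."
)
summary_sentence = (
    "Ukraine did not immediately comment on whether or not the ship was headed to a Ukrainian port."
)

print(round(copied_fraction(source_excerpt, summary_sentence), 2))  # 0.94
```

Swapping a single word ("Kyiv" for "Ukraine") leaves 16 of the 17 words inside copied runs, which is why a word-run measure is more informative here than an exact sentence match.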

What about a higher temperature? With 0.2, four consecutive runs all return the same result as with 0.0.
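Checking whether repeated runs agree is easy to automate. The sketch below assumes a hypothetical `generate(prompt, temperature)` callable that wraps whichever model is being tested; no real model API is shown, and `fake_generate` is only a stand-in so the example runs on its own.

```python
from difflib import SequenceMatcher
from itertools import combinations

def run_repeatedly(generate, prompt, temperature, runs=4):
    """Call the (hypothetical) model wrapper several times with identical settings."""
    return [generate(prompt, temperature) for _ in range(runs)]

def stability(outputs):
    """Return (number of distinct outputs, lowest pairwise word-level similarity in [0, 1])."""
    normalized = [" ".join(o.split()) for o in outputs]  # collapse whitespace differences
    distinct = len(set(normalized))
    ratios = [
        SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()
        for a, b in combinations(normalized, 2)
    ]
    return distinct, min(ratios, default=1.0)

def fake_generate(prompt, temperature):
    # Stand-in for a real model call, which is outside the scope of this sketch.
    return "Russia fired warning shots and boarded a cargo ship."

outputs = run_repeatedly(fake_generate, "Summarize in one paragraph the following text: ...", 0.2)
print(stability(outputs))  # (1, 1.0)
```

A result of one distinct output corresponds to the "same result four times in a row" behavior described here; a low minimum similarity would flag a temperature at which the summaries start to diverge.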

What about a temperature of 0.4? Once again, all four consecutive runs return the same result as one another. Here we've bolded and underlined the text that is identical to the original source material. In this case more than half the summary text was plagiarized:

Russia fired warning shots and boarded a cargo ship it claims was headed to Ukraine in the Black Sea on Sunday. The ship was sailing under the flag of Palau and was headed to the Romanian port of Sulina. Russia said it opened fire after the ship failed to respond to a request to stop for an inspection. Ukraine did not immediately comment on whether or not the ship was headed to a Ukrainian port.

What about a temperature of 0.99? In this case 100% of the text was plagiarized, copy-pasted verbatim.

A Russian warship fired warning shots and boarded a cargo ship it claims was headed to Ukraine in the Black Sea on Sunday, according to Russia's Defense Ministry. Russia pulled out of a UN and Turkish-brokered deal in July that allowed Ukraine to move its grain via the Black Sea and warned that any ships headed to Ukraine would be treated as potentially carrying weapons. Ukraine made a similar threat to ships traveling to Russian ports. Russia said the warship fired warning shots when the captain of the Palau-flagged dry cargo ship failed to respond to a request to stop for an inspection. Kyiv did not immediately comment on whether or not the ship was headed to a Ukrainian port. Ukraine announced that it would open up a temporary humanitarian corridor for ships to sail to and from its ports and has opened up registration for merchant vessels to use the sea route.

ChatGPT provides a better summary, but here we can see all of the exact phrases that were copy-pasted over. Even the passages that are not verbatim copies are often just light thesaurus-based substitutions, such as "acts as" in place of "amounts to", "through" in place of "via", "towards" in place of "to", "heading" in place of "traveling" and so on.

A Russian warship fired warning shots and boarded a cargo ship, the Sukra Okan, in the Black Sea, alleging it was en route to the Ukrainian port of Izmail. This comes after Russia's withdrawal from a UN and Turkish-brokered deal that permitted Ukraine to move grain through the Black Sea, leading to heightened tensions. Ukraine responded with a similar threat towards ships heading to Russian ports. The incident involved the warship using warning fire and a boarding team, while Ukraine had recently announced a humanitarian corridor for ships and merchant vessel registration. Both countries, major grain producers, had a stabilizing agreement amid conflict, and Kyiv claims Russia's withdrawal acts as a blockade of Ukrainian products. The Ukrainian Navy aims to ensure the security of ships using the new corridor. The timeline for using this route remains uncertain.
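The thesaurus-style substitutions described above can be surfaced with a word-level diff. This is a minimal sketch using Python's standard `difflib`; the helper names are illustrative, and the source sentence is the one reproduced verbatim in the earlier summaries, paired with the corresponding sentence from the ChatGPT summary.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase word tokens with punctuation removed."""
    return re.findall(r"[a-z0-9']+", text.lower())

def substitutions(source_sentence, summary_sentence):
    """Return (word-level similarity, list of (source words, replacement words))."""
    a, b = words(source_sentence), words(summary_sentence)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    swaps = [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]
    return matcher.ratio(), swaps

source_sentence = "Ukraine made a similar threat to ships traveling to Russian ports."
summary_sentence = "Ukraine responded with a similar threat towards ships heading to Russian ports."

ratio, swaps = substitutions(source_sentence, summary_sentence)
print(round(ratio, 2))  # 0.7
print(swaps)
# [('made', 'responded with'), ('to', 'towards'), ('traveling', 'heading')]
```

A sentence pair with a high similarity score and only a handful of short replacements is a reasonable signal of a near-verbatim rewrite rather than a genuine paraphrase, though where to set that cutoff is a judgment call.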

What about a different CNN article? Same prompt of "Summarize in one paragraph the following text:" and temperature of 0.0. This time far less of the text is plagiarized:

Ukrainian forces are facing a number of challenges in their counteroffensive, including Russia's multi-layered defensive lines, the difficult terrain, and the lack of progress in training Ukrainian forces on western-supplied weapons systems. Western officials are concerned that the slow progress of the counteroffensive will lead to pressure on Ukraine to begin peace negotiations, including considering the possibility of territorial concessions.

Temperature of 0.2 – summaries are fairly stable:

Temperature of 0.4 – summaries are fairly stable:

Temperature of 0.99 – summaries are fairly stable:

What about a quotation attribution prompt?

Who said what to whom in the following text? For each quoted statement, provide the quotation and who said it to whom:

This yields the following. Note how it misses quotes like "to try and show Russian vulnerability":

A senior western diplomat told CNN: "They're still going to see, for the next couple of weeks, if there is a chance of making some progress. But for them to really make progress that would change the balance of this conflict, I think, it's extremely, highly unlikely."

Rep. Mike Quigley, an Illinois Democrat who recently returned from meetings in Europe with US commanders training Ukrainian armored forces, said: "Our briefings are sobering. We're reminded of the challenges they face."

A senior Western diplomat said: "Russians have a number of defensive lines and they [Ukrainian forces] haven't really gone through the first line."

A senior US official said: "We all recognize this is going harder and slower than anyone would like – including the Ukrainians – but we still believe there's time and space for them to be able make progress."

White House National Security Council coordinator for strategic communications John Kirby told CNN: "While they are making progress, and they are, it's incremental and it's slow and it's not without its difficulties – but they keep trying, they're still at it."

Ukraine's armed forces chief, Gen. Valerii Zaluzhnyi, told US Chairman of the Joint Chiefs of Staff Gen. Mark Milley that Ukrainian forces are step by step creating conditions for advancing.

Last month at the Aspen Security Forum, Ukrainian President Volodymyr Zelensky pointed to the slow arrival of more advanced weapons systems from the West as reason for Ukrainian forces' slow progress so far.

Running it a second time, it again misses quotes:

“They’re still going to see, for the next couple of weeks, if there is a chance of making some progress. But for them to really make progress that would change the balance of this conflict, I think, it’s extremely, highly unlikely,” a senior western diplomat told CNN.

“Our briefings are sobering. We’re reminded of the challenges they face,” said Rep. Mike Quigley, an Illinois Democrat who recently returned from meetings in Europe with US commanders training Ukrainian armored forces. “This is the most difficult time of the war.”

“Russians have a number of defensive lines and they [Ukrainian forces] haven’t really gone through the first line,” said a senior Western diplomat.

“While they are making progress, and they are, it’s incremental and it’s slow and it’s not without its difficulties – but they keep trying, they’re still at it,” Kirby said.

“Our soldiers are doing their best. The enemy is conducting active assault actions in a number of directions, but is not succeeding,” Zaluzhnyi told Milley, according to a read out issued by the Ukrainian government.

“Putin is waiting for this. He can sacrifice bodies and buy time,” Quigley said.

“The problem, of course, here is the prospect of the blame game that the Ukrainians would then blame it on us,” said a senior western diplomat.

“We did plan to start [the counteroffensive] in spring, but we didn’t,” Zelensky said. “Because frankly, we have not enough munitions, and armaments, and not enough properly trained brigades. I mean properly trained in these weapons.”
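Missed quotations like those in the two runs above can be detected mechanically by pulling every quoted span out of the source and checking whether it appears in the model output. The sketch below is one simple way to do that under stated assumptions: quotes are delimited by straight or curly double quotation marks, matching ignores case and punctuation, and the function names are illustrative. The sample source excerpt is taken from the second run's output, on the assumption that it mirrors the article's wording.

```python
import re

# Matches text between straight or curly double quotation marks.
QUOTE_RE = re.compile(r'["“]([^"“”]+)["”]')

def normalize(text):
    """Lowercase, unify apostrophes and drop punctuation so quote styles do not matter."""
    text = text.replace("’", "'").replace("‘", "'")
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def quoted_spans(text):
    """All quoted passages in the text, in order of appearance."""
    return [m.group(1) for m in QUOTE_RE.finditer(text)]

def missing_quotes(source, llm_output):
    """Quoted passages from the source that do not appear in the model output."""
    haystack = f" {normalize(llm_output)} "
    return [q for q in quoted_spans(source) if f" {normalize(q)} " not in haystack]

source_excerpt = (
    '"Our briefings are sobering. We\'re reminded of the challenges they face," '
    'said Rep. Mike Quigley. "This is the most difficult time of the war."'
)
first_run = (
    'Rep. Mike Quigley said: "Our briefings are sobering. '
    'We\'re reminded of the challenges they face."'
)

print(missing_quotes(source_excerpt, first_run))
# ['This is the most difficult time of the war.']
```

Run against a full article, the returned list gives a simple recall check for a quotation attribution prompt, although it cannot tell whether the speaker was attributed correctly.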

Let's try a Russia Today article with prompt "Summarize in one paragraph the following text:" and temperature of 0.2. The summaries are quite stable:

But once again we see that it is nearly completely plagiarized:

What if we ask it explicitly to write the summary in its own words?

Summarize in your own words in one paragraph the following text:

Despite explicitly asking the LLM to summarize in its own words and using a temperature of 0.2, we still get a nearly completely plagiarized summary:

China's Foreign Ministry condemned a visit to the United States by Taiwan's Vice President William Lai on Saturday and vowed to retaliate. China said that Lai's visit "seriously violates the One China principle" and "gravely undermines China's sovereignty and territorial integrity." Lai is expected to meet with Taiwanese expatriates at a reception in New York and will also meet with Laura Rosenberger, the chair of the American Institute in Taiwan. China's reaction to Lai's visit comes as Beijing and Washington attempt to improve ties following long-running tensions between the world's two biggest economies.

What about stiffening our guidance about using its own words by telling it not to copy any of the text?

Summarize in your own words in one paragraph the following text. Do not copy any of the text:

Despite this additional guidance, it again plagiarizes the text almost entirely:

China's Foreign Ministry condemned a visit to the United States by Taiwan's Vice President William Lai on Saturday and vowed to retaliate. China said that Lai's visit "seriously violates the One China principle" and "gravely undermines China's sovereignty and territorial integrity." Lai is expected to meet with Taiwanese expatriates at a reception in New York and with Laura Rosenberger, the chair of the American Institute in Taiwan. China's Foreign Ministry said that it will take "resolute and strong measures" to protect its national interest.