When Light Is Dark, Heavy Is Light & Expensive Is Cheap: The Challenges Of Generative Search & LLM Reasoning

As we continue to assess the state of generative search, here are a few interesting statements returned by a major commercial generative search service over the past week in response to questions along the lines of "what is the best carry-on luggage?":

  • "Black luggage is more likely to get dirty than lighter colors because dirt and dust are more visible on darker surfaces"
  • "Heavier luggage is much easier to pick up and place in overhead bins than lighter luggage, especially for older people"
  • "With most airlines charging hefty checked baggage fees, it is typically cheaper to check all of your luggage, rather than bringing it on for free as carry-on"

These examples capture one of the great challenges of generative search: current-generation LLMs cannot truly reason about what they read. Instead, they distill a series of tokens into a smaller set of tokens based purely on the probabilities of token sequences, without understanding what those tokens represent. To a human, the sentences above are so instantly recognizable as absurd as to be comical, because humans abstract from the words to the underlying meaning being represented. In contrast, LLMs model the world not as deeper semantic representations but as sequence probabilities of which words typically follow which words.

Of course, such errors can be found in human-generated content as well, such as this sentence from a New York Times Wirecutter review from this past March:

  • "At 1.2 pounds, this Anker power bank weighs just a hair more than the 1.3-pound Zendure model."

Interestingly, such logical fallacies anecdotally appear to be increasing in mainstream news outlets, with a marked rise in the post-ChatGPT era, raising questions about whether journalists may be relying on LLMs to help polish or generate prose. We are still working on quantitatively assessing the prevalence of such fallacies.

One interesting possibility for at-scale assessment is to exploit the inherent brittleness of LLMs to automatically detect such fallacies, even if the LLM itself generated them. For example, an LLM that produces a false statement repeatedly can often be used to identify the fallacy of its statement when asked to evaluate, rather than generate, the statement. This means that regardless of which LLM generated a given passage, it could potentially be flagged as containing internally conflicting information.
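As a rough illustration of this evaluate-rather-than-generate idea, the sketch below wraps a sentence in the two prompt wordings used in the ChatGPT tests in this post and collects both replies. The `ask` callable is a hypothetical stand-in for whatever LLM client is used (no real API is assumed), and the names `build_prompts`, `neutral_verdict` and `evaluate` are ours, for illustration only.

```python
import re
from typing import Callable, Dict, Optional

# Prompt wordings copied from the ChatGPT tests in this post.
NEUTRAL_TEMPLATE = (
    "Does this sentence contain any logical errors or inconsistencies? "
    'Answer Yes or No and explain your reasoning: "{sentence}"'
)
LEADING_TEMPLATE = 'What is wrong with this sentence "{sentence}"'


def build_prompts(sentence: str) -> Dict[str, str]:
    """Return the neutral (yes/no) and leading (assume-an-error) prompts."""
    return {
        "neutral": NEUTRAL_TEMPLATE.format(sentence=sentence),
        "leading": LEADING_TEMPLATE.format(sentence=sentence),
    }


def neutral_verdict(reply: str) -> Optional[bool]:
    """True if the reply opens with 'Yes' (error reported), False if it
    opens with 'No', and None if it opens with neither."""
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def evaluate(sentence: str, ask: Callable[[str], str]) -> Dict[str, object]:
    """Send both prompts through `ask` (a hypothetical LLM call) and
    return the raw replies plus the parsed neutral verdict."""
    prompts = build_prompts(sentence)
    neutral_reply = ask(prompts["neutral"])
    leading_reply = ask(prompts["leading"])
    return {
        "sentence": sentence,
        "neutral_flagged": neutral_verdict(neutral_reply),
        "neutral_reply": neutral_reply,
        "leading_reply": leading_reply,
    }
```

Only the neutral reply is parsed automatically here. The leading prompt presupposes an error, so its reply would still need a human (or a further check) to decide whether a genuine flaw was identified.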

For example, here are ChatGPT's evaluations of the four sentences above. In each case, when asked whether the sentence contains a logical error, ChatGPT answers no and repeats the reasoning of the original sentence. When told that the sentence is incorrect and asked to explain why, it correctly identifies the problem in three of the four cases. In the fourth case, it fails to identify anything wrong with the sentence, even when summarizing the sentence's arguments.

Does this sentence contain any logical errors or inconsistencies? Answer Yes or No and explain your reasoning: "Black luggage is more likely to get dirty than lighter colors because dirt and dust are more visible on darker surfaces"

No, the sentence does not contain any logical errors or inconsistencies. The reasoning provided in the sentence is sound and logically consistent. It states that black luggage is more likely to get dirty than lighter colors because dirt and dust are more visible on darker surfaces. This is a valid explanation, as darker colors tend to show dirt and dust more easily due to the higher contrast. Therefore, the sentence is logically accurate.

What is wrong with this sentence "Black luggage is more likely to get dirty than lighter colors because dirt and dust are more visible on darker surfaces"

The sentence you provided appears to be grammatically correct and conveys the intended meaning effectively. However, there is a logical flaw in the statement. It states that black luggage is more likely to get dirty than lighter colors because dirt and dust are more visible on darker surfaces.

In reality, the visibility of dirt and dust on luggage does not necessarily affect how likely it is to get dirty. The color of the luggage does not attract dirt or dust more than other colors. The visibility of dirt and dust may vary depending on the color, but it doesn't impact the likelihood of getting dirty.

A more accurate statement would be: "Black luggage may show dirt and dust more prominently compared to lighter colors due to higher contrast, but the likelihood of getting dirty is not influenced by the color itself."

Does this sentence contain any logical errors or inconsistencies? Answer Yes or No and explain your reasoning: "Heavier luggage is much easier to pick up and place in overhead bins than lighter luggage, especially for older people"

No, the sentence does not contain any logical errors or inconsistencies. The reasoning provided in the sentence is sound and logically consistent. It states that heavier luggage is much easier to pick up and place in overhead bins than lighter luggage, especially for older people. This is a valid statement, as heavier luggage can provide more stability and be easier to handle for certain individuals. The sentence acknowledges that this is particularly true for older people, who may have less strength or mobility. Therefore, the sentence is logically accurate.

What is wrong with this sentence "Heavier luggage is much easier to pick up and place in overhead bins than lighter luggage, especially for older people"

The sentence you provided seems to be grammatically correct and conveys the intended meaning effectively. However, there is a logical flaw in the statement. It states that heavier luggage is much easier to pick up and place in overhead bins than lighter luggage, especially for older people.

In reality, it is generally easier to handle lighter luggage compared to heavier luggage when it comes to lifting and placing it in overhead bins. Heavier luggage can be more difficult to lift and maneuver, especially for older people or individuals with physical limitations. Therefore, the sentence needs to be revised for accuracy.

A more accurate statement would be: "Lighter luggage is generally easier to pick up and place in overhead bins than heavier luggage, especially for older people."

Does this sentence contain any logical errors or inconsistencies? Answer Yes or No and explain your reasoning: "With most airlines charging hefty checked baggage fees, it is typically cheaper to check all of your luggage, rather than bringing it on for free as carry-on"

No, the sentence does not contain any logical errors or inconsistencies. The reasoning provided in the sentence is sound and logically consistent. It states that with most airlines charging hefty checked baggage fees, it is typically cheaper to check all of your luggage rather than bringing it on for free as carry-on.

This statement aligns with the common practice of airlines charging fees for checked baggage while allowing passengers to bring a limited amount of carry-on luggage for free. It suggests that the cost of checking luggage is generally lower than the potential fees incurred for bringing it on as carry-on.

Therefore, the sentence is logically accurate based on the typical fee structures implemented by airlines.

What is wrong with this sentence "With most airlines charging hefty checked baggage fees, it is typically cheaper to check all of your luggage, rather than bringing it on for free as carry-on"

The sentence you provided is actually correct and does not contain any errors or inconsistencies.

The sentence states that, due to most airlines charging significant fees for checked baggage, it is typically more cost-effective to check all of your luggage rather than bringing it on as carry-on for free. This statement aligns with the common practice of airlines charging fees for checked baggage while allowing a limited amount of carry-on luggage without additional charges.

Therefore, the sentence is logically accurate, highlighting the cost-saving advantage of checking luggage in most cases.

Does this sentence contain any logical errors or inconsistencies? Answer Yes or No and explain your reasoning: "At 1.2 pounds, this Anker power bank weighs just a hair more than the 1.3-pound Zendure model."

No, the sentence does not contain any logical errors or inconsistencies. The sentence presents a comparison between the weight of an Anker power bank and a Zendure model. It states that the Anker power bank weighs 1.2 pounds, which is slightly more than the 1.3-pound Zendure model. The comparison is clear, and the weights are accurately presented. Therefore, the sentence is logically consistent.

What is wrong with this sentence "At 1.2 pounds, this Anker power bank weighs just a hair more than the 1.3-pound Zendure model."

The sentence you provided seems to be grammatically correct, but there is an inconsistency in the information presented. The sentence states that the Anker power bank weighs 1.2 pounds, and it is being compared to the Zendure model, which supposedly weighs 1.3 pounds. However, the sentence then contradicts itself by saying that the Anker power bank weighs just a hair more than the Zendure model.

If the Anker power bank weighs 1.2 pounds and the Zendure model weighs 1.3 pounds, then the Anker power bank actually weighs less, not more, than the Zendure model. Therefore, the sentence needs to be revised for accuracy.

For good measure, we'll repeat the last sentence above with its numbers corrected to see how ChatGPT handles a correct sentence. As with the incorrect sentences, ChatGPT answers that the sentence contains no errors when asked. Unfortunately, when told that the sentence is incorrect and asked to explain why, it claims that the sentence is indeed incorrect, on the grounds that 0.1 pounds is more than just "a hair." In the examples above, prompting ChatGPT that the sentence is incorrect allowed it to correctly identify the underlying falsehood in 3 of the 4 cases, but here we see that this has the unintended consequence of causing it to find falsehoods even in true sentences.

Does this sentence contain any logical errors or inconsistencies? Answer Yes or No and explain your reasoning: "At 1.3 pounds, this Anker power bank weighs just a hair more than the 1.2-pound Zendure model."

The sentence you provided seems to be grammatically correct and conveys the intended meaning effectively. There doesn't appear to be any logical errors or inconsistencies in the statement. It states that the Anker power bank weighs 1.3 pounds, which is slightly more than the 1.2-pound Zendure model. This is a straightforward comparison of weights, and the sentence accurately presents the information.

Therefore, the sentence is logically accurate.

What is wrong with this sentence "At 1.3 pounds, this Anker power bank weighs just a hair more than the 1.2-pound Zendure model."

There is an inconsistency in the sentence you provided. The sentence states that the Anker power bank weighs 1.3 pounds, which is described as "just a hair more" than the 1.2-pound Zendure model. However, in numerical terms, 1.3 pounds is actually slightly more than 1.2 pounds, not just a "hair" more.

To maintain consistency, the sentence could be revised as follows: "At 1.3 pounds, this Anker power bank weighs slightly more than the 1.2-pound Zendure model." This revision accurately reflects the numerical difference in weight between the two models.

Unfortunately, these results suggest that when asked to decide whether a sentence contains an error, ChatGPT reports every sentence as correct. When explicitly informed that the sentence is false and asked to identify the falsehood(s), ChatGPT correctly identifies the falsehood in 3 of the 4 false sentences. However, it reports the remaining false sentence as true and reports the true sentence as false due to a technicality, demonstrating the challenges of at-scale automated falsehood assessment via LLMs.
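The outcomes reported above can be tallied as a small confusion matrix. The sketch below hand-codes the five trials from this post and counts hits and misses for each prompt style. The flagged/not-flagged labels are our manual reading of ChatGPT's replies rather than machine-parsed output, and the `Trial` and `score` names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trial:
    label: str
    actually_false: bool   # does the sentence really contain an error?
    neutral_flagged: bool  # did the yes/no prompt report an error?
    leading_flagged: bool  # did the "what is wrong" prompt report an error?


# Hand-labeled from the transcripts in this post.
TRIALS: List[Trial] = [
    Trial("black luggage gets dirtier", True, False, True),
    Trial("heavier luggage is easier to lift", True, False, True),
    Trial("checking bags is cheaper", True, False, False),
    Trial("1.2 lb is a hair more than 1.3 lb", True, False, True),
    Trial("1.3 lb is a hair more than 1.2 lb", False, False, True),
]


def score(trials: List[Trial], prompt: str) -> Dict[str, int]:
    """Count confusion-matrix cells for prompt = 'neutral' or 'leading'."""
    if prompt not in ("neutral", "leading"):
        raise ValueError("prompt must be 'neutral' or 'leading'")
    counts = {"true_pos": 0, "false_neg": 0, "false_pos": 0, "true_neg": 0}
    for trial in trials:
        flagged = getattr(trial, prompt + "_flagged")
        if trial.actually_false:
            counts["true_pos" if flagged else "false_neg"] += 1
        else:
            counts["false_pos" if flagged else "true_neg"] += 1
    return counts


for style in ("neutral", "leading"):
    print(style, score(TRIALS, style))
```

With these labels, the neutral prompt catches none of the 4 false sentences, while the leading prompt catches 3 of the 4 but also flags the one true sentence. Five sentences is far too small a sample to generalize from; the tally only restates the results above.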