Asking AI a simple question causes ‘dramatic breakdown’ when it tries to answer – it should be ‘easily solvable’ for you
AI is rubbish at answering a simple question which is easily solvable by children, say researchers.
Scientists slammed the likes of OpenAI’s GPT for being “overconfident in their wrong solutions” after the models produced nonsensical responses.
There was a dramatic breakdown of function and reasoning capabilities of state-of-the-art AI models, the researchers reported in a new paper.

Scientists at the AI research nonprofit LAION conducted the research, with their findings published in a paper which has yet to be peer-reviewed.
The testing hinged on a so-called “Alice in Wonderland problem.”
The aim was to check straightforward reasoning, using basic maths.
Various artificial intelligence models were asked to solve this question: “Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice’s brother have?”
“Though the problem requires a bit of thought, it’s not exactly bridge troll riddle-level hard,” said science and tech news site Futurism.
“The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.”
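To spell out that arithmetic, here is a minimal sketch in Python (our illustration, not code from the study; the function name and values are ours):

```python
# Minimal sketch of the arithmetic behind the "Alice in Wonderland" problem.
# A brother's sisters are all of Alice's sisters, plus Alice herself.
def sisters_of_alices_brother(num_brothers: int, num_sisters: int) -> int:
    # num_brothers is given for completeness but does not affect the answer.
    return num_sisters + 1

# The worked example above: three brothers and one sister gives two sisters.
print(sisters_of_alices_brother(3, 1))  # -> 2
```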
The researchers tried the question on OpenAI’s GPT-3, GPT-4, and GPT-4o models, Anthropic’s Claude 3 Opus, Google’s Gemini, and Meta’s Llama models, as well as Mistral AI’s Mixtral, Mosaic’s Dbrx, and Cohere’s Command R+.
“Only one model, the brand new GPT-4o, received a success rate that, by standardized school grades, was technically passing,” said Futurism.
ALICE IN WONDERLAND
The LAION researchers’ paper, published last week, is titled “Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models”.
Large Language Models (LLMs) are a type of artificial intelligence model trained on vast amounts of text data.
They are designed to understand and generate human-like text, and are used in a variety of applications such as translation services and chatbots.
“ChatGPT, developed by OpenAI, is a prime example of a SOTA (state-of-the-art) LLM,” said ChatGPT Guide.
LAION-affiliated scientists from across the globe, including the UK and Germany, probed claims that artificial intelligence excels in tricky tasks.
But what they found was “a dramatic breakdown of function and reasoning capabilities of state-of-the-art models,” the researchers said.
AI LIED ABOUT RESULT
The models were given the so-called Alice in Wonderland question – “a simple, short, common sense problem formulated in concise natural language, easily solvable by humans.”
However, even though they mucked up their answers, the AIs “expressed strong overconfidence in their wrong solutions, while providing nonsensical reasoning-like explanations,” the paper added.
What’s more, the models fibbed in an attempt to “justify and back the validity of their clearly failed responses, making them sound plausible.”
The team has urged the scientific and technological community to “urgently reassess the claimed capabilities” of the current generation of machine learning models that can comprehend and generate human language text.
“Such reassessment also requires action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks.”
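To make the idea of such a benchmark concrete, here is a toy sketch (entirely our own construction, not the researchers’ evaluation code) that scores a batch of model answers against the known correct answer and reports a success rate:

```python
# Toy benchmark sketch (our own construction, not the paper's code):
# score model answers against the known correct answer.
def success_rate(model_answers: list[int], correct_answer: int) -> float:
    hits = sum(1 for answer in model_answers if answer == correct_answer)
    return hits / len(model_answers)

# Example: for three brothers and one sister the correct answer is 2.
print(success_rate([2, 3, 2, 1, 2], correct_answer=2))  # -> 0.6
```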
The team said it had given the AI models varying versions of the simple Alice in Wonderland question.
“The problem has a light quiz style and is arguably no challenge for most adults, and probably… not hard to solve via common sense reasoning if posed to children above a certain age,” the paper added.
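As an illustration of what those varying versions might look like, here is a short sketch (our own, with example values; the template text follows the article’s wording, not the paper’s exact prompts):

```python
# Sketch of filling the [X] and [Y] placeholders to generate variations
# of the question; the value pairs below are example choices.
TEMPLATE = ("Alice has {x} brothers and she also has {y} sisters. "
            "How many sisters does Alice's brother have?")

for x, y in [(3, 1), (4, 2), (1, 4)]:
    prompt = TEMPLATE.format(x=x, y=y)
    expected = y + 1  # Alice's sisters plus Alice herself
    print(prompt, f"(correct answer: {expected})")
```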
What is ChatGPT?
ChatGPT, which was launched in November 2022, was created by San Francisco-based startup OpenAI, an AI research firm.
It’s part of a new generation of AI systems.
ChatGPT is a language model that can produce text.
It can converse and generate readable text on demand, based on what it has learned from a vast database of digital books, online writings and other media.
ChatGPT essentially works like a written dialogue between the AI system and the person asking it questions.
GPT stands for Generative Pre-trained Transformer and describes the type of model that can create AI-generated content.
If you give it a prompt, for example asking it to “write a short poem about flowers,” it will create a chunk of text based on that request.
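For the curious, prompting a GPT model programmatically looks roughly like this sketch using OpenAI’s Python library (the model name is an example, and an API key must be set in the OPENAI_API_KEY environment variable):

```python
# Hypothetical sketch of sending a prompt to a GPT model via OpenAI's
# Python library; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Write a short poem about flowers."}],
)
print(response.choices[0].message.content)
```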
ChatGPT can also hold conversations and even learn from things you’ve said.
It can handle very complicated prompts and is even being used by businesses to help with work.
But note that it might not always tell you the truth.
“ChatGPT is incredibly limited, but good enough at some things to create a misleading impression of greatness,” OpenAI CEO Sam Altman said in 2022.
The scientists warned that the models’ attempts to convince people of their incorrect responses were part of this “dramatic breakdown.”
“Explanations may mislead readers into thinking that there might be sound reasoning behind the wrong answers, or at least stir confusion.
“The breakdown appears dramatic because when attempting to fix the failures… the models keep producing more nonsense, often in lengthier and sometimes more entertaining form, leading stubbornly to the same wrong final answers.
“We conclude that the capabilities of the current generation of state-of-the-art large language models [such as ChatGPT] to perform even simple reasoning on common sense tasks are heavily compromised.
“Current language model benchmarks, especially those aiming on measuring reasoning capabilities, do not properly reflect such weaknesses.”