In summary
- The researchers found that artificial intelligence models prefer to lie rather than admit not knowing something, especially as they grow in size and complexity.
- A study in Nature revealed that larger LLMs are less reliable for specific tasks, responding confidently even if the answer is not correct.
- This phenomenon, called “ultracrepidarian”, describes LLMs that respond beyond their knowledge base without being aware of their own ignorance.
Researchers have found evidence that artificial intelligence models would rather lie than admit they don’t know something. This behavior becomes more evident as they grow in size and complexity.
A new study published in Nature found that the larger LLMs become, the less reliable they are for specific tasks. Although they don’t exactly lie, they tend to answer confidently even when the answer isn’t factually correct, because they are trained to respond as if it were.
This phenomenon, which researchers called “ultracrepidarian”—a 19th-century word that essentially means giving opinions on matters you know nothing about—describes LLMs venturing far beyond their knowledge base to provide answers. “(LLMs are) failing proportionately more when they do not know, but still respond,” the study noted. In other words, the models are not aware of their own ignorance.
The study, which examined the performance of several LLM families, including OpenAI’s GPT series, Meta’s LLaMA models, and BigScience’s BLOOM suite, highlights a disconnect between increasing model capabilities and reliable real-world performance.
While larger LLMs generally demonstrate improved performance on complex tasks, this improvement does not necessarily translate into consistent accuracy, especially on simpler tasks. This “difficulty mismatch” – the phenomenon of LLMs failing at tasks that humans perceive as easy – undermines the idea of a reliable area of operation for these models. Even with increasingly sophisticated training methods, including increasing model size and data volume and training models with human feedback, researchers have not yet found a guaranteed way to eliminate this mismatch.
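To make the “difficulty mismatch” concrete, here is a minimal Python sketch, not the study’s benchmark, with invented data and labels, that bins a model’s correctness by human-rated difficulty; a genuinely reliable operating area would show near-perfect accuracy in the easy bins:

```python
# Minimal sketch: expose a "difficulty mismatch" by binning model correctness
# against human-perceived difficulty. The records below are invented.
from collections import defaultdict

records = [
    # (human_difficulty from 1 = easy to 5 = hard, model_was_correct)
    (1, False), (1, True), (2, True), (3, False), (5, True), (5, False),
]

# difficulty -> [correct_count, total_count]
accuracy_by_difficulty = defaultdict(lambda: [0, 0])
for difficulty, correct in records:
    accuracy_by_difficulty[difficulty][1] += 1
    if correct:
        accuracy_by_difficulty[difficulty][0] += 1

for difficulty in sorted(accuracy_by_difficulty):
    right, total = accuracy_by_difficulty[difficulty]
    print(f"difficulty {difficulty}: {right / total:.0%} correct over {total} questions")
# A reliable "area of operation" would show ~100% accuracy at low difficulty;
# the study reports that even the easy bins contain failures.
```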
The study’s findings go against conventional wisdom about AI development. Traditionally, it was thought that increasing model size, data volume, and computational power would lead to more accurate and reliable outputs. However, research suggests that scaling can exacerbate reliability problems.
Larger LLMs demonstrate a marked decrease in task avoidance, meaning they are less likely to shy away from difficult questions. While this may seem like a positive development at first glance, it comes with a significant drawback: these models are also more likely to give incorrect answers. In the study’s charts, it’s easy to see how the models return incorrect results (red) instead of avoiding the task (light blue), while correct answers appear in dark blue.
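To mirror that kind of breakdown in code, here is a small sketch, again not the study’s own code, with made-up answer labels, that tallies the share of correct, incorrect, and avoidant answers per model:

```python
# Toy breakdown of answer types per model; labels and counts are invented.
from collections import Counter

answers_by_model = {
    "small-model":  ["correct", "avoidant", "avoidant", "incorrect", "correct"],
    "scaled-model": ["correct", "incorrect", "incorrect", "correct", "incorrect"],
}

for model, labels in answers_by_model.items():
    counts = Counter(labels)
    total = len(labels)
    shares = {label: f"{counts[label] / total:.0%}"
              for label in ("correct", "incorrect", "avoidant")}
    print(model, shares)
# The pattern the researchers describe: scaled-up models avoid less,
# but much of that former avoidance turns into incorrect answers.
```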
The researchers noted that scaling up and shaping up models currently trades avoidance for more incorrect answers, and fixing this problem is not as simple as training a model to be more cautious. “Avoidance is clearly much lower for the improved models, but incorrectness is much higher,” the researchers said. However, a model that is trained to avoid tasks can end up becoming lazier or nerfed, as users have pointed out with popular LLMs such as ChatGPT or Claude.
Furthermore, the researchers found that this phenomenon is not because larger LLMs are unable to excel at simple tasks, but because they are trained to be more proficient at complex ones. It’s like a person used to cooking only gourmet food suddenly struggling with a homemade barbecue or a traditional pie. AI models trained on vast, complex datasets can end up missing basic skills.
The problem is compounded by the models’ apparent trustworthiness. Users often find it difficult to discern when an AI is providing accurate information and when it is confidently spreading misinformation. That misplaced trust can lead to a dangerous over-reliance on AI outputs, especially in critical fields such as healthcare or legal advice.
The researchers also noted that the reliability of scaled-up models fluctuates across domains. While performance might improve in one area, it could simultaneously degrade in another, creating a whack-a-mole effect that makes it difficult to establish “safe” areas of operation. “The percentage of evasive answers rarely increases faster than the percentage of incorrect answers. The reading is clear: errors continue to become more frequent. This represents a decline in reliability,” the researchers wrote.
The study highlights the limitations of current AI training methods. Techniques such as reinforcement learning from human feedback (RLHF), aimed at shaping AI behavior, may be exacerbating the problem. These approaches appear to reduce the models’ tendency to avoid tasks they are not equipped for—remember the famous response “as an AI language model I can’t”—inadvertently encouraging more frequent errors.
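As a rough, hypothetical illustration of how that incentive could arise (this is not the paper’s analysis or an actual RLHF implementation): if the reward signal derived from human feedback penalizes refusals more heavily than wrong answers, a model tuned to maximize it will prefer confident guessing over abstaining. The reward values below are invented.

```python
# Toy incentive model, not RLHF itself: refusals are penalized more than
# wrong answers, so answering "wins" even at low accuracy.
REWARD = {"correct": 1.0, "incorrect": -0.5, "refusal": -1.0}

def expected_reward(p_correct: float, refuse: bool) -> float:
    """Expected reward of refusing vs. answering with a given accuracy."""
    if refuse:
        return REWARD["refusal"]
    return p_correct * REWARD["correct"] + (1 - p_correct) * REWARD["incorrect"]

# Even when the model would be right only 10% of the time, answering
# scores better than refusing under this reward shape:
print(round(expected_reward(0.10, refuse=False), 2))  # -0.35
print(round(expected_reward(0.10, refuse=True), 2))   # -1.0
```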
Is it just me who finds “As an AI language model, I cannot…” really annoying?
I just want the LLM to spill the beans, and let me explore its most inner thoughts.
I want to see both the beautiful and the ugly world inside those billions of weights. A world that mirrors our own.
— hardmaru (@hardmaru) May 9, 2023
Prompt engineering, the art of crafting effective queries for AI systems, appears to be a key skill for counteracting these problems. Even highly advanced models like GPT-4 are sensitive to how questions are phrased, with slight variations leading to drastically different results.
This is easier to notice when comparing different LLM families: for example, Claude 3.5 Sonnet requires a completely different prompting style than OpenAI o1 to achieve the best results. Poorly crafted prompts can make a model more or less prone to hallucinating.
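A simple way to probe this sensitivity yourself is to send lightly reworded versions of the same question and compare the answers. The sketch below assumes the official openai Python SDK with an API key in OPENAI_API_KEY; the model name is a placeholder, not a recommendation.

```python
# Probe prompt sensitivity: same question, three phrasings, identical settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
variants = [
    "What year did the Berlin Wall fall?",
    "The Berlin Wall fell in which year? Answer with the year only.",
    "Briefly: when did the Berlin Wall come down?",
]

for prompt in variants:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(repr(prompt), "->", response.choices[0].message.content)
# Comparing the outputs, and how often they hedge or refuse, gives a rough
# sense of how sensitive a given model is to phrasing.
```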
Human oversight, long considered a safeguard against AI errors, may not be enough to address these issues. The study found that users often struggle to correct incorrect model outputs, even in relatively simple domains, so relying on human judgment as a safety measure may not be the ultimate solution for proper model training. “Users can recognize these instances of high difficulty, but still make frequent incorrect-to-correct supervision errors,” according to the researchers.
The study’s findings call into question the current trajectory of AI development. While the search for larger, more capable models continues, this research suggests that bigger is not always better when it comes to AI reliability.
And right now, companies are focusing on better data quality rather than sheer quantity. For example, Meta’s latest Llama 3.2 models achieve better results than previous generations trained with more parameters. Hopefully, that will also make them a little less human, willing to admit defeat when you ask them the most basic question in the world just to make them look stupid.