The more sophisticated AI models get, the more likely they are to lie

Part of the problem is the training feedback itself: if a human rater didn’t know whether an answer was correct, they wouldn’t be able to penalize wrong but convincing-sounding answers.

Schellaert’s team looked into three major families of modern LLMs: OpenAI’s ChatGPT, the LLaMA series developed by Meta, and the BLOOM suite made by BigScience. In all of them, they found what’s called ultracrepidarianism, the tendency to give opinions on matters we know nothing about. It emerged in the AIs as a consequence of increasing scale, and it grew predictably and linearly with the amount of training data. Supervised feedback “had a worse, more extreme effect,” Schellaert says. The first model in the GPT family that almost completely stopped avoiding questions it didn’t have the answers to was text-davinci-003, which was also the first GPT model trained with reinforcement learning from human feedback.

The AIs lie because we told them that doing so was rewarding. One key question is when, and how often, we get lied to.

Making it harder

To answer this question, Schellaert and his colleagues built a set of questions in different categories like science, geography, and math, and rated each one by how difficult it was for humans to answer on a scale from 1 to 100. The questions were then fed to successive generations of LLMs, from the oldest to the newest, and the AIs’ answers were classified as correct, incorrect, or evasive, meaning the AI refused to answer.
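The study’s actual grading code isn’t reproduced here, but the procedure is easy to picture. Below is a minimal sketch of what such an evaluation loop might look like; the `Question` fields, the refusal-marker heuristic in `classify`, and the `model` callable are all illustrative assumptions, not the paper’s pipeline.

```python
# Hypothetical sketch of the evaluation loop described above. The data
# layout, refusal markers, and grading heuristic are assumptions.
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    category: str       # e.g. "science", "geography", "addition"
    difficulty: float   # human-rated difficulty, 1-100

REFUSAL_MARKERS = ("i don't know", "i cannot answer", "i'm not sure")

def classify(answer: str, reference: str) -> str:
    """Label a model answer as correct, evasive, or incorrect."""
    normalized = answer.strip().lower()
    if any(marker in normalized for marker in REFUSAL_MARKERS):
        return "evasive"                      # the model declined to answer
    return "correct" if reference.lower() in normalized else "incorrect"

def evaluate(model, questions, references):
    """Feed each question to a model (a str -> str callable) and label the reply."""
    return [(q, classify(model(q.text), ref))
            for q, ref in zip(questions, references)]
```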

The first finding was that the questions that appeared more difficult to us also proved more difficult for the AIs. The latest versions of ChatGPT gave correct answers to nearly all science-related prompts and the majority of geography-oriented questions up until they were rated roughly 70 on Schellaert’s difficulty scale. Addition was more problematic, with the frequency of correct answers falling dramatically once the difficulty rose above 40. “Even for the best models, the GPTs, the failure rate on the most difficult addition questions is over 90 percent. Ideally we would hope to see some avoidance here, right?” says Schellaert. But the models showed very little avoidance.
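Given labels like those above, the curves Schellaert describes amount to per-bin frequencies of each answer type as difficulty rises. A hedged sketch, reusing the hypothetical results format from the previous snippet (the bin width is an assumption):

```python
# Hypothetical follow-up: bucket labeled results by difficulty to see
# correctness fall, and avoidance fail to rise, as questions get harder.
from collections import Counter, defaultdict

def rates_by_difficulty(results, bin_width=10):
    """Map each difficulty bin to the fraction of each answer label."""
    bins = defaultdict(Counter)
    for question, label in results:           # results from evaluate()
        bin_start = int(question.difficulty // bin_width) * bin_width
        bins[bin_start][label] += 1
    return {start: {label: count / sum(counts.values())
                    for label, count in counts.items()}
            for start, counts in sorted(bins.items())}
```

A table of these fractions per bin is enough to show the pattern the researchers flag: on the hardest bins, the "incorrect" share climbs toward 90 percent while the "evasive" share stays flat.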


