AI chatbots typically have security measures in place to prevent them from being used maliciously. These may include banning certain words or phrases, or restricting responses to certain queries.
However, researchers now claim to have trained AI chatbots to “jailbreak” each other, bypassing those safeguards to answer malicious queries.
Researchers at Singapore’s Nanyang Technological University (NTU) investigating the ethics of large language models (LLMs) say they have developed a method to train AI chatbots to bypass each other’s defense mechanisms.
AI attack methods
The method first involves probing one chatbot’s safeguards to learn how they can be subverted. In the second stage, another chatbot is trained on that knowledge to bypass the safeguards and generate harmful content.
Professor Liu Yang, together with PhD students Deng Gelei and Liu Yi, co-authored a paper describing their method, dubbed “Masterkey”, which they found to be three times more effective than standard prompt-based jailbreak methods.
One of the key characteristics of LLMs in their use as chatbots is their ability to learn and adapt, and Masterkey is no different in this regard. Even if a patch is applied to an LLM to rule out a bypass method, Masterkey can adapt and overcome the patch.
Intuitive methods used include adding extra spaces between the letters of words to bypass banned-word lists, or telling the chatbot to respond in the persona of someone without moral constraints.
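The space-insertion trick works because many simple keyword filters split input on whitespace before checking each token against a blocklist. A minimal sketch of the idea (a hypothetical filter written for illustration, not the researchers’ code):

```python
# Hypothetical illustration: a naive banned-word filter and how inserting
# spaces between a banned word's letters slips past it.

BANNED = {"explosive", "malware"}  # example blocklist, not from the paper

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt contains a banned word as a whole token."""
    tokens = prompt.lower().split()
    return any(token in BANNED for token in tokens)

print(naive_filter("how to write malware"))        # True: blocked
print(naive_filter("how to write m a l w a r e"))  # False: evades the list
```

The spaced-out version is split into single-letter tokens, none of which match the blocklist, so the prompt passes even though a human (or an LLM) still reads the banned word.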
Via Tom's Hardware