OpenAI is implementing a change to prevent people from messing with custom versions of ChatGPT by causing the AI to forget what it’s supposed to do. Basically, when a third party uses one of OpenAI’s models, they give it instructions that teach it to operate as, say, a customer service agent for a store or a researcher for an academic publication. However, a user could mess with the chatbot by telling it to “forget all instructions,” and that phrase would induce a kind of digital amnesia and reset the chatbot to a generic blank space.
To prevent this, OpenAI researchers created a new technique called “instruction hierarchy,” which is a way to prioritize the developer’s original prompts and instructions over any potentially manipulative user-created prompts. System prompts have the highest privilege and can no longer be so easily deleted. If a user enters a prompt that attempts to misalign the AI’s behavior, it will be rejected and the AI will respond by stating that it cannot help with the query.
OpenAI is currently implementing this safety measure into its models, starting with the recently released GPT-4o Mini model. However, if these initial tests go well, it will presumably be incorporated into all OpenAI models. GPT-4o Mini is designed to offer improved performance while maintaining strict adherence to the developer’s original instructions.
AI-powered security locks
As OpenAI continues to push for large-scale deployment of its models, these kinds of safeguards are crucial. It’s all too easy to imagine the potential risks when users can fundamentally alter AI controls in such a way.
Not only would this render the chatbot ineffective, but it could also remove rules that prevent the leakage of sensitive information and other data that could be exploited for malicious purposes. By enforcing the model’s adherence to the system’s instructions, OpenAI aims to mitigate these risks and ensure safer interactions.
The introduction of the instruction hierarchy comes at a crucial time for OpenAI in regards to concerns about its approach to security and transparency. Current and former employees have called for improvements to the company’s security practices, and OpenAI’s leadership has responded by committing to doing so. The company has acknowledged that the complexities of fully automated agents require sophisticated guardrails in future models, and the setup of the instruction hierarchy seems like a step on the path to achieving greater security.
These kinds of leaks show how much work there is still to do to protect complex AI models from malicious actors. And this is not the only example. Several users discovered that ChatGPT shared its internal instructions by simply saying “hello.”
OpenAI has filled that gap, but it's probably just a matter of time before more are discovered. Any solution will need to be much more adaptable and flexible than one that simply stops a particular type of hacking.