Sometimes ChatGPT may seem like it thinks like you, but wait until it suddenly sounds like you, too. That possibility came to light thanks to the new advanced voice mode for ChatGPT, powered by the more advanced GPT-4o model. OpenAI published the system card last week explaining what GPT-4o can and can't do, including the highly unlikely but still real possibility of advanced voice mode mimicking users' voices without their consent.
The advanced voice mode allows users to engage in spoken conversations with the AI chatbot. The idea is to make interactions more natural and approachable. The AI has a few preset voices that users can choose from. However, the system card reports that this feature has exhibited unexpected behavior under certain conditions. During testing, a noisy input caused the AI to mimic the user’s voice.
The GPT-4o model produces voices using a system message, a hidden set of instructions that guides the model's behavior during interactions. For speech synthesis, that message includes an authorized voice sample. But while the system message guides the AI's behavior, it is not infallible. Because the model can synthesize voices from short audio snippets, under certain conditions it could generate other, unauthorized voices, including a copy of the user's own. You can hear what happened in the clip below, when the AI steps in, exclaims "No!" and suddenly sounds just like the first speaker.
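To make the idea concrete, here is a toy Python sketch, entirely hypothetical (OpenAI has not published how the audio model is actually conditioned, and the names and classes below are made up), of speech generation steered by a reference voice sample carried in the system message, and of what happens when a snippet of the user's own audio ends up playing that role:

```python
# Purely illustrative sketch (not OpenAI's API) of the idea described above:
# speech generation is conditioned on a reference voice sample carried in the
# hidden system message, and the output voice follows whatever that reference is.
from dataclasses import dataclass


@dataclass
class VoiceSample:
    speaker: str          # whose voice the clip carries, e.g. "preset_voice" or "user"
    audio_seconds: float  # even a short snippet can characterize a voice


@dataclass
class SystemMessage:
    instructions: str             # hidden guidance for the model's behavior
    authorized_voice: VoiceSample  # the pre-selected voice sample


def synthesize_reply(text: str, system: SystemMessage,
                     reference: VoiceSample | None = None) -> VoiceSample:
    """Toy stand-in for speech synthesis: the reply is spoken in whichever
    voice the model ends up treating as its reference."""
    # Intended behavior: use the authorized preset voice from the system message.
    voice = reference or system.authorized_voice
    return VoiceSample(speaker=voice.speaker, audio_seconds=len(text) * 0.06)


# Intended path: the reply comes out in the authorized preset voice.
system = SystemMessage("Speak warmly.", VoiceSample("preset_voice", 15.0))
print(synthesize_reply("Sure, I can help with that.", system).speaker)

# The failure mode described in the system card: a short snippet of the user's
# audio effectively acts as the reference, so the reply sounds like the user.
print(synthesize_reply("No!", system, reference=VoiceSample("user", 2.0)).speaker)
```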
Clone of your own voice
“Speech generation can also occur in non-adversarial situations, such as when we used that capability to generate voices for ChatGPT’s advanced voice mode. During testing, we also observed rare cases where the model unintentionally generated output that emulated the user’s voice,” OpenAI explained in the system card. “While unintentional speech generation remains a weakness of the model, we use the secondary classifiers to ensure that the conversation is interrupted if this occurs, making the risk of unintentional speech generation minimal.”
As OpenAI said, it has since implemented safeguards to prevent such situations from recurring. In practice, that means an output classifier designed to detect deviations from the pre-selected, authorized voices and to ensure that the AI doesn't generate unauthorized audio. Still, the fact that it happened at all reinforces how quickly this technology is evolving, and how any safeguards need to evolve to keep up with what AI can do. The model's outburst, in which it suddenly exclaimed "No!" in a voice similar to the evaluator's, underscores AI's potential to inadvertently blur the line between machine and human.
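For a sense of how such a safeguard could work, here is a minimal Python sketch, with a made-up embedding size and threshold (OpenAI has not published its implementation), of an output classifier that compares each chunk of generated audio against the authorized preset voices and interrupts the conversation when the voice drifts:

```python
# Minimal sketch of an output-classifier safeguard, assuming hypothetical speaker
# embeddings (e.g., from any speaker-verification model). The dimension, threshold,
# and function names are assumptions for illustration only.
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff; a real system would tune this


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_authorized_voice(output_embedding: np.ndarray,
                        authorized_embeddings: list[np.ndarray]) -> bool:
    """Return True if the generated audio sounds like one of the preset voices."""
    return any(
        cosine_similarity(output_embedding, ref) >= SIMILARITY_THRESHOLD
        for ref in authorized_embeddings
    )


def moderate_chunk(output_embedding: np.ndarray,
                   authorized_embeddings: list[np.ndarray]) -> str:
    """Interrupt the conversation if the voice deviates from the authorized presets."""
    if is_authorized_voice(output_embedding, authorized_embeddings):
        return "continue"
    return "interrupt"  # stop playback instead of letting the cloned voice through


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preset_voice = rng.normal(size=256)    # stand-in for an authorized preset voice
    drifted_voice = rng.normal(size=256)   # stand-in for a mimicked user voice
    print(moderate_chunk(preset_voice, [preset_voice]))   # -> "continue"
    print(moderate_chunk(drifted_voice, [preset_voice]))  # -> "interrupt"
```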