Popular AI tools like GPT-4 generate fluent, human-like text and perform so well on a variety of linguistic tasks that it is becoming increasingly difficult to tell whether you are conversing with a human or a machine.
This scenario mirrors Alan Turing's famous thought experiment, in which he proposed a test to assess whether a machine could exhibit behavior so human-like that a human judge could no longer reliably distinguish between human and machine based solely on their responses.
The Department of Cognitive Sciences at UC San Diego decided to find out how modern AI systems fare and evaluated ELIZA (a simple rules-based chatbot from the 1960s, included as a benchmark), GPT-3.5, and GPT-4 in a controlled Turing test. Participants held a five-minute conversation with either a human or an AI and then had to decide whether their interlocutor was human.
No better than chance
GPT-4 and GPT-3.5 were given a prompt telling them how to respond to messages. The researchers instructed the models to "adopt the personality of a young man who was very terse, did not take the game very seriously, used slang, and occasionally made spelling errors." The prompt also contained "general information about the game setup and information about recent news that occurred after the model's training cutoff." Both models were accessed via the OpenAI API with the temperature set to 1, and to prevent the AI from responding implausibly quickly, their messages were sent with a delay proportional to the number of characters in the message.
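As a rough illustration of that setup, the sketch below shows how one might query the OpenAI API with a persona prompt, a temperature of 1, and a typing delay tied to response length. The model name, the exact prompt wording, and the delay factor are assumptions for illustration, not values taken from the paper.

```python
import time
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical system prompt approximating the persona described in the study.
PERSONA_PROMPT = (
    "Adopt the personality of a young man who is very terse, doesn't take "
    "the game seriously, uses slang, and occasionally makes spelling errors."
)

SECONDS_PER_CHAR = 0.03  # illustrative delay factor, not from the paper


def reply(history: list[dict]) -> str:
    """Generate a reply, then wait in proportion to its length before returning it."""
    response = client.chat.completions.create(
        model="gpt-4",            # GPT-3.5 would use e.g. "gpt-3.5-turbo"
        temperature=1,            # sampling setting reported in the study
        messages=[{"role": "system", "content": PERSONA_PROMPT}, *history],
    )
    text = response.choices[0].message.content
    time.sleep(len(text) * SECONDS_PER_CHAR)  # simulate human typing speed
    return text


if __name__ == "__main__":
    print(reply([{"role": "user", "content": "hey, how's it going?"}]))
```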
In the study, 500 participants were assigned to one of five groups. They engaged in a conversation with either a human or one of the three AI systems, using a game interface that looked like a typical messaging app. After five minutes, participants judged whether they believed their interlocutor was human or an AI and gave reasons for their decisions.
The results were interesting. GPT-4 was judged to be human 54% of the time, ahead of GPT-3.5 (50%); both significantly outperformed ELIZA (22%) but fell behind real humans (67%). In other words, participants identified GPT-4 as an AI no better than chance, indicating that current AI systems can fool people into believing they are human.
Analysis of the results showed that interrogators often relied on linguistic style, socio-emotional factors, and knowledge-based questions to decide whether they were talking to a human or a machine.
Test details and full results can be found in the article published on the arXiv preprint server.