How well can AI chatbots mimic doctors in a treatment setting?


Dr. Scott Gottlieb is a physician and served as the 23rd commissioner of the U.S. Food and Drug Administration. He is a contributor to CNBC and serves on the boards of Pfizer and several other healthcare and technology companies. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior research associate at the American Enterprise Institute and a former associate producer for CBS News’ Face the Nation.

Many consumers and medical providers are turning to chatbots powered by large language models to answer medical questions and help inform treatment decisions. We decided to see whether there were any meaningful differences among the leading platforms when it came to their clinical suitability.

To obtain a medical license in the United States, aspiring physicians must successfully pass three stages of the United States Medical Licensing Examination (USMLE), with the third and final stage widely considered to be the most difficult. It requires candidates to answer about 60% of the questions correctly, and historically the average passing score was around 75%.

When we put the leading large language models (LLMs) through the same Step 3 test, they performed remarkably well, achieving scores that outperformed those of many clinicians.

But there were some clear differences between the models.

The USMLE Step 3 exam, typically taken after the first year of residency, measures whether medical graduates can apply their knowledge of clinical science to unsupervised medical practice. It assesses a new physician's ability to manage patient care across a wide range of medical disciplines and includes multiple-choice questions and computer-based case simulations.

We isolated 50 questions from the 2023 USMLE Step 3 sample test to assess the clinical competence of five different large language models, feeding the same set of questions to each of these platforms: ChatGPT, Claude, Google Gemini, Grok and HuggingChat (Llama).
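
For readers curious how such a head-to-head comparison can be run, the sketch below shows one minimal way to score several chat models against a shared answer key. It is illustrative only: the per-platform callables, the question format and the letter-matching grader are hypothetical stand-ins rather than our actual workflow, and grading of USMLE-style items can just as well be done by hand against the published key.

    # Minimal sketch of a head-to-head evaluation harness (hypothetical, not the
    # authors' protocol): feed one question set to several models and score each.
    import re
    from typing import Callable, Dict, List, Optional

    # Each item pairs a multiple-choice stem with its keyed answer letter,
    # e.g. {"prompt": "<clinical stem and options>", "answer": "C"}.
    Question = Dict[str, str]

    def extract_choice(reply: str) -> Optional[str]:
        """Pull the first standalone answer letter (A-E) out of a model's reply."""
        match = re.search(r"\b([A-E])\b", reply.upper())
        return match.group(1) if match else None

    def score_model(ask: Callable[[str], str], questions: List[Question]) -> float:
        """Send every question to one model and return the fraction answered correctly."""
        correct = sum(1 for q in questions if extract_choice(ask(q["prompt"])) == q["answer"])
        return correct / len(questions)

    def run_head_to_head(models: Dict[str, Callable[[str], str]], questions: List[Question]) -> None:
        """Feed the identical question set to every platform and print each score."""
        for name, ask in models.items():
            print(f"{name}: {100 * score_model(ask, questions):.0f}%")

    if __name__ == "__main__":
        # Dummy callable used only to show the harness running end to end; real
        # callables would wrap each vendor's chat API and return the text reply.
        sample = [{"prompt": "Sample clinical stem ... Reply with a single letter A-E.", "answer": "B"}]
        run_head_to_head({"demo-model": lambda prompt: "B"}, sample)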

Other studies have evaluated the medical efficacy of these models, but to our knowledge, this is the first time these five leading platforms have been compared in a head-to-head evaluation. These results may offer consumers and providers some insight into where they should turn.

Here's how they scored:

  • ChatGPT-4o (OpenAI): 49/50 (98%)
  • Claude 3.5 (Anthropic): 45/50 (90%)
  • Gemini Advanced (Google): 43/50 (86%)
  • Grok (xAI): 42/50 (84%)
  • HuggingChat (Llama): 33/50 (66%)

In our experiment, OpenAI’s ChatGPT-4o emerged as the best performer, scoring 98%. It provided detailed medical analysis and used language reminiscent of that of a medical professional. Not only did it provide answers with extensive reasoning, but it also contextualized its decision-making process and explained why alternative answers were less appropriate.

Anthropic’s Claude came in second with a score of 90%. It offered more human-like responses, using simpler language and a bullet-point structure that could be more accessible to patients. Gemini, which scored 86%, gave responses that weren’t as thorough as ChatGPT’s or Claude’s, making its reasoning harder to decipher, but its answers were concise and direct.

Grok, Elon Musk's xAI chatbot, scored a respectable 84%, but it didn't provide descriptive reasoning during our analysis, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source website built on Meta’s Llama, scored the lowest at 66%, but it showed sound reasoning on the questions it answered correctly, providing concise answers and links to its sources.

One question that most of the models got wrong concerned a hypothetical 75-year-old woman with heart disease. The question asked what the most appropriate next step in her evaluation would be. Claude was the only model that generated the correct answer.

Another notable question focused on a 20-year-old patient presenting with symptoms of a sexually transmitted infection. It asked physicians which of five options was the appropriate next step in the patient’s assessment. ChatGPT correctly determined that the patient should be scheduled for an HIV serology test in three months, but the model went further, recommending a follow-up exam in one week to ensure that the patient’s symptoms had resolved and that the antibiotics covered his strain of infection. For us, the response highlighted the model’s capacity for broader reasoning, expanding beyond the multiple-choice options presented by the exam.

These models were not designed for medical reasoning; they are products of the consumer technology sector, designed to perform tasks such as language translation and content generation. Despite their non-medical origins, they have shown a surprising aptitude for clinical reasoning.

New platforms are being created specifically to solve medical problems. Google recently introduced Med-Gemini, a refined version of its previous Gemini models, optimized for medical applications and equipped with web-based search capabilities to improve clinical reasoning.

As these models evolve, their ability to analyze complex medical data, diagnose diseases and recommend treatments will only sharpen. They can offer a level of accuracy and consistency that human providers, limited by fatigue and error, may sometimes struggle to match, paving the way for a future in which treatment portals could be driven by machines rather than doctors.
