Growing reliance on artificial intelligence for medical guidance may not be improving people’s health decisions, according to new research published in Nature Medicine. The study suggests that consulting AI chatbots about symptoms offers no clear advantage over conventional methods such as internet searches or official health websites.

The findings come at a time when more people are turning to AI tools for quick health advice, often on the assumption that the answers are accurate and reliable. The researchers behind the study warned, however, that there is little evidence that AI-driven advice is currently a safer or more effective option for the public.

The research team, led by the Oxford Internet Institute at the University of Oxford and working with practising doctors, designed 10 hypothetical medical scenarios. These ranged from mild illnesses, such as the common cold, to severe and life-threatening conditions, including brain haemorrhages. In controlled tests without human users, three major large language models (OpenAI's GPT-4o, Meta's Llama 3, and Cohere's Command R+) performed strongly, correctly identifying the relevant medical conditions in nearly 95 per cent of cases. However, they selected the appropriate next step, such as seeking urgent medical care, in only about 56 per cent of instances. The companies involved did not respond to requests for comment.

To understand how AI performs in real-world use, the researchers then recruited 1,298 participants in Britain. Participants were given the scenarios and asked to assess the symptoms and decide what action to take, using either the AI tools or the methods they would normally rely on, such as their own judgment, internet searches, or trusted sources like the National Health Service website. The results showed little difference between the groups. Participants using AI correctly identified the relevant medical conditions in fewer than 35 per cent of cases and chose the correct course of action in fewer than 45 per cent, no better than those relying on traditional resources.

Adam Mahdi, a co-author of the study and an associate professor at Oxford, said the results highlighted a “huge gap” between what AI systems are capable of in theory and how they perform when used by people. While the underlying medical knowledge may exist within the models, he explained, that knowledge does not always translate effectively during real human interactions.

Closer analysis of selected cases revealed problems on both sides of the interaction. Participants often gave incomplete or inaccurate descriptions of their symptoms, while the AI systems sometimes produced misleading or incorrect advice. In one example, a participant who described classic symptoms of a subarachnoid haemorrhage, including a stiff neck, sensitivity to light and the “worst headache ever”, was correctly advised to go to hospital. Another participant with similar symptoms who instead called the headache merely “terrible” was advised to rest in a darkened room.

The researchers plan to extend the study across different countries, languages and time periods to assess whether these factors influence AI performance. The work was supported by data company Prolific, the German non-profit Dieter Schwarz Stiftung, and the governments of the United Kingdom and the United States.