About this website

A continuously updated working draft.

The Problem With Using Large Language Models for Medical Diagnosis

01.09.24, 18.05.25, 31.05.25, 15.06.25

Consider differential diagnosis as a fact-finding problem. If you give an LLM an exhaustive and accurate list of your symptoms, signs and test results (like the scenarios in a licensing exam), it will be able to make the correct diagnosis, maybe even more reliably than your human doctor. With tool calls and retrieval-augmented generation, it might even be able to quote up-to-date clinical guidelines. There are problems with overfitting and non-representative training sets, but it is generally true that large language models will outperform any doctor in a standardised written testing environment on broad, well-defined and long-standing medical knowledge, provided the model is appropriately trained and given access to functions that allow it to query up-to-date information (i.e. made agentic). Given the cost of running these models, compared to the cost of hiring and training human doctors, this seems amazing. So when do we see large language models and vision-language-model robots running clinics? Or an app on your phone that acts as your primary care physician? Before we get ahead of ourselves, let's assess the actual process of medical diagnosis and why benchmark performance masks real-world limitations.
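To make "tool calls and retrieval-augmented generation" concrete, here is a minimal Python sketch of what an agentic guideline lookup could look like. Everything named here - search_guidelines, GuidelineExcerpt, the llm_complete callable - is a hypothetical stand-in rather than a real API; the point is only that retrieved, current guidance is placed in context before the model is asked for a diagnosis.

    # Minimal sketch of an "agentic" diagnostic query: the model is given a tool
    # it can call to fetch up-to-date guidance instead of relying on stale training data.
    # search_guidelines and the llm_complete callable are hypothetical stand-ins.

    from dataclasses import dataclass

    @dataclass
    class GuidelineExcerpt:
        source: str      # e.g. a guideline body and publication year
        passage: str     # retrieved text the model can quote in its answer

    def search_guidelines(query: str) -> list[GuidelineExcerpt]:
        """Hypothetical retrieval step: look up current clinical guidance for a query."""
        # In a real system this would hit a maintained, versioned guideline index.
        return [GuidelineExcerpt(source="<guideline index>", passage="<retrieved text>")]

    def answer_with_retrieval(case_summary: str, llm_complete) -> str:
        """Put retrieved guidance into context before asking for a diagnosis."""
        excerpts = search_guidelines(case_summary)
        context = "\n".join(f"[{e.source}] {e.passage}" for e in excerpts)
        prompt = (
            "Relevant current guidance:\n" + context +
            "\n\nCase:\n" + case_summary +
            "\n\nGive a ranked differential diagnosis, citing the guidance above."
        )
        return llm_complete(prompt)  # llm_complete is any text-completion callable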

Current medicine is overwhelmingly centred on the diagnosis of disease: a human-defined classification of a biological state. Syndromes are also defined, and their diagnosis is often made clinically - by definition, a syndrome is a constellation of symptoms and signs for which we have been unable to find an underlying cause. Diseases, by contrast, have a defined pathophysiology, and even where the cause is idiopathic, it is the pathophysiology that dictates the intervention, pharmacology or protocol we prescribe.
The development of medical practice normally follows this pattern:

  1. Clustering together signs and symptoms
  2. Finding underlying pathological mechanisms and defining a disease - if we can't find a mechanism, we call it a syndrome
  3. Creating treatments for diseases based on our pathophysiological theory, and statistically validating those treatments through studies and trials. This applies to both curative and symptomatic treatment.

But not all treatments were created, nor all diseases defined, in this manner. Medical tradition predates the invention of the scientific method and evidence-based medicine; many treatments were known historically to be safe and effective without any full understanding of disease or treatment mechanisms. Evidence-based medicine has since rushed to re-evaluate and justify that practice. This is to say: current evidence-based practice rests on historical understanding, biochemistry, knowledge of human physiology, and the scientific method used to validate claims, justify use and create clinical practice guidelines - plus clinical expertise, in the form of human doctors' native pattern recognition and their subjective judgement of patient values and preferences.

Large language models seem proficient at pattern recognition and, in most instances, behave reasonably sensitively to human preferences. Not only that, they seem exceptionally well poised to ingest large amounts of medical literature (which the large model providers have already trained on), and reinforcement learning, in post-training and at inference time, seems like the perfect mechanism to improve their performance at making successful diagnoses. What seems to be the problem?

If all medical practice were scoped to taking clean patient information as input and outputting a diagnosis, we could train a great model that makes clinical diagnoses at home. The issue is acquiring that information and assembling it in context. Differential diagnosis is more akin to detective work than anything else. Your top differentials are your suspects, and you have to navigate lying witnesses, suspects masquerading as one another, limited detective-agency resources and limited facilities. Having a physical body is useful: you need to interact with and examine the suspects in three dimensions, lest you miss the knife in the back pocket that they forgot to tell you about. Witnesses might only tell the truth when spoken to in certain ways, many of them forget, and they need rigorous cross-examination.

A large language model excels at putting together a list of diagnoses after all the facts are laid bare. But I argue that is the easiest part of medicine: applying the definition of a disease. The hard part is getting the facts straight, in environments where time and resources are constrained, with people who have varying abilities to articulate what is wrong. Large language models are designed to finish text. They are good at answering questions and sustaining conversations when the user wants something from them. Differential diagnosis is a process; for a large language model, it's a journey fraught with semantic traps. People hope that, because these models are so incredibly capable, the rest will follow. I do believe, almost for certain, that human health management can eventually be handled by artificial intelligence: health is a fundamentally statistical process with a ground truth of underlying physical biology, and through systematic measurement and enquiry we can transform those objective conditions. But being impressive at the exam-style endpoint is not the same as being able to run the process.
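To make the contrast concrete, here is a toy Python sketch - with every name (choose_next_question, ask, rank) purely illustrative - of the difference between diagnosing from a complete fact sheet and diagnosing through sequential, budget-constrained enquiry. It is a sketch of the shape of the problem, not a proposed system.

    # Toy contrast between "all facts laid bare" and the actual diagnostic process:
    # information arrives only when you ask for it, each ask costs time or resources,
    # and answers can be incomplete or misleading. All names here are illustrative.

    def diagnose_from_complete_facts(facts: dict, rank) -> list[str]:
        """The exam-style setting: every symptom, sign and result is already in context."""
        return rank(facts)  # a single pass over clean, exhaustive input

    def diagnose_by_enquiry(initial_complaint: str, ask, rank, budget: int) -> list[str]:
        """The clinic-style setting: facts must be elicited under a limited budget,
        and each answer may be vague, mistaken, or simply absent."""
        facts = {"presenting_complaint": initial_complaint}
        differentials = rank(facts)
        for _ in range(budget):
            question = choose_next_question(differentials, facts)  # which fact discriminates best?
            if question is None:
                break
            facts[question] = ask(question)  # history, examination or a test - may be unreliable
            differentials = rank(facts)      # re-rank the suspects as evidence accumulates
        return differentials

    def choose_next_question(differentials, facts):
        """Placeholder policy: pick the unanswered question expected to separate the
        top differentials. Doing this well is the hard, interactive, physical part."""
        return None  # intentionally left trivial in this sketch

The hard part lives entirely inside choose_next_question and ask: deciding which fact to chase next and extracting it from an unreliable, physical world - which is exactly the part a text-completion model does not get for free.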

I haven't mentioned vision language models, which, if factored in, introduce a plethora of additional biases and opportunities for failure. They confidently overfit and lack physical intuition, and misinterpreting a physical sign (which can happen even to doctors working outside their speciality) can completely derail a diagnosis. And this still ignores the importance of physical examination by a skilled expert (self-examination is not useful for any non-trivial diagnosis).

There are certain devotees for whom artificial intelligence has taken on a religious element: blind faith that scaling laws and building systems with current architectures will lead to real adaptive intelligence. If just given enough faith. And money. There is reason to believe that future systems may possess more generalisable capabilities, but there is no straight shot from large language models to superintelligence.

This is why skepticism of AI in healthcare is prudent. AI is so broadly useful, and has become so much of a buzzword, that it's difficult not to believe in it. It's overwhelmingly likely that all these companies will be steamrolled in 10 years, because they've been taking the wrong approach. Also, unlike in other fields, selling real patient data as training data is dangerous, unprecedented and unethical. I propose that in 10 years patient data will have unique sovereignty, with cryptographic traceability of who has accessed it, when they accessed it, and why they accessed it. The upside potential of creating machine learning systems for medicine is enormous; large language models will remain useful and, more importantly, have shown that neural networks can reason and generalise.
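As a rough illustration of what "cryptographic traceability" could mean - an illustration only, not a proposed standard - here is a minimal Python sketch of an append-only, hash-chained access log recording who accessed a patient record, when, and why.

    # Minimal sketch of an append-only, hash-chained access log: each entry records
    # who accessed a patient record, when, and why, and commits to the previous entry
    # so that tampering with history is detectable. Illustrative only.

    import hashlib
    import json
    import time

    def append_access(log: list[dict], accessor_id: str, record_id: str, reason: str) -> list[dict]:
        prev_hash = log[-1]["entry_hash"] if log else "0" * 64
        entry = {
            "accessor_id": accessor_id,   # who accessed the data
            "record_id": record_id,       # which patient record
            "reason": reason,             # why it was accessed
            "timestamp": time.time(),     # when
            "prev_hash": prev_hash,       # link to the previous entry
        }
        entry["entry_hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        return log + [entry]

    def verify_chain(log: list[dict]) -> bool:
        """Recompute each hash and check the links; any edit to an earlier entry breaks the chain."""
        prev_hash = "0" * 64
        for entry in log:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True

    log = append_access([], "dr_smith", "patient_123", "review of referral")
    assert verify_chain(log)

Hash-chaining only makes tampering with the access history detectable; deciding who may access what in the first place is a separate, consent-level problem.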

There are three necessary steps to make this work:

  1. Giving people control of their health data
  2. Creating a product that people use to manage their health, so that they can grant the system permission to use their health data (a minimal sketch of such a grant follows this list)
  3. Re-defining the role of medical domain experts as tuning these systems and managing experiments
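As an illustration of steps 1 and 2 - with hypothetical field names and scopes, not a schema proposal - a patient-controlled consent grant might look something like this in Python:

    # Illustrative sketch of a patient-controlled consent grant: the patient, not the
    # product, owns the grant, scopes it to specific data and purposes, and can revoke it.
    # Field names and scopes are hypothetical.

    from dataclasses import dataclass, field
    from typing import Optional
    import time

    @dataclass
    class ConsentGrant:
        patient_id: str
        grantee: str                   # the system or product being given access
        scopes: list[str]              # e.g. ["medications", "lab_results"]
        purpose: str                   # why access is being granted
        granted_at: float = field(default_factory=time.time)
        expires_at: Optional[float] = None
        revoked_at: Optional[float] = None

        def permits(self, scope: str, now: Optional[float] = None) -> bool:
            """Access is allowed only while the grant is unrevoked, unexpired, and in scope."""
            now = now if now is not None else time.time()
            if self.revoked_at is not None and now >= self.revoked_at:
                return False
            if self.expires_at is not None and now >= self.expires_at:
                return False
            return scope in self.scopes

        def revoke(self) -> None:
            self.revoked_at = time.time()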