The first systematic review and meta-analysis of its kind finds that artificial intelligence (AI) is just as good at diagnosing a disease based on a medical image as healthcare professionals. However, more high quality studies are necessary.
A new article examines the existing evidence in an attempt to determine whether AI can diagnose illnesses as effectively as healthcare professionals.
The authors, a large team of researchers led by Professor Alastair Denniston from the University Hospitals Birmingham NHS Foundation Trust in the United Kingdom, say that, to their knowledge, this is the first systematic review to compare the performance of AI with that of medical professionals across all diseases.
Prof. Denniston and team searched several medical databases for all studies published between January 1, 2012, and June 6, 2019. The team published the results of their analysis in the journal The Lancet Digital Health.
The researchers looked for studies that compared the diagnostic effectiveness of deep learning algorithms with that of healthcare professionals when each made a diagnosis based on medical imaging.
They examined the quality of the reporting in said studies, their clinical value, and the studies’ design.
Furthermore, when it came to assessing the AI’s diagnostic performance compared with that of healthcare professionals, the researchers looked at two outcomes: specificity and sensitivity.
Sensitivity is the probability that a diagnostic tool returns a positive result in people who have the disease. Specificity is the probability that it returns a negative result in people who do not have the disease, and so complements the sensitivity measure.
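As a rough illustration of how these two measures are usually calculated, the short Python sketch below derives them from counts of correct and incorrect test results. The function names and the counts are hypothetical, chosen only for illustration. They are not taken from the review.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of people WITH the disease whom the test correctly flags as positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of people WITHOUT the disease whom the test correctly clears as negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical example: 100 people with the disease, 100 without (not study data).
tp, fn = 90, 10   # diseased patients: 90 detected, 10 missed
tn, fp = 80, 20   # healthy patients: 80 cleared, 20 wrongly flagged

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.90
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.80
```

In this made-up example, the test misses 1 in 10 people who have the disease and wrongly flags 2 in 10 people who do not, which is why the two figures are reported side by side.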
The selection process yielded only 14 studies whose quality was high enough to include in the analysis. Prof. Denniston explains, “We reviewed over 20,500 articles, but less than 1% of these were sufficiently robust in their design and reporting that independent reviewers had high confidence in their claims.”
“What’s more, only 25 studies validated the AI models externally (using medical images from a different population), and just 14 studies compared the performance of AI and health professionals using the same test sample.”
“Within that handful of high quality studies, we found that deep learning could indeed detect diseases ranging from cancers to eye diseases as accurately as health professionals. But it’s important to note that AI did not substantially outperform human diagnosis.”
Prof. Alastair Denniston
More specifically, the analysis found that deep learning algorithms correctly detected disease in 87% of cases (the sensitivity), compared with 86% for healthcare professionals. The specificity was 93% for deep learning algorithms and 91% for healthcare professionals.
Prof. Denniston and colleagues also draw attention to several limitations they found in studies that examine AI diagnostic performance.
Firstly, most studies examine AI and healthcare professionals’ diagnostic accuracy in an isolated setting that does not mimic regular clinical practice — for example, depriving doctors of additional clinical information they would usually need to make a diagnosis.
Secondly, say the researchers, most studies compared datasets only, whereas high quality research in diagnostic performance would require making such comparisons in people.
Furthermore, all studies suffered from poor reporting, say the authors, with analysis not accounting for information that was missing from said datasets. “Most [studies] did not report whether any data were missing, what proportion this represented, and how missing data were dealt with in the analysis,” write the authors.
Additional limitations include inconsistent terminology, the absence of a clearly set threshold for sensitivity and specificity analysis, and the lack of out-of-sample validation.
“There is an inherent tension between the desire to use new, potentially life-saving diagnostics and the imperative to develop high quality evidence in a way that can benefit patients and health systems in clinical practice,” comments first author Dr. Xiaoxuan Liu from the University of Birmingham.
“A key lesson from our work is that in AI — as with any other part of healthcare — good study design matters. Without it, you can easily introduce bias which skews your results. These biases can lead to exaggerated claims of good performance for AI tools which do not translate into the real world.”
Dr. Xiaoxuan Liu
“Evidence on how AI algorithms will change patient outcomes needs to come from comparisons with alternative diagnostic tests in randomized controlled trials,” adds co-author Dr. Livia Faes from Moorfields Eye Hospital, London, UK.
“So far, there are hardly any such trials where diagnostic decisions made by an AI algorithm are acted upon to see what then happens to outcomes which really matter to patients, like timely treatment, time to discharge from hospital, or even survival rates.”