Last year the California Health Care Foundation (CHCF) sponsored a $100,000 competition using the Kaggle data science community to find the best algorithms, or computer programs, for detecting diabetic retinal disease from digital images. In all, 661 teams of engineers and researchers throughout the world submitted 6,999 algorithms. By the fourth month of the competition, algorithms were matching the diagnostic performance of humans. After six months of competition, Professor Ben Graham of the University of Warwick in the United Kingdom was declared the winner. Graham had developed an algorithm that performed "better" than typical human graders, with an accuracy score of 86%. I will explain why "better" is in quotes and what this means for actual clinical care of diabetic eye disease, but first let's examine the rationale for the competition.
Diabetic eye disease, specifically diabetic retinopathy (DR), is the leading cause of permanent blindness in the working-age population. It afflicts more than 93 million people worldwide, and in the US alone it causes blindness in about 24,000 people each year. This is tragic because it's preventable. The blinding effects of DR can be averted with a relatively inexpensive laser and injection treatment that is 90% to 95% effective in preventing vision impairment when performed in a timely manner, according to the US Centers for Disease Control and Prevention.
Too many patients, however, don't receive timely screening. Even in advanced stages, DR is usually asymptomatic, causing many with sight-threatening diabetic retinal disease to miss the opportunity for effective treatment.
In 2005, CHCF launched a program to detect sight-threatening diabetic retinal disease early enough to avoid vision impairment. The foundation worked with the UC Berkeley Optometric Eye Center and EyePACS, a web-based system for providing diagnostic eye care services, to develop a screening program in the Central Valley that placed digital retinal cameras in primary care clinics.
In this program, community health workers — usually medical assistants — photograph the retinas of diabetes patients and transmit the images to a remote panel of licensed ophthalmologists and optometrists for analysis. Within 24 hours, consultants provide a report to the primary care providers indicating the level of DR and other eye diseases. The program has grown considerably, and retinal cameras can be found at hundreds of community clinics throughout the US. The EyePACS database currently has more than a million retinal images from more than 375,000 patient encounters.
While the program has improved access to retinal exams, a nagging question remains as to whether it has decreased vision impairment from DR. While cameras over the years have become easier to use, the consultations still require humans to grade the images at a remote location. A delay of even one hour to receive the results means that patients will no longer be at the clinic to learn that they have sight-threatening disease. They need to be contacted and to return for follow-up instructions for obtaining additional care. This is a bigger hurdle than one might think.
In 2009, CHCF commissioned Robert Quade, PhD, to find out what happens to patients with sight-threatening DR who are referred to specialists by their primary care physicians. He found that only 23% of 288 patients with advanced retinal disease from four high-performing clinics ever made it to ophthalmology care. Patients fell out at every step of the process: 15% never learned about their disease, another 15% did not receive an appointment, 22% did not attend their appointments, and 25% opted out of treatment. Algorithms that help a computer immediately read retinal images might improve diabetic eye care outcomes by providing rapid results to community health workers and clinicians for triage and referral to specialists.
A Different Approach to Diagnosis
For the past 20 years, hundreds of research laboratories throughout the world have been developing algorithms for DR detection and grading. Algorithms were usually designed and tested against established training sets totaling about 5,000 images from many nations, including France, Finland, the US, and the Netherlands. Unfortunately, these algorithms have not performed well in real-life settings. Clinicians could not rely on them for accuracy: the test sets were small; the images came from racially homogeneous groups of patients, whose retinas differ in pigmentation from those of other racial groups; and images captured by technicians in busy clinics varied far more in quality than those from controlled settings.
The Kaggle competition took a different approach, using an EyePACS collection of 100,000 images of 50,000 patients generated by community clinic screening sites. The images were captured by a wide variety of retinal cameras and focused on a diverse population of diabetic patients.
Although the images in the Kaggle set and the human consultants' diagnoses are sometimes imperfect, the large number of pictures analyzed allows for the development of accurate algorithms. In other words, the fuzziness of the data is offset by the quantity of data. Images were graded on a five-level severity scale (zero is no disease and four is high-risk, sight-threatening disease). The competition set was then uploaded to the Kaggle site, and contestants could download half of the set with the consultants' grades in order to train their algorithms; the other half was used to test the algorithms. Submissions were scored on the quadratic weighted kappa, which measures the agreement between two ratings (here, the contestant's algorithm and the human grade for each image). This metric typically ranges from 0 (chance-level agreement between raters) to 1 (complete agreement between raters).
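For readers curious about the scoring, the quadratic weighted kappa can be computed in a few lines. The sketch below is an illustrative implementation (the function name and the example grades are mine, not from the competition): it builds the observed confusion matrix between two raters, the matrix expected by chance from each rater's marginal grade counts, and a quadratic penalty that punishes a grade-0-versus-grade-4 disagreement far more than a one-level slip.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Agreement between two raters on an ordinal 0..num_levels-1 scale.

    Returns ~0 for chance-level agreement and 1 for complete agreement;
    large disagreements are penalized quadratically.
    """
    rater_a = np.asarray(rater_a)
    rater_b = np.asarray(rater_b)

    # Observed matrix O[i, j]: how often rater A said i while rater B said j.
    observed = np.zeros((num_levels, num_levels))
    for a, b in zip(rater_a, rater_b):
        observed[a, b] += 1

    # Expected matrix E under independence: outer product of the two
    # raters' grade histograms, normalized to the number of images.
    expected = np.outer(
        np.bincount(rater_a, minlength=num_levels),
        np.bincount(rater_b, minlength=num_levels),
    ) / len(rater_a)

    # Quadratic penalty weights: w[i, j] = (i - j)^2 / (num_levels - 1)^2,
    # so the diagonal (exact agreement) costs nothing.
    grades = np.arange(num_levels)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (num_levels - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# A one-level slip on the five-level DR scale barely dents the score,
# which is why small grading variations between humans still score ~0.80.
human = [0, 0, 1, 2, 3, 4, 4, 0]
algo  = [0, 0, 1, 2, 4, 4, 4, 0]
print(quadratic_weighted_kappa(human, algo))
```

Identical gradings score exactly 1.0, while an algorithm that systematically confuses "no disease" with "sight-threatening disease" can score at or below zero.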
When well-trained humans are compared to each other, the agreement score for grading retinal images is usually around 80%, which is considered excellent because small variations in grades don't usually affect how a patient is managed. That level was reached about four months into the competition, and by six months at least a dozen teams had surpassed it. Graham, who achieved an impressive 86% score, had not worked with retinal images before the competition. His winning submission used "spatially sparse convolutional neural networks," a form of "deep learning." The other high-scoring contestants also used deep learning, and the competition was very close. You can see more details on the Kaggle retinopathy competition website. The competition rules require the algorithms to be open-sourced, so they are freely available to the world.
The Limits of an App
The remarkable results of the Kaggle competition show that computers can indeed detect sight-threatening diabetic retinopathy "better" than typical human graders, but this is only a start toward technology and systems that actually get people into treatment early enough for it to be most effective. Perhaps that will be the goal of another grand technology challenge, but it may also be the reason why we still need humans in the workflow.
Will these algorithms replace humans? Will the algorithms truly help prevent blindness? Many, if not most, patients seek treatment for diabetic eye disease only after it has worsened to a level where the effectiveness of treatment is reduced by half. Many wait too long even when they have been informed of their sight-threatening condition. We should not expect that an app that simply shows a green light for good retinas and a red light for bad retinas will make much difference in preventing blindness from diabetes.
Patients still must be advised, educated, and cajoled to take charge of their health care and overcome fear, mistrust, misinformation, cost, and other obstacles to seeking care. Unfortunately, there isn't an app for that — at least not yet!