From in-silico to in-patient: deploying machine learning models in the real world
The AI in Medicine (AIM) group at the Hospital for Sick Children (SickKids) is an interdisciplinary team that seeks ways to facilitate clinical and research use of AI for the benefit of our patients.
The growth of machine learning (ML) research in the healthcare setting has been astounding. The annual number of PubMed papers using the words “machine learning” or “artificial intelligence” has been doubling every two years since 2015. There are now dozens of annual conferences and high-profile workshops in the field, covering topics from drug discovery to personalized medicine. Algorithms that aim to diagnose pneumonia and diabetic retinopathy purportedly perform at or above the level of human experts. These research results have even prompted some thought leaders to express doubt about the future need for radiologists.
Despite the flurry of research headlines and media stories, if you asked a patient today how AI impacts their care, they would probably be left scratching their head. This is not surprising: there were only 29 FDA-approved AI/ML-based medical technologies as of March 2020. There will always be a gap between research and translation in any field, yet the discrepancy seems unreasonably large in healthcare-based ML.
To better understand the challenges and successes of translating ML systems into clinical practice, the Hospital for Sick Children (SickKids) and the Vector Institute for Artificial Intelligence (Vector) hosted the Vector-SickKids Health AI Deployment Symposium on October 30–31, 2019. The symposium was attended by 166 computer scientists, healthcare professionals, and other stakeholders from 10 North American institutions including St. Michael’s Hospital, the University of Michigan, Johns Hopkins, and Kaiser Permanente. The white paper from the event is available here.
Three key themes emerged at the symposium as being essential for translational success:
- Contextualization
- Life-cycle planning
- Stakeholder involvement
In the rest of this post, we will review these essential ideas by drawing on the speakers’ thoughts and our group’s own experience.
(1) No algorithm is an island entire of itself [Contextualization]
Medicine is an ancient profession, and doctors and nurses have developed sophisticated ways of diagnosing and treating disease. Simply handing them the naive output of standard classification or regression models will fail to meet their needs; useful ML systems must be able to answer their nuanced questions:
- “Which patients are likely to develop sepsis that I don’t already suspect of being at risk?”
- “How confident are you that a relapse will occur by this time?”
ML systems that offer context-free predictions will struggle to translate to the bedside. One speaker at the symposium explicitly noted that clinicians are often interested in context rather than explainability. Successful projects, like SepsisWatch, were those that were carefully calibrated and integrated into existing clinical workflows.
One puzzle in medical research is why models often fail to transfer across institutions, even at a purely technical level. This phenomenon has been demonstrated in both imaging and clinical contexts. The answer is likely a combination of factors: differences in clinical practice, variations in medical devices, intra-operator variability, and so on. Yet even within a single institution, algorithms will often fail to provide timely, relevant, or detailed information in a way that is clinically usable.
At our own hospital, AIM has been able to link model output explicitly to clinical practice. Predicting patient volumes (census) is useful both for staff planning and for preparing hospital space. In the COVID-19 era, strict capacity constraints exist because of the limited number of isolation beds in the emergency department (ED). By training probabilistic models, we can map a continuous distribution of patient volumes to an explicit probability that different capacity thresholds will be reached. The value of our ED census tool is not that it has a low mean-squared error, but that it provides an answer relevant to our hospital’s clinical context.
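To illustrate the idea, the sketch below converts a probabilistic census forecast into the probability that a bed threshold is reached. It is a minimal sketch rather than the AIM implementation: the negative binomial forecast, its parameters, and the bed count are assumptions chosen for the example.

```python
# Minimal sketch (not the AIM implementation): turning a probabilistic
# census forecast into the probability that a capacity threshold is reached.
# Assumes the model outputs a predictive mean and dispersion for tonight's
# ED arrivals; a negative binomial distribution stands in for that output.
from scipy import stats

def prob_capacity_reached(pred_mean: float, pred_dispersion: float,
                          n_isolation_beds: int) -> float:
    """P(patient volume >= n_isolation_beds) under a negative binomial forecast."""
    # Parameterize the negative binomial by its mean and dispersion (size) parameter.
    p = pred_dispersion / (pred_dispersion + pred_mean)
    dist = stats.nbinom(n=pred_dispersion, p=p)
    # Survival function gives P(X > k), so use k - 1 to obtain P(X >= k).
    return dist.sf(n_isolation_beds - 1)

# Example: forecast of ~18 patients needing isolation, 25 isolation beds available.
print(f"P(demand >= capacity) = {prob_capacity_reached(18.0, 10.0, 25):.2f}")
```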
(2) Into the wild [Life-cycle planning]
The goal of ML is to have systems that generalize (i.e., work well) on future data. Most research teams develop their own solutions for the problems that occur on the road to translation, and this ad hoc approach often leads to duplication of effort and sub-optimal decisions for the field as a whole. Planning for the entire life-cycle of an algorithm can help to reduce these inefficiencies.
Implementation science is a well-studied area of research, although most data scientists are unaware of it. A related field, change management, provides a set of leadership principles for understanding why some systems get adopted and maintained while others do not. These concepts are as applicable to a quality-improvement team trying to implement a checklist for surgeons as they are to an ML team trying to embed a deep-learning classifier.
In healthcare-based ML, researchers are primarily concerned with using the “right” model or creating a “representative” test set. Yet these methodological concerns are often unrelated to operational questions such as:
- “What systems are needed to monitor and maintain an algorithm?”
- “How will feedback be enabled from stakeholders?”
Operational questions, however, are not independent of algorithm design; oftentimes they are essential to choosing the right building blocks. If a hospital’s IT system has a 30-minute latency, then a real-time model is naturally infeasible. Algorithms that rank patients according to a risk score should be calibrated to hospital resource constraints. If a diagnostic classifier is used to assign testing and only 100 lab tests are available per day, then the evaluation metric should reflect this constraint. Bringing ML tools to life requires both data science and delivery science.
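To make the testing-budget example concrete, the sketch below scores a classifier by its precision among the 100 highest-risk patients rather than by a threshold-free metric such as AUROC. The data are simulated and the names are illustrative; nothing here comes from a real deployment.

```python
# Hedged sketch: evaluating a diagnostic classifier under a daily testing budget.
# Report precision among the top-k ranked patients, where k matches the number
# of available lab tests, instead of a threshold-free metric.
import numpy as np

def precision_at_k(risk_scores: np.ndarray, labels: np.ndarray, k: int = 100) -> float:
    """Fraction of true positives among the k highest-risk patients."""
    top_k = np.argsort(risk_scores)[::-1][:k]   # indices of the k highest scores
    return labels[top_k].mean()

# Simulated example: 5% prevalence, noisy but informative risk scores.
rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.05, size=2000)
risk_scores = labels * 0.3 + rng.normal(0, 0.2, size=2000)
print(f"Precision@100: {precision_at_k(risk_scores, labels, k=100):.2f}")
```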
Technical considerations are not unimportant, however. Work by Dr. Wiens at the University of Michigan has shown that it is easy to “leak” information in an ML pipeline. For time series in particular, indexing needs to occur relative to an outcome-independent reference point. Good technical work is necessary but not sufficient for successful translation.
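The snippet below illustrates the indexing point with hypothetical column names: features are aggregated relative to a prediction time that would be known prospectively, while the commented-out alternative anchors the window on the outcome and therefore leaks information the model would not have at deployment.

```python
# Illustrative sketch of the leakage pitfall: index features relative to an
# outcome-independent reference (e.g., time of admission or the prediction time),
# never relative to the outcome itself (e.g., hours before sepsis onset).
# Column names are hypothetical.
import pandas as pd

def build_features(vitals: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.Series:
    """Aggregate vitals observed strictly before the prediction time."""
    window = vitals[vitals["charttime"] < prediction_time]        # no future data
    return window[["heart_rate", "resp_rate"]].agg(["mean", "max"]).stack()

vitals = pd.DataFrame({
    "charttime": pd.to_datetime(["2020-01-01 02:00", "2020-01-01 06:00", "2020-01-01 09:00"]),
    "heart_rate": [88, 95, 120],
    "resp_rate": [18, 20, 28],
})
print(build_features(vitals, pd.Timestamp("2020-01-01 08:00")))

# Leaky alternative (do NOT do this): anchoring the window on the outcome, e.g.
#   window = vitals[vitals["charttime"] < onset_time - pd.Timedelta(hours=4)]
# silently conditions the features on knowing when the outcome occurred.
```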
One of the symposium speakers identified five phases in the life cycle of an algorithm: prioritization, assessment, development, deployment, and evaluation. Every hospital will have its own priorities; at SickKids, AIM has developed an intake questionnaire to identify the projects with the greatest partnership potential. The “Beta Principle” of software development does not work in healthcare: the stakes are too high. Instead, deployment and evaluation must begin with a “silent period”, during which model output does not directly influence clinical care.
(3) From the bottom up [Stakeholder involvement]
Successful projects discussed at the symposium required the backing of the C-suite along with a clinical champion on the ground. In other words, healthcare-based ML systems require both top-down and grass-roots support. The Duke Institute for Health Innovation (DIHI) and Kaiser Permanente of Northern California are two institutions with portfolios of successful projects. Both groups have a mandate from their hospital’s leadership and enjoy positive relationships with their clinical staff.
Stakeholder engagement is essential to prevent building a solution in search of a problem. Ask any doctor or nurse to describe their experience with an EHR system and the reviews will not be glowing: digitized systems require doctors to spend time on data entry rather than with their patients. The prospect of yet another “computer” system that will send false alerts or make data demands will be met with a tepid response, at best. One approach to identifying opportunities is to ask stakeholders to complete the Medical Madlib, a template similar to those used in other project-management strategies:
As a [decision maker], if I knew [information], I would do [intervention] to improve [measurable outcome]
Completing this exercise with stakeholders starts a two-way conversation: technical expertise can adapt an algorithm’s output to match the desired “information”, and model trials can then be tested against the specified “measurable outcomes”. The first iteration of the early warning system (EWS) at Penn Medicine is a cautionary example of a tool that showed strong performance during its silent period but failed to improve the targeted measurable outcome.
Conclusion
AIM has taken many of the lessons from the deployment symposium to heart. To meet our mandate of building AI solutions that improve clinical care, we have partnered with clinical champions to build tools from the bottom up. A great example is the medical directives algorithm, which can provide personalized lab and procedure orders for patients.
All AI models in development at SickKids are required to go through a silent-period evaluation, during which their performance can be measured in real time. The output of each model is designed to assist at a specific point in the clinical care pathway. Because the stakes in healthcare are high, our group is committed to testing all algorithms against measurable outcomes in their final assessment phase.
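A minimal sketch of what a silent-period wrapper might look like is shown below; the function name, logging format, and model interface are illustrative assumptions rather than a description of SickKids’ actual pipeline.

```python
# Hedged sketch of a silent-period wrapper (illustrative only): the model scores
# live patients and the predictions are logged for later comparison against
# observed outcomes, but nothing is returned to the clinical front end.
import csv
from datetime import datetime, timezone

def silent_mode_predict(model, patient_id: str, features: list,
                        log_path: str = "silent_period_log.csv") -> None:
    """Score one patient and log the prediction without surfacing it to clinicians."""
    score = model.predict_proba([features])[0][1]  # scikit-learn-style interface assumed
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), patient_id, score])
    # Deliberately no return value and no write-back to clinical systems:
    # clinicians never see the score, but it can later be joined to outcomes.
```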
The divide between research and translation is large, but it is not insurmountable.
Erik Drysdale is a Machine Learning Specialist at the Hospital for Sick Children and the AI in Medicine (AIM) initiative.