Supplement
2 June 2020

Reporting and Implementing Interventions Involving Machine Learning and Artificial Intelligence

Publication: Annals of Internal Medicine
Volume 172, Number 11_Supplement

Abstract

Increasingly, interventions aimed at improving care are likely to use such technologies as machine learning and artificial intelligence. However, health care has been relatively late to adopt them. This article provides clinical examples in which machine learning and artificial intelligence are already in use in health care and appear to deliver benefit. Three key bottlenecks to increasing the pace of diffusion and adoption are methodological issues in evaluation of artificial intelligence–based interventions, reporting standards to enable assessment of model performance, and issues that need to be addressed for an institution to adopt these interventions. Methodological best practices will include external validation, ideally at a different site; use of proactive learning algorithms to correct for site-specific biases and increase robustness as algorithms are deployed across multiple sites; addressing subgroup performance; and communicating to providers the uncertainty of predictions. Regarding reporting, especially important issues are the extent to which standardized approaches for introducing clinical decision support were followed, description of the data sources, reporting of data assumptions, and how biases were addressed. Although most health care organizations in the United States have adopted electronic health records, they may be ill prepared to adopt machine learning and artificial intelligence. Several steps can enable this: preparing data, developing tools to deliver suggestions to clinicians in useful ways, and getting clinicians engaged in the process. Open challenges and the role of regulation in this area are briefly discussed. Although these techniques have enormous potential to improve care and personalize recommendations for individuals, the hype regarding them is tremendous. Organizations will need to approach this domain carefully, with knowledgeable partners, to obtain the hoped-for benefits and avoid failures.
Health care costs continue to grow, and expectations are great that machine learning (ML) (Glossary) and artificial intelligence (AI) (Glossary) will help providers reduce costs and make better decisions that are tailored to the needs and characteristics of individual patients (1–4). The public is already exposed to ML in everyday life—it is, for example, how Amazon predicts what products you might like to buy, or how a search engine displays documents relevant to a query—but its advance into health care has been slow compared with its use in other industries.
Glossary
Three key bottlenecks to increasing the pace of diffusion and adoption of ML and AI are 1) methodological issues involved in increasing the robustness of AI-based interventions, 2) reporting standards to enable assessment of model performance, and 3) issues that need to be addressed for an institution to adopt these interventions. We describe these issues and discuss steps for closing these gaps (Table).
Table. Recommendations for Study Design and Conduct and Reporting in Interventions Involving Machine Learning and Artificial Intelligence

ML and AI

In ML, algorithms are obtained by learning from data instead of being hand-generated on the basis of expert knowledge alone. The type of supervisory signal (Glossary) available in the data distinguishes varieties of learning, such as supervised (Glossary), semi-supervised, weakly supervised, unsupervised (Glossary), and reinforcement learning (4).
For example, consider a goal to predict an output (for example, risk for mortality) using inputs (for example, an individual's demographic information and comorbidities). When a large data set of inputs and corresponding gold-standard outputs is given, supervised learning algorithms can be used. When gold-standard outputs are scarce and difficult to obtain, semi-supervised learning techniques can sometimes help to improve the performance of supervised learning algorithms by leveraging additional unlabeled samples (that is, inputs without outputs). Such techniques are useful because it is often much easier to collect inputs (for example, chest radiographs) than to collect both inputs and outputs (for example, chest radiographs with extensive annotation by a radiologist).
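To make the supervised setting concrete, the following is a minimal sketch in Python using scikit-learn and synthetic data; the features, outcome, and model choice are illustrative assumptions, not drawn from any study cited in this article.

```python
# Minimal sketch of supervised learning for a mortality-risk prediction task.
# Features, data, and model are illustrative assumptions on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Synthetic inputs: age (years), comorbidity count, creatinine (mg/dL).
X = np.column_stack([
    rng.normal(65, 15, n),
    rng.poisson(2, n),
    rng.normal(1.0, 0.3, n),
])
# Synthetic gold-standard output: in-hospital mortality (0/1), generated so
# that risk rises with age and comorbidity burden.
logit = 0.04 * (X[:, 0] - 65) + 0.5 * X[:, 1] - 3.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("Held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```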
Finally, in unsupervised learning, there is no target output to predict. Instead, the goal is often to extract insight from data (for example, by discovering clusters). Unsupervised learning is often a helpful tool for generating hypotheses given a large data set. Deep learning (Glossary) is when learning algorithms use a flexible class of hierarchical models that can approximate highly complex relationships (5). These models are not new and date back to the 1950s, but the ability to learn these from data was limited by relatively small data sets and less powerful computers. The availability of richer models combined with new learning techniques is enabling the development of such technologies as driverless cars.
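A similarly minimal sketch of unsupervised learning for hypothesis generation: clustering synthetic patient features and then inspecting the clusters for clinically suggestive patterns. The features and number of clusters are assumptions made for illustration.

```python
# Minimal sketch of unsupervised learning: clustering patients to generate
# hypotheses about subgroups. Features and cluster count are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic patient features: age, HbA1c, systolic blood pressure.
X = np.column_stack([
    rng.normal(60, 12, 1000),
    rng.normal(7.0, 1.2, 1000),
    rng.normal(130, 18, 1000),
])
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_scaled)

# Inspect cluster sizes and mean features to judge whether the groups suggest
# clinically meaningful phenotypes worth investigating further.
for k in range(3):
    members = X[labels == k]
    print(f"cluster {k}: n={len(members)}, mean features={members.mean(axis=0).round(1)}")
```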
The term artificial intelligence refers to a subfield of computer science that studies algorithms to perform tasks typically associated with cognition. Tasks such as object recognition from images and speech recognition from audio are classic examples of AI. Machine learning is a technique for achieving AI, but it can also be directed at other ends. In health care, unsupervised approaches are generally used for hypothesis development, whereas supervised approaches are used more to generate predictions and recommendations. Deep learning is being widely used in health care, especially for such applications as image analysis, but also in many other scenarios.

Clinical Examples

A range of evaluation studies of clinical ML tools has recently been reported in the literature, addressing many conditions. Overall, it appears that the best approach is to use ML to supplement the performance of clinicians rather than to replace them.
Because it carries a high mortality rate and early detection seems to make a difference, sepsis has been a major focus of ML evaluations (6–10), with many studies using the health-related MIMIC data sets (11). Detecting sepsis clinically requires access to vital sign and laboratory data in real time; tools to predict who may have sepsis, including assessment of the likelihood that it is present; and approaches to contact the responsible clinician rapidly, enable them to take action, and track whether they did so.
Overall, because early recognition and treatment may improve outcomes, studies of algorithms for predicting physiologic deterioration are also common (12–16), with ML-based indexes often outperforming state-of-the-art expert-derived warning scores. Machine learning approaches to predict falls by using clinical data (17), and to detect falls by using wearable sensors and cameras (18–20), are also frequent topics of study.
Most of these studies involved retrospective back-testing of algorithms on existing data; prospective trials of ML-guided interventions remain few. An autonomous AI system to diagnose diabetic retinopathy in the primary care setting, IDx-DR, recently became the first AI-based device to be approved by the U.S. Food and Drug Administration (FDA) (21). The system's developer, IDx LLC, published a pivotal trial of the system, enrolling 900 diabetic patients whose eyes were imaged; the images were then interpreted by the AI system (22). Because it was a pivotal trial, the authors designed the study according to the FDA's standards, registered it, and used the results to support their successful de novo premarket review application to the FDA.
These are just some representative examples. Undoubtedly, ML will be used to improve performance in many scenarios, ranging from the clinical to administrative.

Methodological Issues to Address in Evaluating AI-based Interventions

New methods are needed to evaluate AI-based interventions in 2 key areas. First, AI-based interventions need to be externally validated, and this is more complex than with traditional predictive approaches. Second, AI-based interventions need to accurately quantify and clearly communicate uncertainty, and tools are needed to assess whether and how well this has been done.

External Validation and Bias Correction

One of the most critical needs in the evaluation of artificial intelligence approaches is external validation, mainly because any model that is developed within 1 data set will reflect the idiosyncrasies of that data set and will thus perform less well when the model is taken to new settings. However, this is not being routinely done. For example, in a recent special issue on AI in PLoS Medicine, the editors (including author Suchi Saria) had great difficulty finding models that had been externally validated.
Performance should not only be evaluated in a separate data set; ideally, it should be evaluated externally. One evaluation of how validation was done (23) found many issues: 16% of articles did not include the number of outcomes in the validation set, 54% did not mention the frequency of missing data, and 67% did not report model calibration. For sepsis, for example, certain factors may be more predictive of sepsis in one institution than in another because of differences in case mix. The current marketplace is full of small and large companies promising great results that are difficult to assess, in part because of lack of validation.
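The sketch below shows the kind of external-validation report whose absence reference 23 documented: given a previously trained model and a data set from another site, it tabulates outcome counts, missing-data frequency, discrimination, and calibration. The `model` and `external_df` objects and their column names are hypothetical assumptions.

```python
# Hedged sketch of an external-validation report: outcome counts, missingness,
# discrimination, and calibration on data from a different site.
import numpy as np
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

def external_validation_report(model, external_df: pd.DataFrame,
                               feature_cols: list, outcome_col: str) -> dict:
    report = {
        "n_patients": len(external_df),
        "n_outcomes": int(external_df[outcome_col].sum()),
        "missing_data_frequency": external_df[feature_cols].isna().mean().to_dict(),
    }
    complete = external_df.dropna(subset=feature_cols + [outcome_col])
    y = complete[outcome_col].to_numpy()
    p = model.predict_proba(complete[feature_cols])[:, 1]
    report["auc"] = roc_auc_score(y, p)
    report["brier_score"] = brier_score_loss(y, p)
    # Calibration curve: observed vs. predicted risk across deciles of prediction.
    observed, predicted = calibration_curve(y, p, n_bins=10, strategy="quantile")
    report["calibration_curve"] = list(zip(predicted.round(3), observed.round(3)))
    return report
```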
Although external validation on data from a new hospital is a step in the right direction, this type of evaluation is fundamentally reactive. For each new environment, it is important to check that an algorithm's performance does not degrade. Algorithm performance can also degrade within the same environment as practices change or as demographics within that site change. Take, for instance, a policy guiding when physicians should first order a lactate measurement. At 1 point, a lactate measurement might have been ordered only if a patient was suspected of having sepsis. A predictive model trained on data extracted during that period might then learn that a lactate order, rather than the value itself, was predictive of future sepsis. But if lactate is ordered more routinely—for example, because of a hospital policy—then the model will have learned an unreliable association that could lead to many false-positive alerts (that is, more people will now have lactate orders, which will cause the sepsis model to alert) (24). Proactive learning algorithms aim to avoid relying on these types of unstable information: to correct for bias during learning, they explicitly avoid learning associations that are likely to change or disappear, and they are therefore more robust when moved between hospitals (24–26).
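The synthetic simulation below illustrates this failure mode (it is not the proactive or counterfactual method of references 24 to 26): a model trained where lactate is ordered mainly for suspected sepsis learns that the order itself is predictive, and its discrimination degrades once ordering becomes routine.

```python
# Illustrative, fully synthetic simulation of an unstable, practice-dependent
# feature (a lactate-order indicator) inflating apparent performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def simulate(n, routine_ordering: bool):
    sepsis = rng.binomial(1, 0.1, n)
    heart_rate = rng.normal(85 + 20 * sepsis, 10, n)            # physiologic signal
    p_order = 0.9 if routine_ordering else 0.8 * sepsis + 0.1   # practice-dependent
    lactate_ordered = rng.binomial(1, p_order, n)
    X = np.column_stack([heart_rate, lactate_ordered])
    return X, sepsis

X_train, y_train = simulate(20000, routine_ordering=False)  # training site/era
X_test, y_test = simulate(20000, routine_ordering=True)     # after a policy change

with_order = LogisticRegression().fit(X_train, y_train)
physiology_only = LogisticRegression().fit(X_train[:, :1], y_train)
print("AUC using order indicator:",
      roc_auc_score(y_test, with_order.predict_proba(X_test)[:, 1]).round(3))
print("AUC using physiology only:",
      roc_auc_score(y_test, physiology_only.predict_proba(X_test[:, :1])[:, 1]).round(3))
```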

Testing and Uncertainty

It is common practice to report overall performance in a population. However, predictive models that perform well on average can perform poorly in important patient subgroups. Therefore, 1 important step in assessing model performance is to measure accuracy across multiple segments of the patient population. Studying performance within these segments can build confidence in a model and might also expose model weaknesses that can be addressed in subsequent iterations of development.
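One simple way to operationalize such subgroup reporting is to compute the same metrics within each patient segment; the data frame and column names below are hypothetical.

```python
# Minimal sketch of subgroup performance reporting: discrimination, event rate,
# and mean predicted risk within patient segments rather than only overall.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_performance(predictions_df: pd.DataFrame, group_col: str,
                         outcome_col: str = "outcome",
                         pred_col: str = "predicted_risk") -> pd.DataFrame:
    rows = []
    for group, segment in predictions_df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(segment),
            "event_rate": segment[outcome_col].mean(),
            "mean_predicted_risk": segment[pred_col].mean(),
            "auc": roc_auc_score(segment[outcome_col], segment[pred_col])
                   if segment[outcome_col].nunique() > 1 else float("nan"),
        })
    return pd.DataFrame(rows)

# Example: subgroup_performance(df, group_col="age_band"); segments might also
# be defined by sex, race/ethnicity, comorbidity burden, or admitting service.
```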
Artificial intelligence–based interventions should also communicate when a prediction is uncertain—that is, likely to be unreliable. For example, it may be beneficial to suppress an alert for sepsis or other conditions if the algorithm is highly uncertain; on the other hand, as additional data are collected and the algorithm becomes more certain, raising an alert is appropriate (27). Such approaches can increase the perceived reliability and trustworthiness of AI-based interventions. A prediction can be uncertain for several reasons. For example, the learning model may not have been exposed to enough similar samples in the development data set (model uncertainty). Alternatively, the target outcomes that the model is predicting may be noisy (data uncertainty). Bayesian inference is a common approach for obtaining uncertainty estimates, and new alternatives yield audit tools that can help determine both model and data uncertainty without rerunning the learning algorithm (28–30). The recent tutorial by Saria and Subbaswamy (31) provides a broader introduction to safety and reliability in ML.
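A hedged sketch of one simple way to surface uncertainty: an ensemble of models trained on bootstrap resamples, in the spirit of, but much simpler than, the methods in references 28 to 30. Alerts are suppressed when ensemble members disagree strongly; the thresholds and model choice are illustrative assumptions.

```python
# Sketch: bootstrap ensemble for uncertainty-aware alerting. Disagreement
# across members is used as a crude uncertainty signal; thresholds are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import resample

def fit_ensemble(X, y, n_members=10, seed=0):
    members = []
    for m in range(n_members):
        Xb, yb = resample(X, y, random_state=seed + m)   # bootstrap resample
        members.append(GradientBoostingClassifier(random_state=m).fit(Xb, yb))
    return members

def predict_with_uncertainty(members, X):
    preds = np.stack([m.predict_proba(X)[:, 1] for m in members])
    return preds.mean(axis=0), preds.std(axis=0)         # risk estimate, disagreement

def alert_decision(mean_risk, risk_std, risk_threshold=0.5, uncertainty_threshold=0.15):
    # Alert only when risk is high AND the ensemble is reasonably certain;
    # otherwise defer until more data accumulate and uncertainty shrinks.
    if mean_risk >= risk_threshold and risk_std < uncertainty_threshold:
        return "alert"
    if mean_risk >= risk_threshold:
        return "defer: prediction too uncertain"
    return "no alert"
```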

Need for Standards in Reporting

The emergence of AI and ML does not avoid the pitfalls of classical predictive models, and the size and complexity of ML models pose new challenges in determining whether a model will generalize from the training data. Although ML researchers are actively working on new ways to interpret and understand the rationale behind model predictions, modern neural networks are still rather opaque. This makes it difficult, for example, to determine whether a model is leveraging site-specific associations that depend on practice patterns and will probably not generalize to new institutions.
Size and complexity also often limit the ability of peer-reviewed publications to give readers the opportunity to review model contents in the same detail as they would for traditional logistic predictive models (that is, by listing model coefficients and reporting univariate analyses).
It is also important to report implementation challenges alongside performance metrics. Although analysts can often use ML to obtain impressive results quickly, these methods incur initial costs of data curation and cleaning, as well as ongoing costs of testing, updating, and governing the tools and data models over time. In the computer science literature, the burden of maintaining ML systems has been labeled "technical debt" (32) and is present even within organizations that have sophisticated computing infrastructures (such as Google). As a result, understanding the incremental benefits (for example, statistical or implementation) of more advanced tools compared with past approaches seems important. For example, was an ML model for readmission tested against LACE (33) or HOSPITAL (34) scores by using data from the same site? Institutions should take care to adopt ML systems that have been benchmarked against existing models.
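As a sketch of such a benchmark, the following compares an ML readmission model's discrimination with an existing score (for example, LACE) on the same site's data, with a bootstrap interval for the difference. The column names, and the assumption that a LACE score has already been computed per reference 33, are illustrative.

```python
# Sketch of a head-to-head benchmark on one site's data: ML model vs. an
# existing score (assumed precomputed in `lace_score`). Names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def compare_to_baseline(df: pd.DataFrame, outcome_col: str = "readmitted_30d",
                        ml_col: str = "ml_predicted_risk",
                        baseline_col: str = "lace_score",
                        n_boot: int = 1000, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(df), len(df))     # bootstrap resample of patients
        sample = df.iloc[idx]
        if sample[outcome_col].nunique() < 2:
            continue
        deltas.append(roc_auc_score(sample[outcome_col], sample[ml_col])
                      - roc_auc_score(sample[outcome_col], sample[baseline_col]))
    deltas = np.array(deltas)
    return {
        "auc_ml": roc_auc_score(df[outcome_col], df[ml_col]),
        "auc_baseline": roc_auc_score(df[outcome_col], df[baseline_col]),
        "auc_difference_95ci": (np.percentile(deltas, 2.5), np.percentile(deltas, 97.5)),
    }
```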
Setting aside the methodology of the underlying AI/ML model, a key issue for quality-improvement reporting relates to how the tool was implemented in clinical care (if at all). What was the implementation approach? Was it clinical decision support (CDS) at the point of care (such as a best-practice alert), CDS as a data display (such as a risk dashboard or laboratory result–style display), or CDS as nudge-embedded support (such as AI-driven presentation of risk or treatment choice)?

Why Existing Frameworks Are Insufficient

The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) criteria, which are the current standard for reporting predictive models (35), are useful for standardizing the reporting of research-derived risk indices but have important limitations when applied to ML models using observational data. A large proportion of the TRIPOD criteria focus on data validity and the need to repeat or retest validation to ensure model calibration—needs that carry over to ML models. However, other aspects of the TRIPOD criteria, such as those related to variable choice, do not translate to AI and ML approaches. The TRIPOD criteria provide guidance around variable selection that, when carried out using traditional techniques, avoids selection of variables that are collinear, confounded, or have little clinical plausibility. Many of these specific recommendations do not apply to ML models. However, both TRIPOD (35) and the ML community are actively developing new methods and best practices for achieving model stability, bias adjustment, and interpretability.
Another issue is that ML models are typically trained on large observational data sets, which pose many challenges not faced in research-derived data, including data availability, outcome ascertainment, and biases in documentation of predictor and adjuster variables. Thus, adaptation will be needed if such guidelines as TRIPOD are to be applied to evaluations of ML models.

Need for Standards in Tool Reporting

Because AI and ML models are often used as CDS in practice, reporting on the actual implementation of these tools should adhere to standard approaches for introducing CDS into health care (36). A nuance to this recommendation arises when the CDS is derived from a black-box or unsupervised model, which may require both additional scrutiny in the governance step and a higher level of inspection of the underlying data models before local implementation.
When reporting in the literature and, potentially, when governed locally, several key areas are critical. As a framing step, studies should report whether the AI or ML model was tested side by side with existing tools, if available. Investigators should also be careful not to overemphasize higher C-statistics as a primary measure of incremental benefit.
Data sources and curation approaches are key. Information about the types of data fields considered should be provided, along with how variables were selected for inclusion in the model. At least as important, the study should report which variables came from codified and standardized sources (for example, laboratory results, pulse or blood pressure, International Classification of Diseases codes, Current Procedural Terminology codes), which variables were derived from noncodified sources (for example, natural language processing, image recognition), and how these variables were derived and validated against gold standards. Although the 10-events-per-predictor rule is not a hard-and-fast guideline even for logistic models used in sensitivity or secondary analyses, there may be value in reporting it for AI and ML models, if only to give the reader a sense of model coverage and bias.
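A small sketch of how such reporting might be generated mechanically: tallying variable provenance (codified versus derived sources) and events per predictor. The variable catalog and counts are invented for illustration.

```python
# Sketch of provenance and events-per-predictor reporting; the catalog of
# variables and the event count below are hypothetical.
variable_sources = {
    "serum_lactate": "codified:lab",
    "heart_rate": "codified:vital",
    "icd10_sepsis_history": "codified:ICD",
    "smoking_status_nlp": "derived:natural_language_processing",
    "infiltrate_on_cxr": "derived:image_recognition",
}

def provenance_summary(variables: dict, n_events: int) -> dict:
    n_codified = sum(v.startswith("codified") for v in variables.values())
    n_derived = sum(v.startswith("derived") for v in variables.values())
    return {
        "n_predictors": len(variables),
        "n_codified": n_codified,
        "n_derived": n_derived,
        "events_per_predictor": round(n_events / len(variables), 1),
    }

print(provenance_summary(variable_sources, n_events=40))
# -> {'n_predictors': 5, 'n_codified': 3, 'n_derived': 2, 'events_per_predictor': 8.0}
```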
Data assumptions, such as how missing data were handled and approaches to imputing data, either as part of the ML or AI model itself or in data curation, are critical. Again, this may become overwhelming in a setting in which dozens of variables are used, but the approach should be specified.
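One way to make these assumptions explicit and reportable is to encode them in the preprocessing step itself. The sketch below, with hypothetical variable names, uses scikit-learn imputers with missingness indicators; it illustrates specifying the approach, not a recommendation of any particular strategy.

```python
# Sketch: per-variable imputation strategy made explicit in code, with
# missingness indicators so the model can learn from absence of a measurement.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

numeric_vars = ["serum_lactate", "heart_rate", "creatinine"]
categorical_vars = ["admission_source"]

preprocessing = ColumnTransformer([
    # Median imputation for labs/vitals, plus an indicator column per variable.
    ("numeric", SimpleImputer(strategy="median", add_indicator=True), numeric_vars),
    # Mode imputation for categorical fields.
    ("categorical", SimpleImputer(strategy="most_frequent"), categorical_vars),
])

# The fitted `preprocessing` step (strategies, indicator columns, variables
# covered) is what a report of "data assumptions" should describe.
```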
As mentioned, reports should specify how authors avoided pitfalls in observational studies, such as the censoring of data in follow-up or longitudinal models, and ascertainment bias (studies in which patients may enter a model with different amounts or types of data available, such as patients referred to a health system from another). For models seeking not only to assess risk but also to recommend treatments, authors must describe how well treatment and allocation biases inherent in observational data were handled.
Furthermore, authors should report how they accounted for biases that disproportionately affect vulnerable sociodemographic groups, and whether their sample includes a locally or nationally representative sample of these patients.
Multiple investigations of model interpretability for health care applications and CDS have produced such approaches as RETAIN (37), GBT (38), Dipole (39), and RetainVis (40). Leveraging these can make model predictions easier for clinicians to use and may make them more likely to accept suggestions.

Issues in Sharing Data and Sharing Code to Make Research More Reproducible

One way to make external validation of new algorithms easier is to facilitate code and data sharing between groups so that work is reproducible. However, there is no single common data model for health care—in fact, there are many to choose from (41), which complicates sharing in many ways. Furthermore, sharing data is challenging because of patient privacy regulations. Standard practice is to anonymize the data, but this can be difficult. Sharing code is generally easier but often not as useful as one might expect. Any ML system can be coarsely divided into 2 components: data preparation and learning. Some may find it surprising that the data preparation software is often the more complex component, and a data preparation pipeline built for 1 institution is often not useful at another. For example, deployments of the electronic health record (EHR) system Epic are customized to each institution. A data preparation pipeline built to ingest data from one Epic deployment may not be useful at another institution, even one that also uses Epic, because the data may be stored using different structures (42–44).

Adopting AI or ML at an Institution

Most organizations will want to begin implementing AI or ML sometime soon. However, even for organizations that have already adopted EHRs, this undertaking is far from trivial. One major roadblock is that many EHRs include specific items, such as code status, in as many as 25 different fields, so reconciliation is problematic. Another problem is that EHRs have not generally been designed to allow extraction of real-time data, so predicting an event, such as sepsis or clinical decompensation, requires setting up a new database. Every organization will need to address preparing its data, getting suggestions to clinicians in useful ways, and engaging clinicians so that they act on those suggestions.

Data Preparation

Data preparation pipelines typically extract data from an institution's EHR system and transform it into a format that can then be consumed by a learning algorithm. These software systems are complex, and it is easy to make programming errors. An important step in building reliable ML systems is to check the veracity of the extracted data. For example, the rules in an extraction pipeline could miss medication orders that are placed by the pharmacy rather than by providers. Another common source of error is inconsistency in the units used to record test results or medication dosages.
There are currently no techniques that we are aware of to preclude such errors. One strategy to help weed these errors out, however, is to perform simple sanity checks on aggregate statistics, such as counts and averages, and verify that they match expectations. This might help catch, for instance, an unusually low number of orders extracted for a medication. A more formal strategy is to define a data schema that encodes expected properties of individual pieces of data and of aggregate statistics (45). For example, one might expect that dosages lie within a certain range. Data schemas are used to automatically check when data violate expectations and can help to identify examples of the violation. Example violations are useful to help track down bugs in the software pipeline. In some cases, they may lead to revision of the schema and better understanding of the data.
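A minimal sketch of such a schema and its checks, written without any particular library (reference 45 describes a production-scale version); the field names, expected ranges, and aggregate expectations are illustrative assumptions.

```python
# Sketch: declarative data schema plus aggregate sanity checks run against an
# extracted data frame. Field names, ranges, and counts are hypothetical.
import pandas as pd

SCHEMA = {
    "heparin_dose_units": {"allowed": {"units"}},        # dosing-unit consistency
    "heparin_dose_value": {"min": 0, "max": 40000},      # expected dose range
    "serum_creatinine_mg_dl": {"min": 0.1, "max": 20.0},
}
AGGREGATE_EXPECTATIONS = {
    "heparin_orders_per_day_min": 5,   # roughly how many orders we expect per day
}

def check_schema(df: pd.DataFrame) -> list:
    violations = []
    for col, rules in SCHEMA.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            continue
        if "allowed" in rules:
            bad = df.loc[~df[col].isin(rules["allowed"]), col]
            if len(bad):
                violations.append(f"{col}: {len(bad)} rows with unexpected values, e.g. {bad.iloc[0]!r}")
        if "min" in rules:
            out = df[(df[col] < rules["min"]) | (df[col] > rules["max"])]
            if len(out):
                violations.append(f"{col}: {len(out)} rows outside [{rules['min']}, {rules['max']}]")
    if "order_date" in df.columns:
        orders_per_day = len(df) / max(df["order_date"].nunique(), 1)
        if orders_per_day < AGGREGATE_EXPECTATIONS["heparin_orders_per_day_min"]:
            violations.append(f"unexpectedly few orders per day: {orders_per_day:.1f}")
    return violations
```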
Another challenge that must be addressed when extracting data for an ML system is to carefully define a useful target for the model to predict. The criteria used to define many important adverse events are working models based on the medical community's current understanding of a disease. The implication for an ML system is that the value of the model's predictions is only as good as the underlying criteria used to define the target. For example, acute kidney injury is an important critical illness with 2 recent definitions: RIFLE (46) and KDIGO (47). Septic shock, a condition that has received considerable attention from ML researchers, has been redefined several times over the past 2 decades. Such criteria as KDIGO are good starting points for defining target variables in an ML system, but they must often be augmented with additional criteria that help to better align the target variables with the purpose of the system.
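As an illustration of deriving a target variable from consensus criteria, the sketch below labels acute kidney injury using a simplified version of the KDIGO creatinine criteria (47); it ignores the urine-output criterion and the practical subtleties of choosing a baseline creatinine, and the data layout is an assumption.

```python
# Simplified sketch of target-variable labeling from consensus criteria
# (KDIGO creatinine criteria only; urine output and baseline selection omitted).
import pandas as pd

def label_aki_kdigo(creatinine: pd.DataFrame) -> bool:
    """creatinine: one patient's results with columns ['time', 'value_mg_dl'],
    where 'time' is a datetime column."""
    creatinine = creatinine.sort_values("time")
    for _, row in creatinine.iterrows():
        window_48h = creatinine[(creatinine["time"] >= row["time"] - pd.Timedelta("48h")) &
                                (creatinine["time"] <= row["time"])]
        # Criterion 1: rise of >= 0.3 mg/dL within 48 hours.
        if row["value_mg_dl"] - window_48h["value_mg_dl"].min() >= 0.3:
            return True
        window_7d = creatinine[(creatinine["time"] >= row["time"] - pd.Timedelta("7D")) &
                               (creatinine["time"] <= row["time"])]
        # Criterion 2: rise to >= 1.5x the lowest value in the prior 7 days
        # (used here as a crude stand-in for baseline creatinine).
        if row["value_mg_dl"] >= 1.5 * window_7d["value_mg_dl"].min():
            return True
    return False
```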

Getting Suggestions to Clinicians in Useful Ways

Obtaining benefit from CDS in general demands that providers receive suggestions at specific times in their workflow and in ways that help them make better decisions (48, 49). This will certainly be true for AI- and ML-related suggestions. An example would be a provider who is admitting a patient from the emergency department and needs to decide what level of care the patient should go to. In addition to being delivered in a timely way, it is critical how normative a suggestion is, and whether it interrupts the clinician in his or her workflow. In general, with AI, the goal is to advise the clinician, including an assessment about uncertainty regarding a prediction, but to leave the choice up to him or her.

Clinician Involvement

Clinician involvement in model building represents an important check on variable plausibility and underlying biases; it may also be an important approach to understanding issues around documentation biases, and this might also reduce the influence of sociodemographic biases in care. The practical approach to including clinicians in model building remains to be seen, however, and may best be applied at the time the model is being considered for adoption (for example, during governance). Allowing interpretability and direct inspection is of growing interest, with new approaches being developed.

Open Challenges

Many challenges in using AI and ML still need to be addressed, and tools for improving model reliability are paramount. One challenge is that the dominant paradigm in ML is statistical and is formulated primarily in terms of generalization to the distribution from which the training data were gathered. As discussed in earlier sections of this article, models in health care need to generalize beyond the training data distribution (for example, because a hospital serves a different population from the one in which the data were gathered, or because policies at a hospital change).

Role of Regulation

The FDA has primary authority to regulate software, but it has generally taken a light hand in this area, mainly because many of its usual approaches, which have worked well for drugs and devices, do not generalize to software that must evolve rapidly, and heavier regulation could stifle innovation. It has elected to regulate more tightly software in devices that make normative recommendations—for example, about the insulin dosage for type 1 diabetes.
The development of ML poses a new challenge to current regulatory frameworks, specifically because of how models evolve and because many of the methods are not readily explainable. Similar evolution will be needed for other new technologies, such as driverless cars: even if they prove safer overall than cars with drivers, backlash is likely when crashes occur. In health care, there is likely to be substantial reliance for some time on humans to filter the recommendations of ML, following the learned intermediary model.
In conclusion, although AI and ML have huge potential in health care, the hype regarding their benefits has been enormous, and it will be crucial to measure the incremental benefits of implementing these approaches. To this end, development of methodology to enable such assessments will be critical. Specific methodological issues, enumerated in this article, apply in this area, and more robust standards for reporting evaluations of these technologies will be pivotal; we make suggestions about such standards here. Finally, organizations will need specific infrastructure, which must be not only technical but also sociotechnical, so that providers leverage these technologies to improve care while minimizing their potential for harm.

References

1. Parikh RB, Kakad M, Bates DW. Integrating predictive analytics into high-value care: the dawn of precision delivery. JAMA. 2016;315:651-2. [PMID: 26881365] doi: 10.1001/jama.2015.19417
2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44-56. [PMID: 30617339] doi: 10.1038/s41591-018-0300-7
3. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff (Millwood). 2014;33:1123-31. [PMID: 25006137] doi: 10.1377/hlthaff.2014.0041
4. Saria S, Butte A, Sheikh A. Better medicine through machine learning: what's real, and what's artificial? [Editorial]. PLoS Med. 2018;15:e1002721. [PMID: 30596635] doi: 10.1371/journal.pmed.1002721
5. Kim KG. Book review: Deep Learning. Healthc Inform Res. 2016;22:351. doi: 10.4258/hir.2016.22.4.351
6. Gultepe E, Green JP, Nguyen H, et al. From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. J Am Med Inform Assoc. 2014;21:315-25. [PMID: 23959843] doi: 10.1136/amiajnl-2013-001815
7. Calvert JS, Price DA, Chettipally UK, et al. A computational approach to early sepsis detection. Comput Biol Med. 2016;74:69-73. [PMID: 27208704] doi: 10.1016/j.compbiomed.2016.05.003
8. Desautels T, Calvert J, Hoffman J, et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR Med Inform. 2016;4:e28. [PMID: 27694098]
9. Henry KE, Hager DN, Pronovost PJ, et al. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7:299ra122. [PMID: 26246167] doi: 10.1126/scitranslmed.aab3719
10. Strickland E. Hospitals roll out AI systems to keep patients from dying of sepsis. IEEE Spectrum. 19 October 2018. Accessed at https://spectrum.ieee.org/biomedical/diagnostics/hospitals-roll-out-ai-systems-to-keep-patients-from-dying-of-sepsis on 11 February 2019.
11. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. [PMID: 27219127] doi: 10.1038/sdata.2016.35
12. Churpek MM, Yuen TC, Winslow C, et al. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med. 2016;44:368-74. [PMID: 26771782] doi: 10.1097/CCM.0000000000001571
13. Finlay GD, Rothman MJ, Smith RA. Measuring the modified early warning score and the Rothman index: advantages of utilizing the electronic medical record in an early warning system. J Hosp Med. 2014;9:116-9. [PMID: 24357519] doi: 10.1002/jhm.2132
14. Clifton L, Clifton DA, Watkinson PJ, et al. Identification of patient deterioration in vital-sign data using one-class support vector machines [Abstract]. In: Ganzha M, Maciaszek L, Paprzycki M, eds. Proceedings of the Federated Conference on Computer Science and Information Systems, Szczecin, Poland, 18–21 September 2011. Los Alamitos, CA: IEEE Computer Soc Pr; 2011:125-31.
15. Piper GL, Kaplan LJ, Maung AA, et al. Using the Rothman index to predict early unplanned surgical intensive care unit readmissions. J Trauma Acute Care Surg. 2014;77:78-82. [PMID: 24977759] doi: 10.1097/TA.0000000000000265
16. Johnson AE, Ghassemi MM, Nemati S, et al. Machine learning and decision support in critical care. Proc IEEE Inst Electr Electron Eng. 2016;104:444-66. [PMID: 27765959]
17. Marschollek M, Gövercin M, Rust S, et al. Mining geriatric assessment data for in-patient fall prediction models and high-risk subgroups. BMC Med Inform Decis Mak. 2012;12:19. [PMID: 22417403] doi: 10.1186/1472-6947-12-19
18. Ojetola O, Gaura EI, Brusey J. Fall detection with wearable sensors—SAFE (SmArt Fall dEtection). Proceedings of the Seventh International Conference on Intelligent Environments. 2011;318-21. doi: 10.1109/IE.2011.38
19. Ni B, Dat NC, Moulin P. RGBD-camera based get-up event detection for hospital fall prevention [Abstract]. Presented at 37th International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 25–30 March 2012. Accessed at www.ifp.illinois.edu/~moulin/Papers/ICASSP12-BB.pdf on 5 February 2019.
20. Pannurat N, Thiemjarus S, Nantajeewarawat E. Automatic fall monitoring: a review. Sensors (Basel). 2014;14:12900-36. [PMID: 25046016] doi: 10.3390/s140712900
21. U.S. Food and Drug Administration. FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. 11 April 2018. Accessed at www.fda.gov/newsevents/newsroom/pressannouncements/ucm604357.htm on 5 February 2019.
22. Abràmoff MD, Lavin PT, Birch M, et al. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit Med. 2018;1:39. [PMID: 31304320] doi: 10.1038/s41746-018-0040-6
23. Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14:40. [PMID: 24645774] doi: 10.1186/1471-2288-14-40
24. Schulam P, Saria S. Reliable decision support using counterfactual models [Abstract]. In: Bengio S, Wallach H, Larochelle H, et al, eds. Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, California, 4–9 December 2017. San Diego: Neural Information Processing Systems; 2017.
25. Subbaswamy A, Saria S. Counterfactual normalization: proactively addressing dataset shift using causal mechanisms. Uncertainty in Artificial Intelligence. 2017. Accessed at https://arxiv.org/pdf/1808.03253.pdf on 27 December 2018.
26. Subbaswamy A, Schulam P, Saria S. Preventing failures due to dataset shift: learning predictive models that transport [Abstract]. In: Chaudhuri K, Sugiyama M, eds. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019. Accessed at http://proceedings.mlr.press/v89/subbaswamy19a/subbaswamy19a.pdf on 19 June 2019.
27. Soleimani H, Hensman J, Saria S. Scalable joint models for reliable uncertainty-aware event prediction. IEEE Trans Pattern Anal Mach Intell. 2018;40:1948-63. [PMID: 28841550] doi: 10.1109/TPAMI.2017.2742504
28. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles [Abstract]. In: Bengio S, Wallach H, Larochelle H, et al, eds. Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, California, 4–9 December 2017. San Diego: Neural Information Processing Systems; 2017:6402-13.
29. Schulam P, Saria S. Can you trust this prediction? Auditing pointwise reliability after learning [Abstract]. In: Chaudhuri K, Sugiyama M, eds. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019. Accessed at http://proceedings.mlr.press/v89/schulam19a/schulam19a.pdf on 19 June 2019.
30. Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks [Abstract]. In: Precup D, Teh YW, eds. Proceedings of the 34th International Conference on Machine Learning, Sydney, New South Wales, Australia, 6–11 August 2017. Accessed at https://arxiv.org/pdf/1706.04599.pdf on 5 February 2019.
31. Saria S, Subbaswamy A. Tutorial: safe and reliable machine learning [Abstract]. Presented at 2019 ACM Conference on Fairness, Accountability, and Transparency (FAT* 2019), Atlanta, Georgia, 29–31 January 2019. Accessed at https://arxiv.org/pdf/1904.07204.pdf on 11 September 2019.
32. Sculley D, Holt G, Golovin D, et al. Hidden technical debt in machine learning systems [Abstract]. In: Cortes C, Lawrence ND, Lee DD, et al, eds. Proceedings of the 28th Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, 7–12 December 2015. San Diego: Neural Information Processing Systems; 2017.
33. van Walraven C, Dhalla IA, Bell C, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ. 2010;182:551-7. [PMID: 20194559] doi: 10.1503/cmaj.091117
34. Donzé J, Aujesky D, Williams D, et al. Potentially avoidable 30-day hospital readmissions in medical patients: derivation and validation of a prediction model. JAMA Intern Med. 2013;173:632-8. [PMID: 23529115] doi: 10.1001/jamainternmed.2013.3023
35. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393:1577-9. [PMID: 31007185] doi: 10.1016/S0140-6736(19)30037-6
36. Tcheng JE, Bakken S, Bates DW, et al. Optimizing Strategies for Clinical Decision Support: Summary of a Meeting Series. Washington, DC: National Academies Pr; 2017.
37. Choi E, Bahadori MT, Sun J, et al. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism [Abstract]. In: Lee DD, Sugiyama M, Luxburg UV, et al, eds. Proceedings of the 29th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. San Diego: Neural Information Processing Systems; 2016:3504-12.
38. Che Z, Purushotham S, Khemani R, et al. Interpretable deep models for ICU outcome prediction. AMIA Annu Symp Proc. 2016;2016:371-80. [PMID: 28269832]
39. Ma F, Chitta R, Zhou J, et al. Dipole: diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks [Abstract]. In: Matwin S, Yu S, Farooq F, eds. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Nova Scotia, Canada, 13–17 August 2017. New York: Association for Computing Machinery; 2017:1903-11. doi: 10.1145/3097983.3098088
40. Kwon BC, Choi MJ, Kim TJ, et al. RetainVis: visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Trans Vis Comput Graph. 2019;25:299-309.
41. Garza M, Del Fiol G, Tenenbaum J, et al. Evaluating common data models for use with a longitudinal community registry. J Biomed Inform. 2016;64:333-41. [PMID: 27989817] doi: 10.1016/j.jbi.2016.10.016
42. Announcement: where are the data? [Editorial]. Nature. 2016;537:138. [PMID: 27604913] doi: 10.1038/537138a
43. Choi E, Biswal S, Malin B, et al. Generating multi-label discrete patient records using generative adversarial networks [Abstract]. In: Doshi-Velez F, Fackler J, Kale D, et al, eds. Proceedings of Machine Learning for Healthcare 2017, Boston, Massachusetts, 18–19 August 2017. Accessed at https://arxiv.org/pdf/1703.06490.pdf on 26 December 2018.
44. Baowaly MK, Lin CC, Liu CL, et al. Synthesizing electronic health records using improved generative adversarial networks. J Am Med Inform Assoc. 2019;26:228-41. [PMID: 30535151] doi: 10.1093/jamia/ocy142
45. Baylor D, Breck E, Cheng HT, et al. TFX: a TensorFlow-based production-scale machine learning platform [Abstract]. In: Matwin S, Yu S, Farooq F, eds. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Nova Scotia, Canada, 13–17 August 2017. New York: Association for Computing Machinery; 2017:1387-95. doi: 10.1145/3097983.3098021
46. Ricci Z, Cruz D, Ronco C. The RIFLE criteria and mortality in acute kidney injury: a systematic review. Kidney Int. 2008;73:538-46. [PMID: 18160961]
47. Kidney Disease: Improving Global Outcomes (KDIGO) Acute Kidney Injury Work Group. KDIGO clinical practice guideline for acute kidney injury. Kidney Int Suppl. 2012;2:1-138. doi: 10.1038/kisup.2012.1
48. Bates DW, Kuperman GJ, Wang S, et al. Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality. J Am Med Inform Assoc. 2003;10:523-30. [PMID: 12925543]
49. Emanuel EJ, Wachter RM. Artificial intelligence in health care: will the value match the hype? JAMA. 2019;321:2281-2. [PMID: 31107500] doi: 10.1001/jama.2019.4914


Information & Authors

Information

Published In

Annals of Internal Medicine
Volume 172, Number 11_Supplement, 2 June 2020
Pages: S137 - S144

History

Published online: 2 June 2020
Published in issue: 2 June 2020

Authors

Affiliations

David W. Bates, MD, MSc
Brigham and Women's Hospital, Boston, Massachusetts (D.W.B., A.W.)
Andrew Auerbach, MD
University of California, San Francisco, California (A.A.)
Peter Schulam, PhD
Whiting School of Engineering, Baltimore, Maryland (P.S., S.S.)
Adam Wright, PhD
Brigham and Women's Hospital, Boston, Massachusetts (D.W.B., A.W.)
Suchi Saria, PhD
Whiting School of Engineering, Baltimore, Maryland (P.S., S.S.)
Financial Support: By the Gordon and Betty Moore Foundation, Palo Alto, California.
Disclosures: Dr. Bates reports grants from the Gordon and Betty Moore Foundation during the conduct of the study, and grants from EarlySense and IBM Watson Health and personal fees from EarlySense CDI (Negev), Ltd, ValeraHealth, Clew, and MDClone outside the submitted work. Dr. Schulam reports personal fees from Bayesian Health outside the submitted work. Dr. Saria reports grants from the Gordon and Betty Moore Foundation, National Science Foundation, National Institutes of Health, Defense Advanced Research Projects Agency (DARPA), and the American Heart Association outside the submitted work. Dr. Saria is a founder of and holds equity in Bayesian Health. This arrangement has been reviewed and approved by the Johns Hopkins University in accordance with its conflict of interest policies. She is the scientific advisory board member for PatientPing. She has received honoraria for talks from a number of biotechnology, research, and health technology companies. She has no direct ties to the companies mentioned in this article. Authors not named here have disclosed no conflicts of interest. Disclosures can also be viewed at www.acponline.org/authors/icmje/ConflictOfInterestForms.do?msNum=M19-0872.
Corresponding Author: David W. Bates, MD, MSc, Division of General Internal Medicine, Brigham and Women's Hospital, 1620 Tremont Street, Boston, MA 02115.
Current Author Addresses: Dr. Bates: Division of General Internal Medicine, Brigham and Women's Hospital, 1620 Tremont Street, Boston, MA 02115.
Dr. Auerbach: University of California, San Francisco, 533 Parnassus Avenue, San Francisco, CA 94117.
Dr. Schulam: Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218.
Dr. Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Avenue, Suite 1475, Nashville, TN 37203.
Dr. Saria: Malone Hall #333, 3400 North Charles Street, Baltimore, MD 21218.
Author Contributions: Conception and design: D.W. Bates, A. Auerbach, A. Wright, S. Saria.
Drafting of the article: D.W. Bates, A. Auerbach, P. Schulam, S. Saria.
Critical revision of the article for important intellectual content: D.W. Bates, A. Auerbach, A. Wright, S. Saria.
Final approval of the article: D.W. Bates, A. Auerbach, A. Wright.
Statistical expertise: A. Wright.
Obtaining of funding: D.W. Bates, A. Auerbach.
Administrative, technical, or logistic support: D.W. Bates, A. Auerbach.
This article is part of the Annals supplement “Implementing, Studying, and Reporting Health System Improvement in the Era of Electronic Health Records.” The Moore Foundation (contract number 7107) provided funding for publication of this supplement. Andrew D. Auerbach, MD, MPH (University of California, San Francisco); David W. Bates, MD, MSc (Brigham and Women's Hospital, Harvard Medical School, and Harvard School of Public Health); Jaya K. Rao, MD, MHS (Annals Deputy Editor); and Christine Laine, MD, MPH (Annals Editor in Chief), served as editors for this supplement.
