Abstract
AIM To predict the individual neonatal mortality risk of preterm infants using an artificial neural network “trained” on admission data.
METHODS A total of 890 preterm neonates (<32 weeks gestational age and/or <1500 g birthweight) were enrolled in our retrospective study. The neural network was trained on infants born between 1990 and 1993. The predictive value was tested on infants born in the subsequent three years.
RESULTS The artificial neural network performed significantly better than a logistic regression model (area under the receiver operator curve 0.95 vs 0.92). Survival was associated with high morbidity if the predicted mortality risk was greater than 0.50. There were no preterm infants with a predicted mortality risk of greater than 0.80. The mortality risks of two non-survivors with birthweights >2000 g and severe congenital disease had largely been underestimated.
CONCLUSION An artificial neural network trained on admission data can accurately predict the mortality risk for most preterm infants. However, the significant number of prediction failures renders it unsuitable for individual treatment decisions.
- artificial neural network
- mortality
- prediction
Mortality has traditionally been used to compare the therapeutic performances of neonatal intensive care units.1-3 As individual mortality risk depends on numerous confounding factors, crude mortality can be adjusted for prematurity,4 5 birthweight,1 6 initial illness severity,7-9 presence of congenital malformations, or a combination of these factors, for the purposes of quality assessment.2 3 7
The Score for Neonatal Acute Physiology (SNAP), Perinatal Extension (SNAP-PE),7 Clinical Risk Index for Babies (CRIB) score7-9 and alternative models10 are excellent predictors of overall inpatient mortality of preterm neonates if accuracy is judged by the area under the receiver operating characteristic (ROC) curve.7-9 11 Despite such excellent results a significant number of individual prediction failures still occur. Case analysis of non-survivors with a low predicted individual mortality risk, or of survivors with a very high predicted individual mortality risk might yield further insight into the pitfalls of scoring systems.
As the quality of early therapeutic interventions might significantly influence the degree of illness severity,12 data on initial illness severity are compromised if collected later than a few hours after birth.7-9 Thus several variables contributing to CRIB or SNAP scores are regarded as input as well as output data with respect to intensive care unit performance.7-9 11 Shortening the data collection period is desirable.7-9
Artificial neural networks (ANN) are software tools with the capacity to learn. An ANN behaves like a child learning to differentiate between cats and dogs, by means of examples, under the supervision of his/her parents. An ANN can learn the relation between input variables (size, fleece, voice, behaviour patterns) and the output variable (cat or dog) when multiple input–output pairs are presented to it (supervised learning). After this learning or “training” period the ANN can “predict” the output (cat or dog) for further, previously unseen examples.13 14 This capability is called generalisation. ANNs have shown excellent predictive accuracy in medicine even on inaccurate or incomplete input data. Recent reviews describe the many clinical applications of ANN.15-20
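As a toy illustration of supervised learning and generalisation (using scikit-learn, not any software from this study), a small network can be fitted to labelled input–output pairs and then asked to classify an example it has never seen; the features and labels below are invented.

```python
# Toy sketch of supervised learning: labelled input-output pairs are shown to a
# small neural network, which is then asked to classify an unseen example.
# The two features per animal and the 0 = cat / 1 = dog coding are made up.
from sklearn.neural_network import MLPClassifier

X_train = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.9, 0.7]]   # input variables
y_train = [0, 0, 1, 1]                                        # known outputs

net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)              # supervised "training" on known pairs

print(net.predict([[0.85, 0.8]]))      # generalisation: classify an unseen example
```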
Using an ANN trained on admission data from preterm neonates, we aimed to: predict the individual neonatal mortality risk of preterm neonates from their admission data; compare the net’s performance with that of a logistic regression model; and characterise survivors despite a high predicted individual mortality risk, and non-survivors despite a low predicted individual mortality risk.
Methods
Our investigation was based on a pre-existing data set of 890 preterm neonates (gestational age < 32 completed weeks and/or birthweight < 1500 g) born between 4 April 1990 and 24 October 1996. Admission and outcome data were available on computer files (tables 1 and 2). Outcome was defined with respect to neonatal death (death within the first 28 days of life).
Comparison of training and test set and validation set patients (binary scaled input and output variables)
Comparison of training and test set and validation set patients (continuous input variables) (5%, 95% centiles in parentheses)
The patient sample was split into a training and test set (n=435) and a validation set (n=455) according to date of birth. The cutoff date was arbitrarily set to 5 November 1993. The training and test set data were subjected to a forward stepwise logistic regression analysis (cutoff p value to include a variable Pin=0.05, cutoff p value to exclude a variable Pout=0.1) (SPSS/PC release 6.1.2, SPSS Inc., Chicago, IL, USA), and were used to train the ANN (Predict, NeuralWare Inc., Pittsburgh, USA). Predict helped to develop a feed-forward, fully connected, three-layer perceptron. The gradient descent learning rule was applied for error propagation.21 22 The ANN input variable selection mode implemented in Predict is based on genetic algorithms and helps to find the optimal combination of inputs.21 The training process was stopped as soon as the prediction error on a subset of the training data set, namely the test set, stopped declining further. After the training process the ANN had settled on 13 input and three hidden processing elements. There was one output processing element.
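The architecture and early stopping scheme described above can be sketched in plain Python/NumPy on synthetic data; this is only an illustrative stand-in for the NeuralWare Predict model, with made-up weights, learning rate, and data.

```python
# Minimal sketch (not the authors' NeuralWare Predict network): a fully connected
# 13-3-1 feed-forward perceptron trained by gradient descent, with training halted
# once the error on a held-out test subset stops improving. All data are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    h = sigmoid(X @ W1 + b1)          # three hidden processing elements
    y = sigmoid(h @ W2 + b2)          # one output: predicted mortality risk in [0, 1]
    return h, y

def train(X_tr, y_tr, X_te, y_te, n_hidden=3, lr=0.1, max_epochs=5000, patience=50):
    n_in = X_tr.shape[1]
    W1 = rng.normal(0.0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1));    b2 = np.zeros(1)
    best_err, best_params, stale = np.inf, None, 0
    for _ in range(max_epochs):
        h, y_hat = forward(X_tr, W1, b1, W2, b2)
        # gradient of the squared error through the sigmoid output
        d_out = (y_hat - y_tr[:, None]) * y_hat * (1 - y_hat)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(X_tr);   b2 -= lr * d_out.mean(0)
        W1 -= lr * X_tr.T @ d_hid / len(X_tr); b1 -= lr * d_hid.mean(0)
        # early stopping: watch the prediction error on the test subset
        _, y_te_hat = forward(X_te, W1, b1, W2, b2)
        test_err = np.mean((y_te_hat[:, 0] - y_te) ** 2)
        if test_err < best_err - 1e-6:
            best_err, best_params, stale = test_err, (W1.copy(), b1.copy(), W2.copy(), b2.copy()), 0
        else:
            stale += 1
            if stale >= patience:
                break                  # error stopped declining: keep the best weights
    return best_params

# Synthetic stand-in for 13 admission variables and a binary outcome (1 = neonatal death)
X = rng.normal(size=(435, 13))
y = (X[:, 0] > X[:, 1]).astype(float)
W1, b1, W2, b2 = train(X[:350], y[:350], X[350:], y[350:])
```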
For validation the inputs were derived from the independent validation data set (6 November 1993 to 24 October 1996). The logistic regression model and the ANN delivered the individual mortality risk of each single preterm neonate. Prediction and actual outcome were compared to assess performance.
For the purpose of performance assessment and method comparison ROC (receiver operating characteristic) curves were used. The outcome predictions as delivered by both models were continuous variables (individual mortality risk). The cutoff point used to transform them into dichotomous mortality predictions was varied to obtain ROC curves.23 24 The maximum likelihood estimation of the binormal ROC curves, and the area under the fitted ROC curves were calculated using LABROC1 and CLABROC software (IBM-PC version 1.2.1; Metz CE et al. 1993, Department of Radiology, Chicago Medical Center, Chicago, IL, USA). CLABROC was used to test corresponding ROC curves for any significant difference (univariate Z-score test of the difference between the areas under the two paired ROC curves, and univariate Z-score test of the difference between sensitivity values on the two paired ROC curves at a selected specificity level25).
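To illustrate how varying the cutoff on a continuous predicted risk yields an ROC curve, the short sketch below computes empirical sensitivity/(1 − specificity) pairs and a trapezoidal area; it does not reproduce the maximum likelihood binormal fit performed by LABROC1/CLABROC, and the four risks and outcomes are invented.

```python
# Empirical ROC sketch: sweep the mortality-risk cutoff from high to low and
# record sensitivity and 1 - specificity at each step; area by the trapezoid rule.
import numpy as np

def roc_points(risk, died):
    """risk: predicted mortality risks; died: 1 if the infant died, 0 otherwise."""
    order = np.argsort(-np.asarray(risk))        # decreasing predicted risk
    d = np.asarray(died)[order]
    sens = np.cumsum(d) / d.sum()                # true positive rate at each cutoff
    one_minus_spec = np.cumsum(1 - d) / (len(d) - d.sum())
    return np.concatenate(([0.0], one_minus_spec)), np.concatenate(([0.0], sens))

def auc(x, y):
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))   # trapezoid rule

fpr, tpr = roc_points([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0])
print(auc(fpr, tpr))                             # 1.0 for this perfectly separated toy case
```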
Surviving patients with an ANN based individual mortality risk of more than 0.50 were matched with infants identical in gestational age and birthweight (within ±5%), but with a predicted individual mortality risk of less than 0.50. Morbidity of the two groups was compared with respect to the variables listed in table 3. Non-survivors with very low predicted individual mortality risk (<0.10) were analysed individually.
Matched pairs analysis of survivors with high (>0.5) or low (<0.5) predicted individual mortality risk*
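The matching step described above can be sketched as follows; the greedy search and the field names (ga for gestational age in weeks, bw for birthweight in grams) are illustrative assumptions, not the study's database schema.

```python
# Hypothetical sketch of the matched-pairs step: pair each survivor with a
# predicted mortality risk > 0.50 with a survivor of risk < 0.50 whose
# gestational age and birthweight both lie within +/-5% of the index infant's.
def within_5_percent(index_value, candidate_value):
    return abs(index_value - candidate_value) <= 0.05 * index_value

def match_controls(high_risk_survivors, low_risk_survivors):
    pairs, used = [], set()
    for case in high_risk_survivors:
        for i, ctrl in enumerate(low_risk_survivors):
            if i in used:
                continue
            if within_5_percent(case["ga"], ctrl["ga"]) and within_5_percent(case["bw"], ctrl["bw"]):
                pairs.append((case, ctrl))       # matched pair found
                used.add(i)
                break
    return pairs

# Example with invented records
pairs = match_controls([{"ga": 26, "bw": 820}], [{"ga": 26.5, "bw": 850}, {"ga": 30, "bw": 1400}])
```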
To test for associations between the variables in the training or validation set univariate tests were performed (χ2 test for categorical variables; t test, or Mann-Whitney U test for continuous variables). A p value of < 0.05 was considered significant.
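For illustration only, the same univariate tests can be run with SciPy instead of the original SPSS procedures; the counts and values below are placeholders, not study data.

```python
# Sketch of the univariate comparisons (chi-squared, t test, Mann-Whitney U)
# with placeholder data; a p value below 0.05 is taken as significant.
from scipy import stats

# categorical variable: illustrative 2x2 table of counts in the two data sets
chi2, p_cat, dof, _ = stats.chi2_contingency([[300, 135], [380, 75]])

# continuous, roughly normal variable: two-sample t test
t_stat, p_t = stats.ttest_ind([1480, 1200, 990, 1310], [1390, 1100, 1010, 1250])

# continuous, skewed variable: Mann-Whitney U test
u_stat, p_u = stats.mannwhitneyu([7.25, 7.31, 7.10, 7.29], [7.28, 7.22, 7.35, 7.18])

print(p_cat < 0.05, p_t < 0.05, p_u < 0.05)
```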
Results
Overall neonatal mortality was 8.7% in the training data set and 7.9% in the validation set; the difference was not significant (table 1). The validation data set differed from the training set in several respects (tables 1 and 2), mainly because of the higher rate of inborn infants in the validation time period.
Stepwise logistic regression identified low gestational age, one minute Apgar score < 5, pH of the first blood gas analysis < 7.10, a life threatening condition on admission, and the presence of congenital malformation as significant risk factors for neonatal mortality. The factor stable condition on admission was included in the model by the forward stepwise algorithm of SPSS, but did not reach significance when the odds ratio was calculated (p = 0.066) (table 4).
Logistic regression model for six predictor variables
The optimal ANN used 13 different input variables (table 5). To investigate the specific impact of individual ANN input variables on neonatal mortality prediction, 14 ANNs were trained, each omitting one or two input variables. An additional ANN was trained with just two inputs—namely, birthweight and gestational age. The resulting areas under the ROC curves reflecting each ANN’s performances are given in table 5.
ANN inputs and area under ROC curves of ANNs with reduced input variable set; comparison to ANN with complete input set
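The variable-omission comparison described above can be sketched as follows, using scikit-learn purely for brevity (the study's networks were built with NeuralWare Predict); each network is retrained with one input dropped and its area under the ROC curve recorded.

```python
# Sketch of the input-ablation comparison: retrain the network once per omitted
# input variable and compare the resulting areas under the ROC curve.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def ablation_aucs(X_train, y_train, X_val, y_val, input_names):
    """X_* are NumPy arrays; input_names labels the columns (e.g. 'birthweight')."""
    aucs = {}
    for i, name in enumerate(input_names):
        keep = [j for j in range(X_train.shape[1]) if j != i]     # drop input i
        net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
        net.fit(X_train[:, keep], y_train)
        risk = net.predict_proba(X_val[:, keep])[:, 1]            # predicted mortality risk
        aucs[name] = roc_auc_score(y_val, risk)                   # AUC without this input
    return aucs
```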
At specificity levels of 80, 85, 90, and 95%, the sensitivity of the ANN predictions was significantly better than that of the logistic regression model (table 6). The area under the ROC curve was 0.954 (SD 0.011) for the ANN, compared with 0.917 (SD 0.017) for the logistic regression model (p = 0.002, area test) (fig 1).
Comparison of prediction accuracy of ANN and logistic regression models
ROC plots: ANN compared with the logistic regression model.
No preterm neonate had an ANN predicted mortality risk of >0.80. Thirteen (68.4%) of the 19 preterm neonates with an individual predicted mortality risk of >0.70 died in the neonatal period. Of the 67 preterm babies with a predicted mortality risk of >0.5, 36 (53.7%) died before 29 days of life and three thereafter (fig 2).
Predicted and observed neonatal mortality: solid column represents ANN; open column represents logistic regression analysis. Absolute numbers of fatal (numerator) and total cases (denominator) of the corresponding patient group are shown.
When morbidity was graded according to grossly abnormal brain ultrasound scans, or to the number of postnatally acquired diseases requiring one or more surgical interventions, survivors with an ANN based mortality risk > 0.50 had significantly higher morbidity than matched survivors with an ANN based mortality risk < 0.50 (p = 0.02) (table 3).
The clinical characteristics, cause of death, and the individual ANN derived mortality risk of the three low risk non-survivors (predicted mortality risk < 0.10) are summarised in table 7.
Case analysis of non-survivors with ANN predicted individual mortality risk < 0.10
Case 1, a preterm infant of 27 weeks gestation, developed respiratory distress syndrome in his first hours of life that was unresponsive to surfactant rescue treatment. Cultures of blood and tracheal fluid grew Escherichia coli. The infant died on his third day of life, presenting with septic shock syndrome.
Case 2 had a positive family history of neonatal death in three of the mother's siblings. Polyhydramnios had been present in each case, and all three had died of respiratory insufficiency. The mother turned out to have myotonic dystrophy. Her infant required aggressive stabilisation immediately after birth, including adrenaline and sodium bicarbonate. After stabilisation the infant was transported to neonatal intensive care. On admission her capillary pH was within an acceptable range (table 7).
Case 3 had a complex cardiac malformation (double outlet right ventricle, hypoplastic pulmonary artery, pulmonary valve stenosis) and respiratory distress syndrome grade IV. Blood pressure and blood pH could be stabilised for only a short period.
In the latter two cases the individual predicted mortality risks increased when predictions were obtained from an ANN trained without information on gestational age and birthweight (table 7).
Discussion
Artificial neural networks have been widely used for outcome prediction.16 Mortality has been predicted with great accuracy for patients on intensive care units, and for patients who received cardiopulmonary resuscitation in hospital (Doig GS, et al. Proceedings of Annual Symposium on Computers Applied to Medical Care, 1993: 361-65).26 27
Based solely on routine data from the first minutes of life, our ANN was capable of accurately predicting individual neonatal mortality risk in preterm neonates despite changes in patient characteristics between the training and validation time periods. The ANN performed better than a logistic regression model.
While the logistic regression model comprised six items, the ANN used 13 items (table 5). The input variable selection of the ANN seems reasonable as most selected variables were reportedly associated with mortality1-4 7-9 28-32 and as all predictive variables isolated by the stepwise logistic regression analysis were also included in the ANN input variable set. Hence subtle and/or non-linear associations between input variables and neonatal mortality not detected by logistic regression analysis might account for the superior performance of the ANN.
Gestational age, birthweight, and condition on admission are the only items that cannot be omitted during the training process without severely compromising performance (table 5). The special impact of the first two of these items on neonatal mortality risk prediction is illustrated by the fact that the area under the ROC curve of the ANN trained with just those two items (table 5) is larger than comparable reports suggest.7-10 We cannot exclude that our neonatologists on duty underestimated the gestational age of some of the neonates, thereby including more mature and “healthier” infants (who are at almost no risk of neonatal mortality and whose mortality risk is easier to predict) in our study population. Furthermore, all comparable studies focused on prediction of inpatient mortality (table 8); thus neonatal mortality may be easier to predict than inpatient mortality.
Comparison of different mortality prediction models
The method to find the optimal ANN input variable set still awaits standardisation despite recent advances in the field.33 34 It seems reasonable to use pruning or genetic algorithms for that purpose. We relied on the intrinsic variable selection mode of Predict,21 but—a feature inherent in genetic algorithms—we cannot be absolutely sure that we have found the variable set with the maximum predictive power.35
There is no easy way to assess the relative impact of the individual ANN input variables. We favoured a pragmatic approach by training several ANNs, subsequently omitting one input of the optimal input variable set, and comparing the different performances.
Twenty-five survivors with a high (> 0.50) ANN derived individual mortality risk underwent a matched pairs analysis with counterparts of lower risk (< 0.50) (table 3). All 50 preterm neonates were very immature (median gestational age 26.3 weeks, median birthweight about 880 g), and many survived only at the expense of major sequelae. The ANN identified a subgroup of preterm neonates with significantly higher morbidity when they survived. Individuals in this subgroup presenting with a predicted individual neonatal mortality risk of > 0.50 had a higher rate of cerebral complications, and developed more conditions that eventually required surgical intervention. These findings agree with those of recent studies on illness severity scores (SNAP, CRIB), in which a positive association between high scoring and the presence of a major cerebral abnormality on ultrasound brain scan was found.7 11 36
Although overall prediction of neonatal mortality is accurate enough for the purpose of intrahospital quality management and risk stratification, for the purpose of individual non-treatment decisions, scoring systems as well as the ANN still fail to differentiate between high risk preterm neonates who will eventually die and those who will survive for at least 28 days. Important factors contributing to preterm mortality risk were probably not included in the development of established scoring systems and in the training data of our ANN. And perhaps those contributing factors (for example, antioxidant status, etc) have not been measured yet.
One probable reason for the prediction failures in the low risk non-survivors of cases 2 and 3 is the tendency of the ANN to underestimate mortality risk in heavier and more mature babies. Our database lacked important information on physiological measures such as cord blood pH value, oxygen requirement in the first hours of life, or the presence of signs of infection. Hence the ANN overestimated the impact of immaturity on mortality. In case 1, sepsis started during day 1 of life and the baby died on day 3. The data collecting period required to detect physiological derangements would have been 12 to 24 hours, the same as in the CRIB or SNAP scores. Our admission data did not adequately reflect individual mortality risk in this case.
Several models that can predict mortality in preterm neonates have been developed.1-3 7-9 37 A performance comparison of these models proved difficult as the studies differ in respect to authors’ intention, study population, number of institutions enrolled, data collecting period, items used to characterise the neonate, primary endpoint, and general applicability (table 8).
Although high impact items were missing in our database, and data collecting was restricted to the prenatal and immediate perinatal periods, the ANN’s performance is equivalent to established scoring systems. In contrast to most other mortality prediction models, we have opted for neonatal mortality instead of inpatient mortality as the endpoint. The reason for this is that individual diseases contributing substantially to mortality after 28 days (late necrotising enterocolitis, central catheter sepsis, respiratory syncytial virus infection, etc.) are unlikely to be associated with admission data. It should be mentioned that in contrast to published reports,2 3 surfactant treatment and high frequency oscillation, both of which can influence mortality,38-40 had already been introduced into routine care during the study period.
We relied on pre-existing admission data. We feel that information on important topics was missing in our database—for example, illness severity in the first hours of life. One of the items used was a very subjective measure of illness severity—the assessment of the overall condition on admission, as performed by the physician on duty. Stevens et al 31 and Richardson et al 8 reported that the physician’s and nurse’s estimation of neonatal mortality risk was highly correlated with the actual mortality risk, or with the SNAP score. The assessment of the overall condition on admission had not been intended to serve as an estimate of mortality risk at the time our data were recorded. Notably, an ANN trained with those subjective assessments made by 30 different physicians delivered reliable results when fed with independent data of the subsequent validation time period, when overall condition on admission was assessed by another 30 physicians. The use of this kind of subjective input variable is a major limitation of the present study. Efforts are under way to substitute those variables with physiological measures, and to broaden the database. The aim is to train an ANN exclusively on objectively measurable physiological data, to render it broadly applicable to other neonatal intensive care units.
A feature intrinsic to an ANN is its retraining potential in case of advances in medical care, or simply in case of additional data becoming available. In contrast to multiple regression analysis, expanding the database allows the ANN to improve its knowledge using non-linear relations between input and output data, whereas multiple regression at some point reaches its intrinsic limitations.20 With multiple regression, the limit of accuracy is determined by two factors—namely, “noise” (missing, inaccurate, or false data), and the degree of non-linearity between the input and outcome variables. With the ANN approach it is noise alone that determines the limit of accuracy. As Richardson and Tarnow-Mordi have pointed out,11 “as sophisticated systems for automatic acquisition of large quantities of routine clinical information become available, emphasis may shift in favour of more complex clinical scoring systems to maximise predictive accuracy.” Clearly, the ability of an ANN to use a very large number of variables from different sources (parameters, settings, and monitoring data from mechanical ventilators, ECG signals or clinical chemistry data) is a distinct advantage. Neural networks are able to include both quantitative and qualitative data into the same model. There are no limitations with respect to ordinal scaled data as there are in logistic regression analysis.
With the recent advances in hardware and software any clinician is able to develop ANNs to depersonalise his/her experience, and to make it accessible to junior colleagues. This can even happen under circumstances where rules are difficult to formulate and situations are too complex to be analysed by classic statistics.16
Trained ANNs predicting morbidity and mortality might help to assure high quality of care through comparison of predicted and actual outcome; to assess the time course of intrahospital advances in treatment; and to warn physicians and nurses, when individual infants are at high risk and deserve intensified attention. But prediction failures tell us that neonatal intensive care medicine is too complex to base an individual no-treatment policy on a prediction that might turn out to be very inaccurate for individual patients.
Acknowledgments
The help of Miss Ulla Ollesch and the staff of the neonatology department at the Vestische Kinderklinik for data entry is gratefully acknowledged. The expert linguistic advice of Heidi Walsh and Neil McIntosh is also very much appreciated. Special thanks to Charles E Metz who kindly provided the software for the ROC curve analyses.
Part of the data in this study are an integral part of Antje Westermann’s thesis.