Value and validity of neonatal disease severity scoring systems
Jon S Dorling (1), David J Field (2)

  1. Nottingham University Hospitals Trust, Neonatal Unit, Nottingham City Hospital, Nottingham, UK
  2. Department of Health Sciences, University of Leicester, Leicester, UK

Correspondence: Professor D Field, Department of Health Sciences, University of Leicester, 22–28 Princess Road West, Leicester LE1 6TP, UK; david.field{at}

For those involved in neonatal care the concept of risk adjustment, in the informal sense, is part of everyday life. We regularly talk to parents about the risk of death for their baby if he or she is born at a particular gestation. Similarly, we are aware that the risk of death as we perceive it can be weighted by other events such as being born with particularly low Apgar scores. The disease severity scoring systems that exist in neonatal care have developed through a process that formalises the assessment of the risks attached to a particular baby. Archives of Disease in Childhood has previously published a review of how such scores are derived, with a commentary on some of the most widely used systems.1

The use of disease severity scores arose first in other specialties, primarily as a means of allowing comparison between heterogeneous groups of patients. For example, how can you compare the efficiency of two adult orthopaedic units if the length of stay in hospital A is significantly longer than in hospital B but the average age of the patients is significantly greater in hospital A? A disease severity score allows such variation in patient mix to be taken into account, so that the two units can be compared fairly with baseline differences in their mix of patients removed. In neonatal care, survival rate was chosen as the most important outcome for comparison and hence most scores were designed to adjust for risk of death, particularly in preterm babies.

Those who have developed the scores have made different decisions about the importance of accuracy of prediction versus complexity of the score. For example, is it better to have a score based on just five factors which can account for 90% of the patient variation, or to use a score based on 15 factors, which accounts for 94% of the variation but is much more time consuming in terms of collecting the relevant data items? Nonetheless, they are all designed to permit “fair” comparison after accounting for individual patient differences, and the systems in current use all provide a “fair” means of adjustment within the parameters for which they were designed. Each score is limited in the group to which it can be applied, a limit that largely reflects the derivation cohort. For example, it is possible to adjust a neonatal population <33 weeks’ gestation for risk of death using the Clinical Risk Index for Babies (CRIB) score, but the same score cannot be used on more mature babies. The Score for Neonatal Acute Physiology (SNAP),2 which can be used for all admissions of any gestation, has been criticised for being derived from a cohort containing relatively few infants weighing less than 1.5 kg. The group of babies at or below 25 weeks’ gestation highlights another problem with the indiscriminate use of disease severity scores. The attitude to providing full intensive care for these babies varies between units and even between clinicians in the same unit. In general, disease severity scores cannot adjust for such factors.

Although a detailed description of model development is beyond the scope of this review, the reliability and validity of scoring systems warrant discussion. Risk correction methods must do what they are set up to do. This is assessed by tests of calibration and discrimination. Calibration describes the accuracy of the test and is ideally assessed by comparing the observed outcome with the predicted outcome within each decile of predicted risk. In other words, the outcome of those with a predicted risk of death of 0–10%, 10–20%, 20–30% and so on is compared with the actual outcome for those groups. Discrimination describes the ability of the score to separate infants into the correct group. This can be assessed using a classification table providing sensitivity, specificity and predictive values; however, this requires a cut-off risk to be specified. An example of a cut-off would be scoring 6 points or more on the Nursery Neurobiologic Risk Score, giving a 100% specificity and 100% positive predictive value for an abnormal outcome at 24 months of age.3 Whilst this is useful for those infants with such a score, it tells us little about infants with lower scores. Indeed, there is a range of scores that individual infants can be given: receiver operating characteristic (ROC) curve analysis allows an assessment of the risk score across the entire range.4 The software varies the cut-off across the possible values, plots the sensitivity (true positive rate) against 1 minus the specificity (false positive rate) at each cut-off, and then calculates the area under the resulting ROC curve. A good model produces a high true positive rate with few false positives and therefore an area close to 1, the value at which the curve encloses the entire graph; an area of 0.5 corresponds to a score that discriminates no better than chance. Values above 0.8 are usually accepted as being adequate. Figure 1 shows an example of an ROC curve with an area of 0.82.

Figure 1 A receiver operating characteristic curve (Az = 0.82, 95% CI 0.78 to 0.85).
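The two checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used by any particular score: the function names (`roc_points`, `area_under_curve`, `calibration_by_band`) and the eight-infant data set are hypothetical and invented purely for the example, and the area is computed with the trapezium rule over the swept cut-offs.

```python
def roc_points(scores, outcomes):
    """Sweep the cut-off over every observed score (outcome 1 = died).

    Returns (false positive rate, true positive rate) pairs, i.e.
    (1 - specificity, sensitivity), starting from (0, 0).
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one death and one survivor")
    points = [(0.0, 0.0)]
    # A score at or above the cut-off counts as "predicted to die".
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezium-rule area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def calibration_by_band(predicted, outcomes, bands=10):
    """Compare expected and observed deaths in equal-width risk bands.

    Returns rows of (low, high, infants, expected deaths, observed deaths);
    empty bands are skipped.
    """
    table = []
    for b in range(bands):
        low, high = b / bands, (b + 1) / bands
        members = [
            (p, y) for p, y in zip(predicted, outcomes)
            if low <= p < high or (b == bands - 1 and p == 1.0)
        ]
        if members:
            expected = sum(p for p, _ in members)
            observed = sum(y for _, y in members)
            table.append((low, high, len(members), expected, observed))
    return table


# Hypothetical data: eight infants, a risk score, a predicted risk, and outcome.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
predicted = [0.05, 0.08, 0.15, 0.25, 0.25, 0.65, 0.72, 0.95]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

print("area under ROC curve:", area_under_curve(roc_points(scores, outcomes)))
for low, high, n, expected, observed in calibration_by_band(predicted, outcomes):
    print(f"risk {low:.0%}-{high:.0%}: n={n}, expected={expected:.2f}, observed={observed}")
```

In a well calibrated score the expected and observed deaths agree closely in every band; a real assessment would of course need far more infants per band than this toy example.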

Scores are derived either from a large population or from expert opinion. They are then validated in a suitable cohort of infants. Successful validation is a vital requirement for a model to be accepted for use in other cohorts. If this process is also repeated in a very different cohort, such as one from another country, this clearly enhances the model’s validity. Validity also depends on the factors that the score uses: are they biologically related to the outcome, easily available, accurately measurable, frequently seen and reliably recorded?5 There is a paucity of evidence regarding the reliability of observations for the commonly used risk scores (ie, that a second observer will obtain the same results). An important topical example is the use of the admission temperature for the CRIB II score.6 This could be measured differently in different units; for example, some units use rectal measurement, others the axilla. As rectal temperature is usually higher, this could lead to systematic bias when comparing these units, as the infants with axillary measurements might be predicted to do worse than if they had had rectal temperature measurement. The use of the admission temperature also highlights the potential for treatment received before admission to affect the score; those units that are good at maintaining temperature during initial resuscitation will have a lower risk score given to their babies by CRIB II. Because the predicted mortality for their babies is lower, they may seem to perform worse than they actually do.

Can scores for predicting mortality such as CRIB be used generically as markers of disease severity for other bad outcomes such as chronic lung disease and poor developmental outcome? Certainly attempts have been made to use such scores more widely, but it is not logical as they were clearly derived for a different endpoint. Where they have been used in this way they have generally been found to have some predictive value, but this is not surprising given the strong influence of birthweight and gestation as components of these scores.7–11 Specific scores designed for other outcomes, principally morbidity such as long-term neurodevelopmental outcome, have been derived but generally they have performed less well than those scores designed to “predict” death. In particular it appears that data from later in the admission are needed to predict neurological outcome accurately.3 This too is not surprising given the huge range of factors that come into play during a long neonatal stay and result in, for example, chronic lung disease or adverse neurodevelopmental outcome. It seems likely that any score used to adjust for risk of such morbidities will need to be based on a large number of factors if it is to perform significantly better than chance. Clearly this would not be an adjustment of baseline characteristics. There is also the issue that for some of these outcomes there is a trade-off with mortality, in that very aggressive treatment may result in survival with severe morbidity in one unit whereas the same situation in another unit would be considered sufficient for withdrawal of intensive care.

Given the above comments it is perhaps reasonable to ask: do disease severity scores tell us anything we want to know? We suggest the answer is yes on two counts. First, the approach of disease severity adjustment is not easily applicable to individual families or babies by, for example, entering their data into a formula (the confidence intervals would be huge); nevertheless, the methodology does provide the best tools for advising families regarding the chances of survival for their baby.

The survival grids, first published in 1999, are widely used in the UK for discussions with families in which premature delivery is considered likely.12 They are derived using just three of the factors most widely recognised as having the greatest influence on survival: birthweight, gestation and gender. However, they provide for most families a clear picture of how these three variables influence the baby’s chances of survival.

Second, they do allow the monitoring of performance of units or even whole networks and thereby provide a crude means of ensuring equity in terms of quality of care. In this role disease severity scores are being used as they were originally conceived; that is, ensuring that quality of care, at least in terms of survival, is comparable between very different settings by adjusting for case mix. Table 1 shows anonymised data from the 2005 Neonatal Survey, which provides this sort of comparison annually for neonatal units in the East Midlands and Yorkshire. Predicted mortality for each unit is derived by weighting the data from each baby in that unit using the CRIB II score. By comparing actual mortality with this predicted mortality it is possible to see if the mortality ratio (actual mortality : predicted mortality) lies outside the 95% confidence intervals (ie, is significantly better or worse than predicted). All the units shown are being compared in this way after adjusting for their case mix, and it can be seen that despite very different mortality rates all are performing within the expected range. It is worth noting that, because activity in some units is low, data aggregated over three years give a more appropriate size of population for this type of comparison: the 95% confidence intervals are narrowed, making significant variation easier to identify.

Table 1 An example of mortality rates for babies <33 weeks’ gestation in 18 hospitals. Expected mortality is based on CRIB II estimates
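The comparison of actual with predicted mortality can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the Neonatal Survey: each baby is taken to have a predicted probability of death already derived from a score such as CRIB II, the expected number of deaths is the sum of those probabilities, and the 95% interval is obtained with Byar's approximation to the Poisson limits. The survey's actual interval method is not described here, and the function names and unit figures below are hypothetical.

```python
import math


def mortality_ratio(observed_deaths, predicted_risks, z=1.96):
    """Return (ratio, lower, upper) for observed / expected deaths.

    Expected deaths = sum of per-baby predicted risks (an assumption).
    The interval uses Byar's approximation; this is an illustrative
    choice, not the survey's documented method.
    """
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    o = observed_deaths
    if o > 0:
        lower = o * (1 - 1 / (9 * o) - z / (3 * math.sqrt(o))) ** 3 / expected
    else:
        lower = 0.0  # no deaths observed: the ratio cannot fall below zero
    o1 = o + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return o / expected, lower, upper


def within_expected_range(result):
    """True when the interval includes 1, i.e. no significant difference."""
    _, lower, upper = result
    return lower <= 1.0 <= upper


# Hypothetical unit: 100 babies a year, each with a predicted risk of 10%.
one_year = mortality_ratio(12, [0.10] * 100)
# The same unit with three years of data pooled (same underlying rate).
three_years = mortality_ratio(36, [0.10] * 300)

for label, result in (("one year", one_year), ("three years", three_years)):
    ratio, lower, upper = result
    print(f"{label}: ratio {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), "
          f"within expected range: {within_expected_range(result)}")
```

With the same underlying ratio, the pooled three-year interval is narrower than the single-year interval, which is why aggregation makes genuine outliers easier to identify in small units.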

This type of approach has also been useful in understanding the influence of other factors such as transfer. At one time postnatal transfer was considered hazardous because of an apparently higher mortality compared with babies of the same gestation whose whole course had been in one unit. However, crude comparisons of mortality such as this were, of course, entirely misleading since the groups of babies selected for postnatal transfer were identified, in general, because of the severe problems they had. By applying disease severity scoring it was possible to identify that this group of babies had a higher predicted mortality compared with the whole, unselected population of the same gestation that remained within a neonatal unit, some of whom would have high disease severity and some of whom would not.

Finally, it is worth considering who wants these data. Certainly enthusiastic individual clinical teams are keen to ensure that their own performance is of an adequate standard. However, increasingly it is those groups with responsibility for supervision and oversight that want a means of providing “quality assurance”. Such data are already available in relation to cardiac surgery for the whole of the UK, and can be provided surgeon by surgeon.13 For the East Midlands and Yorkshire, unblinded survival data, corrected for disease severity, are already being published annually for each unit based on babies <33 weeks’ gestation.14 At present the National Neonatal Audit Project does not include data on mortality; however, once a national mechanism exists in the UK for transfer of the relevant individual patient details, producing unit by unit survival figures, certainly for babies <33 weeks’ gestation, would be a natural next step.


  • Competing interests: None.
