Validity of neurodevelopmental outcomes of children born very preterm assessed during routine clinical follow-up in England
  1. Hilary S Wong1,2,
  2. Frances M Cowan2,
  3. Neena Modi2
  4. On behalf of the Medicines for Neonates Investigator Group
    1. 1 Department of Paediatrics, University of Cambridge, Cambridge, UK
    2. 2 Section of Neonatal Medicine, Department of Medicine, Chelsea and Westminster Hospital Campus, Imperial College London, London, UK
    1. Correspondence to Dr Hilary S Wong, Department of Paediatrics, University of Cambridge, Box 116, Cambridge Biomedical Campus, Cambridge CB2 0QQ, UK; syw27{at}


    Objective To determine the validity of assessing and recording the neurodevelopmental outcome of very preterm infants during routine clinical follow-up in England.

    Design Children born <30 weeks gestation, attending routine clinical follow-up at post-term ages 20–28 months, were recruited. Data on neurodevelopmental outcomes were recorded by the reviewing clinician in a standardised format in the child’s electronic patient record, based on a set of key questions designed to be used without formal training or developmental testing. Using a predefined algorithm, each participant was classified as having ‘no’, ‘mild/moderate’ or ‘severe’ impairment in cognitive, communication and motor domains. All participants also received a research assessment by a single assessor using the Bayley Scales of Infant Development, third edition (Bayley-III). The sensitivity and specificity of routine data in capturing impairment (any Bayley-III score <85) or severe impairment (any Bayley-III score <70) were calculated.

    Results 190 children participated. The validity of routine assessments in identifying children with no impairment and no severe impairment was high across all domains (specificities 83.9%–100.0% and 96.6%–100.0%, respectively). However, identification of impairments, particularly in the cognitive (sensitivity 69.7% (55.1%–84.3%)) and communication (sensitivity 53.2% (42.0%–64.5%)) domains, was poor.

    Conclusions Neurodevelopmental status determined during routine clinical assessment lacks adequate sensitivity in cognitive and communication domains. It is uncertain whether this reflects the assessment and/or the recording of findings. As early intervention may improve educational and social outcomes, this is an important area for healthcare quality improvement research.

    • preterm birth
    • neurodevelopment
    • electronic health records

    What is already known on this topic?

    • Postneonatal intensive care outcome monitoring is necessary for patient care and to assess the long-term sequelae of preterm birth, but population outcome data are limited.

    • Routine neonatal electronic health records may be a potential source of population neurodevelopmental outcome data.

    What this study adds?

    • The utility of neurodevelopmental data, routinely recorded in electronic patient records in England, is limited by poor sensitivity in identifying children with impairment.

    • Routine assessments have inadequate sensitivity in identifying impairments in the cognitive and communication domains, areas where early intervention may improve educational and social outcomes.

    • Standardised assessments and validated documentation tools minimise inter-rater variability and judgement bias during neurodevelopmental evaluation; their use is an important area for healthcare quality improvement research.


    Postneonatal intensive care follow-up is necessary to monitor the long-term sequelae of preterm birth. The British Association of Perinatal Medicine (BAPM) recommends follow-up assessments up to 2 years post-term age for children born under 32 weeks gestation or with a birth weight below 1500 g.1 The UK Audit Commission proposed that all neonatal units collect data in a nationally agreed format.2 Yet, on a country-wide basis, neonatal outcome data remain limited.3

    In the 1990s, several studies reported that the routine information systems in place at that time were inadequate for neonatal outcome monitoring.4–6 Since then, all UK neonatal units have employed electronic patient record (EPR) systems, with the BadgerNet platform7 being the most widely used. This electronic system includes a ‘2-year dataset’ developed by the Thames Regional Outcomes Group with standardised definitions that allow the classification of preterm children into categories of neurodevelopmental outcome (normal, mild/moderate or severe impairment), based on a set of key questions designed to be used without the need for specific training of the assessor or formal neurodevelopmental testing of the child.1 In 2010, the National Neonatal Research Database (NNRD) containing variables extracted from the EPR systems was created. The NNRD has complete population coverage of all neonatal unit admissions, is used by the Royal College of Paediatrics and Child Health National Neonatal Audit Programme (NNAP)8 and is a potentially valuable outcome data source. However, the approach to neurodevelopmental assessment in clinical practice varies widely. Children may be assessed by health professionals from different backgrounds and levels of experience using varying assessment methods. Often, assessments are based on clinical judgement without standardised methodology. Therefore, the quality and utility of these routinely recorded outcome data are uncertain.

    We aimed to determine the validity of neurodevelopmental outcomes based on data recorded during routine follow-up assessments through comparison with outcomes determined by assessments conducted using the Bayley Scales of Infant and Toddler Development, third edition (Bayley-III).9


    Study design and recruitment

    This was a cross-sectional study conducted in 13 hospitals. Eligible participants were children born before 30 weeks gestation attending routine clinical follow-up between the post-term ages of 20 and 28 months. Exclusion criteria were children who had received Bayley-III assessments in the past 6 months, to prevent ‘practice effect’ bias, and children from non-English-speaking families. Between June 2010 and July 2012, the parents of eligible participants were invited to join the study by post before their clinical appointment. Parents were also approached by the research team and recruited during their appointment. Informed written consent was obtained.

    Study procedures

    Research assessment

    All participants were assessed using the Bayley-III scales by a single accredited assessor (HSW), who was blinded to the child’s clinical details. Validation checks of the assessor’s reliability and consistency through independent scoring of three assessments by expert assessors confirmed high interobserver agreement (items in agreement: 97.2%–98.6%). We aimed for the research assessment to be conducted within 1 month of the participant’s clinical assessment. The Bayley-III yields norm-referenced scaled scores (mean 10, SD 3) for five subtests (Cognitive, Receptive/Expressive Communication, Fine/Gross Motor) and composite scores (mean 100, SD 15) for three scales (Cognitive, Communication and Motor).

    Clinical assessment

    Clinicians, blinded to the results of the Bayley-III assessment, undertook the participants’ clinical assessment. Information on each child’s development was recorded as responses to 16 dichotomous yes/no questions (see online supplementary table 1) on the EPR. Entry of this information is a requirement for participation in the NNAP.8

    Supplementary file 1

    Data extraction

    EPR data on all neonatal admissions in England were included in the NNRD with permission from the Caldicott Guardians of each NHS Trust, the UK National Research Ethics Service (10/80803/151) and the Health Research Authority Confidentiality Advisory Group (8-05(f)/0210). With parental consent, we extracted neonatal and 2-year assessment information on each participant from the NNRD. For the purpose of assessing selection bias, anonymous demographic and clinical data were also extracted for all infants born between 1 January 2008 and 31 December 2010, at gestational ages below 30 weeks, and discharged from the participating study sites (the ‘baseline population’).

    Impairment classification

    Based on their EPR data, participants were assigned a neurodevelopmental outcome category of ‘no’, ‘mild/moderate’ or ‘severe’ impairment for each domain according to a classification algorithm developed for the NNAP in accordance with definitions established by a BAPM/NNAP working group (BAPM/NNAP criteria; online supplementary figure 1).1 There is no specific cognitive domain in the EPR. However, as the questions about ‘developmental level’ were designed with the intention of capturing the child’s cognitive function (author correspondence with the Thames Regional Perinatal Group), the EPR data in this category were used to assign the participants’ cognitive outcome and used for comparison with the Bayley-III cognitive scores from the research assessment. The BAPM/NNAP criteria originally specified ranges of standardised developmental scores or quotients for the classification of cognitive impairment. However, as the EPR data were intended to be collected without requiring formal neurodevelopmental assessment, the key questions in the development (cognitive) domain were modified so that impairment severity was based on ‘how many months behind’ each child was and not developmental scores. An overall level of impairment was defined based on the worst outcome from all domains.

    Participants were also classified into three categories according to their Bayley-III scores from the research assessment: ‘higher than −1 SD’ (no impairment), ‘−1 to −2 SD’ (mild/moderate impairment) and ‘lower than −2 SD’ (severe impairment) from the normative mean. The worst category of outcome assigned through scaled and composite scores was used, and the overall outcome of each participant was based on the worst category of impairment from all domains.
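The score-based classification described above can be sketched in a few lines of Python. This is only an illustration, not the study's own code: the function and class names are hypothetical, the composite cut-offs (<85 and <70) follow the abstract, and the handling of scores falling exactly on a boundary, as well as the equivalent scaled-score cut-offs (<7 and <4), are assumptions derived from the stated means and SDs.

```python
from enum import IntEnum


class Impairment(IntEnum):
    """Ordered outcome categories; a higher value means a worse outcome."""
    NONE = 0
    MILD_MODERATE = 1
    SEVERE = 2


def classify(score, mean, sd):
    """Categorise a norm-referenced score by its distance below the mean.

    Assumption: a score exactly at -1 SD or -2 SD falls in the milder category.
    """
    if score < mean - 2 * sd:
        return Impairment.SEVERE
    if score < mean - sd:
        return Impairment.MILD_MODERATE
    return Impairment.NONE


def domain_outcome(composite, scaled_scores):
    """Worst category across a domain's composite score and its subtest scaled scores."""
    categories = [classify(composite, mean=100, sd=15)]
    categories += [classify(s, mean=10, sd=3) for s in scaled_scores]
    return max(categories)


def overall_outcome(domain_outcomes):
    """Overall outcome is the worst category from all domains."""
    return max(domain_outcomes)
```

For example, a hypothetical child with a communication composite of 90 but an expressive scaled score of 3 would be placed in the severe category for that domain, because the worst category across scaled and composite scores is taken.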

    Since the information captured through the questions on the EPR and the Bayley-III differ, we aimed to assess the internal validity of comparing routine and research assessments where different outcome classification methods were employed. To do this, we compared the concordance of outcome classification by Bayley-III scores and BAPM/NNAP criteria, using only observations made at the research assessment. For the motor and communication domains, the BAPM/NNAP criteria outcome measures were either quantifiable (eg, the number of meaningful words or signs) or clear cut (eg, ability to sit or walk) and could be judged objectively during the research assessment. The cognitive domain, based on perceived extent of developmental delay, was more subjective. Although a ‘developmental age equivalent’ score can be derived from the Bayley-III raw score, this is a ‘typicality’ score that indicates the average age at which a given raw score is typical and does not specify an individual child’s functional level. Hence, we felt that an equivalent variable could not be objectively derived for a sufficiently robust comparison.

    Statistical analysis

    The classification of outcomes from the clinical assessment was cross-tabulated against the research assessment. Taking the research data as the ‘gold standard’, the sensitivities and specificities of the clinical data in identifying children with any impairment and severe impairment were calculated. Robust standard errors were used to calculate the 95% CIs to account for potential clustering by study sites. Analyses were repeated on all singleton births and one randomly selected child from each multiple birth set, to examine the effect of correlated outcomes within multiple birth sets. Stratified analysis was performed to examine the association between demographic factors, neonatal factors and follow-up methods and the validity of the clinical data.
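As a rough illustration of the calculation described above, the following Python sketch derives sensitivity and specificity from a 2×2 cross-tabulation of clinical against research classifications. The function names and the counts in the usage note are hypothetical. The interval shown is a simple Wald (normal approximation) interval; it does not reproduce the robust, cluster-adjusted standard errors used in the study, so it would generally give somewhat different limits.

```python
import math


def proportion_with_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using a simple Wald interval.

    Simplification: no adjustment for clustering by study site.
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity_specificity(tp, fn, fp, tn):
    """Validity of the clinical data, taking the research assessment as reference.

    tp: impaired on both assessments
    fn: impaired on research assessment only (missed by clinical data)
    fp: impaired on clinical data only
    tn: not impaired on either assessment
    """
    return {
        "sensitivity": proportion_with_ci(tp, tp + fn),
        "specificity": proportion_with_ci(tn, tn + fp),
    }
```

With purely hypothetical counts, `sensitivity_specificity(tp=30, fn=10, fp=5, tn=95)` gives a sensitivity estimate of 0.75 and a specificity estimate of 0.95, each with its own interval.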

    The concordance between the two classification methods, judged by the assignment of impairment through the research assessment, was measured with Cohen’s kappa statistic (κ).
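For readers unfamiliar with the statistic, a minimal sketch of Cohen's kappa for two classifications of the same children follows. It assumes the unweighted form of kappa (the text does not state whether a weighted version was used), and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of children given the same category by both methods.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each method's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both methods use a single category")
    return (observed - expected) / (1 - expected)
```

Kappa equals 1 when the two methods agree on every child and 0 when agreement is no better than expected by chance.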

    A precision analysis for the estimated sensitivity of the clinical assessment in identifying children with Bayley-III scores below −2 SD from the normative mean was used to calculate the target sample size. We intended to recruit a stratified sample with higher proportions of children born at lower gestations and therefore at higher risk for impairment, to improve the precision of the study while maintaining a practical sample size. Assuming that 25% of children born at or before 25 weeks (higher risk), 15% of children born at 26–27 weeks (medium risk) and 2% of children born at 28–29 weeks gestation (lower risk) achieve Bayley-III scores lower than −2 SD and the sensitivity is the same for all risk groups, we aimed to recruit 500 children (200 each from the higher-risk and medium-risk groups and 100 from the lower-risk group), to achieve a precision of CI half-width ±10% for an estimated sensitivity of 80%.
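The arithmetic behind this target can be checked with a short sketch. It uses the stratum sizes and assumed impairment rates stated above, but the formula is an assumption: a plain normal-approximation interval for a proportion, without any allowance for clustering, which may not be the exact method the authors used. All names are hypothetical.

```python
import math

# (planned recruits, assumed proportion scoring below -2 SD), as stated in the text
STRATA = {
    "higher_risk": (200, 0.25),
    "medium_risk": (200, 0.15),
    "lower_risk": (100, 0.02),
}


def expected_cases(strata):
    """Expected number of children with severe impairment across all strata."""
    return sum(n * rate for n, rate in strata.values())


def ci_half_width(sensitivity, n_cases, z=1.96):
    """Half-width of a normal-approximation 95% CI for a sensitivity estimate."""
    return z * math.sqrt(sensitivity * (1 - sensitivity) / n_cases)


cases = expected_cases(STRATA)             # 50 + 30 + 2 = 82 expected cases
half_width = ci_half_width(0.80, cases)    # roughly 0.087
```

Under these simplifying assumptions, about 82 severely impaired children would be expected among 500 recruits, giving a half-width of roughly ±8.7 percentage points for a sensitivity of 80%, which is inside the ±10% target.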

    Differences between categorical variables were analysed using Pearson’s χ2 test; continuous variables were compared using the Student’s t-test or Mann-Whitney U test. All statistical analyses were performed using Stata statistical package V.11.0.


    Study population

    Two hundred and eight children were recruited to this study. One child with ataxic cerebral palsy could not be fully assessed, and three children did not cooperate with the assessment. Three children did not attend their clinical follow-up visit, and there were missing electronic data in 24 cases. One hundred and ninety children on whom both research and clinical data were available in at least one outcome domain, including the four children with incomplete assessment, form the study cohort (figure 1). We concluded the study with a sample size smaller than the target because the interim analysis established that the sensitivity estimate was lower than predicted and that the desired precision would require a sample of at least 680 children, which could not be achieved without significantly prolonging the study. The characteristics of the study population are summarised in table 1.

    Figure 1

    Flow chart of children through research and routine assessments to form the study population.

    Table 1

    Characteristics of study population versus non-participants born <30 weeks gestation in 2008–2010 and discharged from the study sites

    Validity of clinical data

    The mean (SD) post-term age at clinical assessment was 24.4 (2.3) months. The research assessment took place at a median (IQR) interval of 8 (0–27) days after the children received their clinical assessments. Outcome data were entered into the EPR by clinical consultants for 111 children (36 neonatologists, 42 general paediatricians, 33 community paediatricians, together comprising 58.4%), junior doctors in 73 (38.4%) and administrative staff in 6 (3.2%) cases. Only 67 (35.3%) children were assessed using standardised or screening assessment tools (Griffiths Mental Development Scales (n=44), Schedule of Growing Skills (n=19), Alberta Infant Motor Scale (n=4)) during their routine review.

    The estimated sensitivities and specificities of the clinical data in each developmental domain are presented in table 2. This is based on the classification of impairment from the Bayley-III scores and includes data from all children as sensitivity analyses revealed that potential correlated outcomes from siblings did not affect the result. The estimated specificities were high across all domains. However, the validity in identifying and categorising children with impairments was variable. The sensitivities for gross motor impairment were high, particularly when the impairment was severe. In the cognitive domain, the sensitivity for the identification of any impairment was 69.7% (95% CI 55.1% to 84.3%) but dropped to only 28.6% (5.0% to 52.2%) for severe impairment. Agreement between clinical and research information was worst in the communication domain; the sensitivity for any receptive communication impairment was only 23.1% (6.7% to 39.5%). Overall, more children were classified as having an impairment through the research compared with the ‘routinely acquired’ clinical data (figure 2). These findings were consistent when the comparison was repeated with the research outcomes classified using the BAPM/NNAP criteria (see online supplementary table 2). The results of the stratified analyses are presented as a series of clustered bar charts (see online supplementary figures 2–10). We observed possible increased sensitivity in identifying cognitive impairment if a standardised test was used during clinical assessment (see online supplementary figure 8), although this was not statistically significant (CIs of the sensitivity estimates overlap). There was no clear effect on the validity of the clinical data of gestational age at birth, sex, supplemental oxygen requirement at 36 weeks postmenstrual age, maternal Index of Multiple Deprivation, exposure to the English language, age at assessment, seniority of clinical assessor or time interval between clinical and research appointments.

    Figure 2

    Classification of the severity of neurodevelopmental outcome by clinical and research assessments.

    Table 2

    Comparison of the categorisation of impairment by clinical assessment (British Association of Perinatal Medicine/National Neonatal Audit Programme criteria) and research assessment (Bayley-III scores) and the sensitivities and specificities of clinical data in identifying children with any impairment and severe impairment

    Concordance between the classification criteria

    The comparison of impairment classification by Bayley-III scores and the BAPM/NNAP criteria, based only on findings at the research assessment, is shown in table 3. There was moderate agreement in the communication domain (κ 0.59 (0.49–0.68)) and substantial agreement in the motor domain (κ 0.76 (0.58–0.87)).

    Table 3

    Concordance in the classification of impairment by Bayley-III scores and the BAPM/NNAP criteria


    We found that information recorded at a routine clinical visit in England is of insufficient accuracy for identifying very preterm children with neurodevelopmental impairment at 2 years of age when compared with a research-standard assessment. However, agreement between routine clinical and research categorisations was strong in the absence of impairment (high specificity). We estimate that, using routine data, approximately 30% of children with at least mild cognitive impairment and nearly 50% with at least mild communication impairment would be classified as having no impairment. Therefore, substantial numbers of children with clinically relevant impairment who would benefit from early interventions are likely to go unrecognised. Further, our findings call into question population estimates of impairment prevalence that are based on current processes.

    The strengths of our study include multicentre involvement allowing different postdischarge follow-up practices and children from a variety of backgrounds to be included. A single assessor conducting the research assessment eliminates inter-rater variation. The failure to achieve the target sample size is a limitation resulting in a wider margin of error for the sensitivity estimates. The study population consisted of proportionally more white children with less neonatal morbidity and/or living in less deprived areas than the baseline population. This selection bias could have been introduced by attrition of children from routine NHS follow-up, non-random recruitment and the exclusion of non-English-speaking children; thus, the study population may be at lower risk for impairment than the target population.

    The disagreement between clinical and research categorisations could partly be due to the methodological differences between the Bayley-III scores and the BAPM/NNAP criteria. The BAPM/NNAP criteria do not make allowance for participants younger or older than 24 months post-term age, and in our study, only 38 (20%) children attended their routine appointment exactly at age 24 months. The EPR questions were based on criteria intended to identify children with major functional loss who would experience severe lifelong disabilities.10 Therefore, some of these questions may not have sufficient sensitivity to identify children with milder difficulties. Additionally, on the EPR, the use of the term ‘development’ to refer to the cognitive domain is misleading. It is possible that clinicians might assign a development level influenced by their observations of motor and language skills, leading to misclassification. However, we found good concordance between the two classification methods in the communication and motor domains, implying that the structural and content differences could not fully explain the low sensitivities observed.

    Only 35% of children were assessed using a standardised assessment tool during their routine visit. In a study comparing the diagnosis of cognitive impairment at age 5 years made using an intelligence test versus judgements by paediatricians, agreement was only fair (κ 0.39).11 Even if standardised assessments are used, the agreement between different tools in classifying impairment is unknown, as very few comparative studies have been undertaken. Chaudhary et al reported that at 22 months, children scored 5 points higher on the Bayley Scales of Infant Development, second edition Mental Developmental Index than on the Griffiths Scales developmental quotient.12

    The use of a standardised developmental assessment is often considered too time consuming and expensive to adopt into routine practice. We also found that outcome assessment by senior assessors did not enhance impairment detection. Therefore, other cost-effective follow-up care models with higher precision in impairment recognition are required. Standardised parent-completed questionnaires have been developed as an inexpensive alternative. In particular, the Parent Report of Children’s Abilities13 14 has been shown to have strong psychometric properties when validated against the Bayley Scales14 15 for use with very preterm children. However, the response rates achieved through parent-completed questionnaires have been low, such as the 51.8% obtained by Field et al.16 With the increasing recognition of allied healthcare professionals providing care in enhanced roles, it may be possible to develop specialist roles, for example, for psychologists, to undertake reliable standardised neonatal outcome assessments and also to extend the roles of health visitors to capture developmental data of children born preterm during developmental screening in the universal child health surveillance programme,17 although the feasibility and acceptability of this practice have not been examined.

    Neonatal neurodevelopmental outcome information is recorded for multiple purposes. For the individual, clinical follow-up focuses on identifying children who are experiencing difficulties and may require intervention. The routine use of an expensive, time-consuming standardised assessment tool for developmental screening might be unjustified. In contrast, population-level data need to be valid and reliable to facilitate analyses of trends, distributions of impairment prevalence and impairment determinants and to assist in health provision planning. Here, a standardised assessment tool that provides a numerical ‘score’ that would allow comparison between populations and over time would be ideal. The establishment of a coordinated neonatal follow-up programme at a regional or national level, in which children are assessed at uniform ages with a common set of tools, would be in theory an ideal means to obtain high-quality outcome data. However, even with such a robust process, the Swiss national follow-up programme was still limited by an attrition rate of 19% at 2 years.18 In England and Wales, electronic 2-year outcome data were only available for 60% of eligible infants born before 30 weeks gestation in 2015.19

    Assessments at 2 years of age underestimate the likelihood of school-age cognitive difficulties, even when performed by accredited assessors using standardised tools.20 Structured assessment for very preterm children and other vulnerable groups, at ages beyond 2 years, by appropriately trained personnel, requires urgent consideration as a standard of care.

    In conclusion, currently available clinical information obtained at routine follow-up is of inadequate precision for both patient care and population reporting. Our study illustrates the importance of careful evaluation of efficacy and effectiveness prior to large-scale implementation of any approach and emphasises the urgent need for the development of reliable processes for population-based neurodevelopmental assessments of children born preterm.


    The authors thank the staff from the participating hospitals (Chelsea & Westminster, Ealing, Hillingdon, Homerton, Newham, North Middlesex, Northwick Park, Queen’s (Romford), Rosie (Cambridge), Royal London, St Thomas’, West Middlesex and Whipps Cross) for assistance with recruitment and in capturing electronic data, and the Neonatal Data Analysis Unit, Imperial College London team (Eugene Statnikov, Daniel Gray (data analysts), Shalini Santhakumaran (statistician) and Richard Colquhoun (manager)) for data management and administrative support.




    • Contributors HSW, FMC and NM contributed to the conception and design of the study and analysis and interpretation of data. HSW drafted the manuscript; FMC and NM reviewed it critically for important intellectual content. All authors read and approved the final manuscript.

    • Funding This paper presents independent research funded by the National Institute for Health Research (NIHR) under its Programme Grants for Applied Research Programme (Grant Reference Number RP-PG-0707-10010). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

    • Competing interests None declared.

    • Ethics approval Royal Free Hospital Research Ethics Committee (REC 10/H0720/35).

    • Provenance and peer review Not commissioned; externally peer reviewed.

    • Data sharing statement All available data can be obtained by contacting the corresponding author.

    • Collaborators Medicines for Neonates Investigators are Deborah Ashby (Imperial College London), Peter Brocklehurst (University of Birmingham), Kate Costeloe (Queen Mary University of London), Elizabeth Draper (University of Leicester), Jacquie Kemp (London), Azeem Majeed (Imperial College London), Neena Modi (Imperial College London), Stavros Petrou (University of Warwick), Alys Young (University of Manchester), Jane Abbott and Zoe Chivers (Bliss, London).

    • Presented at This study was presented at the British Association of Perinatal Medicine section of the Royal College of Paediatrics and Child Health 2015 Annual Conference in Birmingham, UK.
