
Measuring neurodevelopmental outcome in neonatal trials: a continuing and increasing challenge
  1. Neil Marlow
  1. Correspondence to Professor Neil Marlow, UCL Institute for Women's Health, 74 Huntley Street, London WC1E 6AU, UK; n.marlow{at}


The need for outcome evaluations as part of clinical trials has never been greater. In this paper, issues around the design of such outcome evaluations are discussed, including how the data may best be collected and the options available. There is a need for organisation of such evaluations and consistency of measures between trials to optimise efficiency.

  • Child Psychology
  • Data Collection
  • Neonatology
  • Neurodevelopment
  • Neurodisability


The mantra from a popular allegorical TV series declaims ‘I am not a number, I am a free man!’1 This neatly encapsulates the dilemma of clinical aftercare and research-based outcome evaluation following neonatal intensive care: for research and clinical audit, we need a number or categorisation that we can compare across time and between individuals and study groups; for clinical practice, we need a careful evaluation of the strengths and weaknesses of a child's developmental profile in order to identify areas where support may be required. Research-based evaluations are often used for both purposes, and clinical assessments are often used to describe research outcomes, although they are frequently not designed as such and lack the rigour of research-based measures.

Over the past decade research-based evaluations have become much more frequent as primary or coprimary outcomes, while our once high follow-up rates2 have fallen, whether because of research governance constraints3 or because of unwillingness on the part of parents to complete the study. Complete follow-up is important because bias may accrue from considering only those children who are included in outcome reports (dropouts may have more problems than the children reported4,5), and the use of imputation to estimate outcomes for the whole population is complex6–8 and may lead to reduced confidence in the findings. Furthermore, where in the past we were able to train and accredit assessors, the number of babies requiring research-based evaluations has mushroomed as NHS-contracted time has shrunk, making the older strategies less effective; thus it is essential to re-evaluate what we are doing.
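The effect of incomplete follow-up on a reported impairment rate can be illustrated with simple arithmetic. The sketch below is not taken from any of the cited studies; the function names and the example numbers (1000 enrolled, 800 assessed, 80 impaired) are hypothetical and chosen only to show how far the whole-cohort rate could differ from the rate among children actually seen.

```python
def observed_and_bounds(n_enrolled, n_assessed, n_impaired_assessed):
    """Return (observed rate, lower bound, upper bound) for the impairment rate.

    observed: rate among the children who were assessed
    lower:    whole-cohort rate if no child lost to follow-up is impaired
    upper:    whole-cohort rate if every child lost to follow-up is impaired
    """
    if not 0 < n_assessed <= n_enrolled:
        raise ValueError("n_assessed must be between 1 and n_enrolled")
    if not 0 <= n_impaired_assessed <= n_assessed:
        raise ValueError("n_impaired_assessed must be between 0 and n_assessed")
    n_missing = n_enrolled - n_assessed
    observed = n_impaired_assessed / n_assessed
    lower = n_impaired_assessed / n_enrolled
    upper = (n_impaired_assessed + n_missing) / n_enrolled
    return observed, lower, upper


def whole_cohort_rate(n_enrolled, n_assessed, n_impaired_assessed, dropout_rate):
    """Whole-cohort rate under an ASSUMED impairment rate among dropouts."""
    n_missing = n_enrolled - n_assessed
    return (n_impaired_assessed + dropout_rate * n_missing) / n_enrolled


# Hypothetical example: 80% follow-up, 10% impairment among those seen.
observed, lower, upper = observed_and_bounds(1000, 800, 80)
print(f"observed {observed:.2f}, possible range {lower:.2f} to {upper:.2f}")

# If dropouts were (by assumption) twice as likely to be impaired:
print(f"whole cohort {whole_cohort_rate(1000, 800, 80, dropout_rate=0.20):.2f}")
```

With 20% of the cohort missing, the range of rates compatible with the data is wide, which is one way of seeing why high follow-up rates matter more than any later statistical adjustment.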

A significant portion of the difficulty is related to our success in obtaining research funding and better study design. Current UK trials are recruiting large numbers of children who need outcome evaluations, for example: progesterone to prevent preterm birth (OPPTIMUM), 1250 participants; benefits of oxygen saturation targeting (BOOST-II UK), 1000 participants; iodine supplementation study (I2S2), up to 1700 participants; hypotension in preterm infants (HIP), 800 participants; alongside population-based studies such as EPICure 2 (1000 participants) and late and moderate preterm birth (LAMBS; 2000 participants). This is not an exhaustive list of studies recruiting more than 500 participants! Furthermore, many studies enrol children who would not fall within the recommended populations for routine follow-up or have comparator groups that require developmental assessment; this has become even more important since confidence in the major developmental measure has somewhat waned with its new edition (see below).9–12

Measuring outcomes for randomised trials or estimating impairment in the population are critical parts of current research strategies. Such evaluations may be part of efficacy studies, where the goal is to improve outcomes, such as in the recent magnesium13 or indomethacin trials,14 or may be used to establish that treatments are safe, such as in caffeine trials15 and several other current trials. There is a reasonable consensus internationally about the definition of neurodevelopmental impairment at 18–24 months, which usually includes a low standardised developmental test score, the presence of cerebral palsy with motor impairment, severe visual impairment or profound/severe hearing loss.
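A composite definition of this kind can be written down as a simple rule. The following is a minimal sketch, not a published algorithm: the class and field names are hypothetical, and the score cut-off of 70 is an assumption (2 standard deviations below a mean of 100 with a standard deviation of 15) that individual trials may define differently.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: cut-off of 70 = mean 100 minus 2 SD of 15; trials may differ.
ASSUMED_SCORE_CUTOFF = 70.0


@dataclass
class TwoYearAssessment:
    """Hypothetical record of one child's outcome data at 18-24 months."""
    developmental_score: Optional[float]  # None if the test was not completed
    cerebral_palsy_with_motor_impairment: bool
    severe_visual_impairment: bool
    severe_or_profound_hearing_loss: bool


def is_impaired(a: TwoYearAssessment,
                score_cutoff: float = ASSUMED_SCORE_CUTOFF) -> Optional[bool]:
    """Return True/False, or None when the child cannot be classified."""
    if (a.cerebral_palsy_with_motor_impairment
            or a.severe_visual_impairment
            or a.severe_or_profound_hearing_loss):
        return True  # any one neurosensory criterion is sufficient
    if a.developmental_score is None:
        return None  # no neurosensory impairment, but the score is missing
    return a.developmental_score < score_cutoff


print(is_impaired(TwoYearAssessment(95.0, False, False, False)))
print(is_impaired(TwoYearAssessment(None, False, False, False)))
```

Returning `None` rather than `False` for a child with no test score keeps unclassifiable children visible in the analysis instead of silently counting them as unimpaired.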

This paper will discuss approaches to follow-up, what ages are appropriate and the extent to which these should be incorporated routinely into clinical practice.

Different approaches to follow-up in late infancy

For use as outcome measures for trials, assessments should have certain key characteristics. The outcomes defined should be an accurate reflection of a child's capabilities and relevant to the study question, and the measures used need to be reliable, repeatable and preferably predictive of later childhood, even adult, impairment. These requirements are a tall order, as assessing neurodevelopmental outcome is not an exact science and observing a brief snapshot of an individual child's performance in a strange setting may not be a good reflection of their abilities. Conventionally child development is assessed at either 18–22 months of age corrected for preterm birth (North America) or at 24 months corrected age in Europe/Australasia. Developmental assessment at around 2 years can be lengthy and difficult to administer and there is much debate about the accuracy of predicting outcomes from such a young age.16 Developmental measures at 2 years may show much social bias, which again is challenging for studies where we are often looking for small and subtle differences. A consensus scheme for categorising outcomes in the UK has been published (table 1).17
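Age corrected for preterm birth, as used above, is conventionally the chronological age minus the time by which the child was born before term. The sketch below assumes term is taken as 40 completed weeks and uses an average month length of 30.4375 days; the function names and the example dates are hypothetical.

```python
from datetime import date

TERM_WEEKS = 40             # ASSUMPTION: term taken as 40 completed weeks
DAYS_PER_MONTH = 30.4375    # average month length, used only for display


def corrected_age_days(birth_date, assessment_date,
                       gestation_weeks, gestation_days=0):
    """Chronological age in days minus the days by which birth preceded term."""
    chronological = (assessment_date - birth_date).days
    days_early = TERM_WEEKS * 7 - (gestation_weeks * 7 + gestation_days)
    return chronological - max(days_early, 0)  # no correction after term


def corrected_age_months(birth_date, assessment_date,
                         gestation_weeks, gestation_days=0):
    days = corrected_age_days(birth_date, assessment_date,
                              gestation_weeks, gestation_days)
    return days / DAYS_PER_MONTH


# Hypothetical child born at 28 weeks on 1 January 2010, seen on 25 March 2012.
months = corrected_age_months(date(2010, 1, 1), date(2012, 3, 25), 28)
print(f"corrected age: {months:.1f} months")
```

For this child the chronological age is almost 27 months, but the corrected age falls at about 24 months, which is the age targeted by the European and Australasian convention.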

Table 1

Summary of definitions for recommended outcome categories (British Association of Perinatal Medicine 2008)

Neurosensory outcomes

Most studies attempt to evaluate children face-to-face with a trained clinician. To carry out very accurate evaluations of hearing and vision is labour-intensive and most studies accept pragmatic definitions of functional neurosensory impairment. There is also little consensus over the assignment of a diagnosis of ‘cerebral palsy’, one reason that assessments in preterms are delayed to at least 2 years when children with transient dystonia are likely to improve. Most cerebral palsy registers delay registration until 4–5 years of age to ensure that the diagnosis is confirmed. Thus clear functional measures are much more effective means of classifying outcome at younger ages—using either the simple health status classification proposed in 199418 or, for motor outcomes, moving to the much better validated Gross Motor Function Classification System.19 If this approach is taken, then providing clinical assessments is made much easier and can perhaps be carried out by someone other than a doctor with adequate training and validation.

Developmental testing

The key issue is really how to evaluate the cognitive domain: what to use and who is to perform the assessments. The importance of this lies in the fact that in later life it is cognition, learning and behaviour that are the areas in which impairment is most prevalent. The gold standard has been the administration of a well-standardised developmental test, which covers the classical domains of motor (gross and fine), language and cognition. If a test is to be used to evaluate these domains, it needs to be psychometrically sound and predictive. For many years, the Griffiths Scales were used as well-liked ‘developmental’ scales. These have been recently and belatedly restandardised but have, in the meantime, been overtaken by the Bayley family of tests as the measure of choice in Europe and across the world. The original Bayley test was restandardised in 1993 using the same format of mental and psychomotor subscales. In 2006, this was superseded by a third edition, which was a radical change. First, it was now split into two motor scales, two language scales and a cognitive scale, making comparison with the second edition difficult. Second, a different approach to norming the results was undertaken, with seeding of the reference population with 10% of children with developmental risk (prematurity, Down syndrome, etc). The resultant test gave normative scores that were on average 7 points higher than the second edition,12,20 in contrast to the lower scores anticipated from a simple restandardisation because of the ‘Flynn effect’.12 In practice, the differences are greater than that when using perinatally recruited comparison groups,11 and more detailed assessment of the second and third editions in cognition and language revealed more generous scores in children who were performing at lower levels.9 These problems make a huge difference to the categorisation of outcomes for research purposes.10 Interpreting test scores now probably needs a local control population, which adds expense and methodological challenges to already expensive projects; otherwise a higher cut-off is necessary, such as -1 standard deviation, which has uncertain predictive value, rather than the conventional -2 standard deviations. Furthermore, the third edition is not standardised in languages other than English, making extrapolation of scores derived following simple translation rather difficult. New full standardisations in Chinese and Dutch are under way and so are simple validations in many other languages.
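The consequence of inflated scores for categorisation can be illustrated numerically. The sketch below assumes normally distributed scores with a mean of 100 and a standard deviation of 15, and models the 7-point difference as a simple upward shift of the whole distribution; that is a simplification, since the text above notes that the inflation is larger at lower levels of performance. The function name is hypothetical.

```python
from statistics import NormalDist

MEAN, SD = 100.0, 15.0   # ASSUMPTION: conventional standard-score scaling
SHIFT = 7.0              # average difference between editions, from the text


def fraction_below(cutoff, mean=MEAN, sd=SD):
    """Expected fraction of a normally distributed population below a cut-off."""
    return NormalDist(mu=mean, sigma=sd).cdf(cutoff)


for label, cutoff in (("-2 SD", MEAN - 2 * SD), ("-1 SD", MEAN - SD)):
    as_normed = fraction_below(cutoff)
    inflated = fraction_below(cutoff, mean=MEAN + SHIFT)
    print(f"{label} (score {cutoff:.0f}): "
          f"{as_normed:.1%} expected, {inflated:.1%} if scores run 7 points high")
```

Under these assumptions a 7-point upward shift reduces the proportion of children falling below the conventional -2 standard deviation cut-off to well under half of what the norms imply, which is why a local control group or a higher cut-off is suggested.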

Who is best placed to carry out these assessments? Where studies were modest in size, dedicated research staff recruited in a few centres provided a feasible option. However, and increasingly, studies cover a number of neonatal units and many are international, with centres recruiting relatively small numbers of participants, and this option becomes both unwieldy and unattractive to trainees who may be available for the posts.

Several attempts have been made to recruit locally-based assessors in each recruiting centre. This has advantages for the parents in that they are likely to be seen locally, and for assessors where travel is kept to a minimum. For some studies, this has been highly successful and excellent follow-up rates achieved. In contrast, for other more recent studies, there is often initial enthusiasm for training but no real commitment to validate performance on a test. Developmental testing must be done to a rigorous standard and variance in scoring from sloppy technique kept to a minimum. In clinical practice, a developmental test may provide a useful framework within which to explore clinically a child's strengths and weaknesses, but for research purposes, accuracy is critical. Successful testing also requires practice and familiarity as key attributes to ensure a tester carries out a reliable and accurate assessment. For busy clinicians whose time is restricted by clinical contracts this is increasingly difficult to achieve within the UK system, even though the National Institute for Health Research portfolio accrual system brings money back into their Trust to support their activity and sessional payments are usually available. Particularly when recruitment has been carried out through antenatal services, there may be little incentive to support these assessments, but it is often these studies in which the evaluation of safety is critical, particularly following the observation that antibiotics given to women in preterm labour may be associated with an increase in cerebral palsy and impairment in their children.21

Some studies have relied on non-medical assessors recruited for the study, often travelling widely over a geographical region. Developmental testing procedures are often better conducted by psychologists, who understand the need for practised standard techniques and accuracy of observations. Other more medically focussed evaluations, such as neurological testing to evaluate a child for cerebral palsy, may be difficult for such assessors, but this is facilitated by the use of functional assessments, which can be easily completed by a nurse or psychologist. Use of contracted non-medical assessors for research studies can be highly effective and may prove to be the best way forward. Ideally these assessments should not duplicate assessments carried out locally, and data do need to be shared with the local team if the assessments are nominally at the same age, but with the caveat that clinical tests may not have been conducted with the rigour and exacting standards required for research evaluations.

An alternative is to develop a network of assessors to work for the range of studies that are evolving. This would have advantages in terms of locality-based assessments and training/maintenance of skills, but does require central funding and standardised costing for trials; such a centre could easily run tracing and appointment setting for a range of studies and would make perinatal/neonatal trials much easier and more effectively organised than the current system of reinvention for each new trial. It would provide a ‘gold standard’ assessment to which could be added specific outcomes as directed by each specific trial.

Within the UK, neonatal services are organised into managed care networks and commissioned centrally. There is national guidance as to how services should be organised, termed the NHS Toolkit for High Quality Neonatal Services.22 This, and indeed most commissioning contracts, require that individual neonatal services should provide follow-up assessments to 2 years; in practice, this is patchy and often does not include a standardised developmental test. An alternative is to use parent report to categorise development and other outcomes. The current preterm AND after (PANDA) study23 is comparing parental data capture with that of routine systems; results are awaited but it does seem unlikely that routine data will be useable for research purposes without some quality assurance and improved coverage.

The Gross Motor Function Classification System by parent report is well validated19 and a range of other developmental tests have recently been reviewed.24 Of the parent report measures, most are screening tests that will tend to over-report children with borderline abilities. This is not a problem in comparative randomised trials, as the over-identification generally will not be biased and will be equally and randomly distributed between the two arms of the trial, provided the test has reasonable psychometric properties. For clinical purposes, over-referrals on parent reports have been shown to be an at-risk group for whom further developmental assessment may be beneficial.(25) The International Neonatal Immunotherapy Study (INIS) trial25 has used a derivation of the Parent Report of Children's Abilities–Revised (PARCA-r)26,27 to evaluate 2-year outcomes successfully. The weakness of using parent report is the tendency to have a high proportion of dropouts, as in all questionnaire-based studies, although this was avoided nicely in INIS. Simple questionnaires may be less predictive of later impairment but may be completed by telephone for parents who have reading difficulties, at a home visit or online by those who do not respond to mail. Epidemiological screening by electronic means is not readily available for developmental assessments at young ages but is an important alternative for even complex instruments, such as the Development and Wellbeing Assessment for evaluating behavioural problems in older children, and perhaps needs to be considered. Using solely a parent report questionnaire will not allow evaluation of other clinical items, such as growth, with any reliability.
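The argument that over-identification by a screening test is distributed equally between trial arms can be made concrete. In the sketch below, which is an illustration and not an analysis from any cited trial, the same sensitivity and specificity are assumed to apply in both arms; the function names and all numbers are hypothetical. Under that assumption the observed difference between arms keeps its direction but is reduced by the factor (sensitivity + specificity - 1).

```python
def screen_positive_rate(true_rate, sensitivity, specificity):
    """Expected proportion screening positive, given the true impairment rate."""
    true_positives = sensitivity * true_rate
    false_positives = (1.0 - specificity) * (1.0 - true_rate)
    return true_positives + false_positives


def observed_difference(rate_a, rate_b, sensitivity, specificity):
    """Difference in screen-positive rates when both arms share one test."""
    return (screen_positive_rate(rate_a, sensitivity, specificity)
            - screen_positive_rate(rate_b, sensitivity, specificity))


# Hypothetical trial: true impairment 20% (control) versus 15% (intervention),
# screened with a questionnaire of 90% sensitivity and 80% specificity.
true_diff = 0.20 - 0.15
seen_diff = observed_difference(0.20, 0.15, sensitivity=0.90, specificity=0.80)
print(f"true difference {true_diff:.3f}, observed difference {seen_diff:.3f}")
```

Both arms show inflated rates, and the gap between them narrows, so a trial relying on such a measure may need a larger sample to detect the same true effect.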

The balance between using face-to-face evaluation and parent report as outcome evaluation methodologies is an important decision in initial trial design, and the choice depends on the nature and primacy of the outcomes in question. Very successful studies may be achieved using either route, but the balance between costs, completeness of data collection and accuracy of assessment must be consistently re-evaluated.

Other critical aspects of outcome evaluation

Maintaining a study cohort for several years is hard work. It is key that there are systems for emphasising the importance of the delayed assessment to participants and parents. Many parents forget that this is part of the study and, if their child is doing well, think it unnecessary; conversely, children with serious impairments may have had a lot of appointments and assessments, and it is difficult for their parents to put them through further ones. Response rates relate to social advantage, as clearly seen in EPICure 2.8 When attempting to improve responses, generally the last few children recruited will have a higher proportion of problems compared with those who are easier to recruit.4,5

Key aspects of maintaining contact are shown in box 1. Many aspects of contacting participants have themselves been trialled,28 and those conducting the studies should evaluate their processes to ensure they maximise response rates. Furthermore, it is helpful if permissions are obtained early on to contact health professionals for information, to at least ensure that some details are captured on all children and avoid the difficulty of later contacts.

Box 1 Key strategies to maximise follow-up rates:
  • Before discharge home:

    • Obtain contact details (landline, mobile, email, Facebook name, etc)

    • Obtain details of relative who may be contacted if contact lost

    • Leaflet explaining importance of the follow-up examination

    • Study gift (small toy, t-shirt with study logo)

  • After discharge before assessment due:

    • Dedicated follow-up coordinator

    • Use national tracing strategies where possible

    • Maximise mailing appearance, appropriate reading age, etc28

    • Prospective contacts to ascertain health status

      • Telephone (use mobile with unblocked number)

      • Interim letters with change of address cards

    • Newsletters, Facebook page, website

    • Birthday, Christmas and New Year holiday cards

    • Short interim questionnaires

      • To minimise recall bias for health contacts

      • Focused on relevant issues for the child and family

  • Main outcome assessment

    • Arrange well before time

    • Ring to confirm attendance

    • Pay travel expenses

    • Flexibility over time and site of assessment

  • Following assessment

    • Always write with thanks

    • Feedback results of assessments

    • Offer research summary at end of study

Have we got outcome evaluation right?

To this point, the focus has been on evaluation at the end of the second year. This approach is considered a compromise mainly because of the timing of neurological assessment in ex-preterm infants and the relatively poor prediction in terms of earlier developmental scores. For those studies where follow-up concentrates on infants with severe impairments (eg, following intrapartum hypoxia), evaluation at 12–15 months may be sufficient, followed by school age assessment if evaluation of more subtle psychological or behavioural outcomes is required. Indeed there may be advantage in delaying formal assessment for trials to 3–4 years of age as the available instruments are better and neurological assessment more likely to be accurate.

One relatively untapped source of independently collected population data in the UK is the education system's National Attainment Tests.29 Their relative value is somewhat increased by the ability to review single assessment test results in light of the child's classroom performance at Key Stage 1, but children attending many private schools or special schools may not take the tests. Key Stage 2 tests are consistent across the country and independently centrally marked, making them potentially more granular and discriminative.

One area that would also make a huge difference is the development of accurate bridging biomarkers, which would reduce the need for long-term follow-up and reduce trial size, thereby speeding up the process of developing new interventional strategies. At present the best biomarkers relate to MRI, using advanced computational techniques such as tract-based spatial statistics,30 or proton spectroscopy,31 but are currently restricted to outcomes in term-born infants following intrapartum hypoxia. In preterm children the predictive value of such techniques appears to be less, and the increasing understanding of the important processes that underlie deficits in ex-preterm children means that we are as yet some way off being able to dispense with long-term outcomes, with all their inherent difficulties.


For the foreseeable future, outcome evaluations will remain important as outcomes for randomised trials. Trials are increasing in number and size. Outcomes may be collected in a multitude of ways and should be designed for each trial in turn. However, I believe there is sufficient commonality in outcome definition to warrant our developing a national strategy here in the UK, for example, to ensure that we remain at the forefront of neonatal research and develop effective interventions to reduce impairment and disability in our fragile patients.


The author receives part funding from the Department of Health's National Institute for Health Research Biomedical Research Centre's funding scheme at University College London Hospital/University College London. The author is grateful to Dr Samantha Johnson for her valuable comments on the manuscript.



  • Competing interests None.

  • Provenance and peer review Commissioned; externally peer reviewed.
