
This article has a correction.

Diagnostic tests for bacterial infection from birth to 90 days—a systematic review
  1. P W Fowlie (a)
  2. B Schmidt (b)

  (a) Department of Child Health, University of Dundee, Scotland
  (b) Department of Paediatrics, McMaster University, Hamilton, Ontario, Canada

  Correspondence to: Dr P W Fowlie, Department of Child Health, Ninewells Hospital, Dundee DD1 9SY.


AIM To determine the clinical value of common diagnostic tests for bacterial infection in early life.

METHODS A Medline search (1966–95) was undertaken to identify studies that reported the assessment of a diagnostic “test,” predicting the presence or absence of bacterial infection in infants up to 90 days of age. The quality of each selected study was assessed using defined criteria. Data were extracted twice to minimise errors.

RESULTS Six hundred and seventy articles were identified. Two independent investigators agreed that 194 studies met the inclusion criteria (κ = 0.85), 52 of which met primary quality criteria; 23 studies reported data on (a) haematological indices, (b) C reactive protein evaluation, and (c) surface swab assessment. For haematological indices, the likelihood ratios for individual tests ranged from 20.4 (95% confidence interval 7.3 to 56.8) for a white cell count < 7000/mm3 to 0.12 (0.04 to 0.37) for an immature:total (I:T) white cell ratio < 0.2. For C reactive protein evaluation, the likelihood ratios ranged from 12.56 (0.79 to 199.10) for a value of > 6 mg/l to 0.22 (0.08 to 0.65) for a negative value. For surface swab assessment, the likelihood ratios ranged from 33.6 (2.1 to 519.8) for a positive gastric aspirate culture to 0.08 (0.006 to 1.12) for microscopy of ear swab material that did not show any neutrophils. Likelihood ratios for combinations of these individual tests ranged from 10.17 (3.64 to 28.41) to 0.47 (0.22 to 1.00).

CONCLUSIONS The methodological quality of studies assessing the accuracy of diagnostic tests is generally poor. Even in rigorous studies, the reported accuracy of the tests varies enormously and they are of limited value in the diagnosis of infection in this population.

  • sepsis
  • bacterial infection
  • diagnostic tests


Infection in early life is a major cause of mortality and morbidity.1 The symptoms and signs are often non-specific, making diagnosis difficult and the optimal strategy for managing these infants unclear.2-4 To be of practical use, any diagnostic test must fulfil certain criteria: it should accurately predict the presence or absence of infection and be reliable; it should be simple to perform; results should be available quickly; and it should be cost effective. If a test is not sufficiently accurate, then regardless of its other attributes it will be of limited value in clinical practice.

The Evidence Based Medicine Working Group suggests four criteria that improve the validity of any results from a study assessing the accuracy of a diagnostic test: (1) there should be an independent, blind comparison with a reference standard; (2) the patient sample should include an appropriate spectrum of patients to whom the test will be applied in clinical practice; (3) the results of the test being evaluated should not influence the decision to perform the reference standard; and (4) the test should be described in sufficient detail to permit replication.5 6 In addition, it is suggested that, to be clinically useful, the data should be presented in such a manner that likelihood ratios can be calculated.7

We carried out a systematic review to determine the methodological quality of clinical research into the accuracy of diagnostic tests for bacterial infection in the first three months of life, and to review the results from the studies most likely to provide valid data.

Methods
An extensive search for articles (English only) on the diagnosis of infection in the newborn period and during infancy published between 1966 and April 1995 was made in Medline at the National Library of Medicine, Bethesda, USA, using a strategy (available from the authors) designed to be highly sensitive at identifying articles on diagnosis.8 We identified 670 citations. Letters, editorials, commentaries, and reviews were excluded, leaving 572 articles and abstracts for possible inclusion.

To be included in the review, articles had to report the assessment of a diagnostic “test” (including signs and symptoms) predicting the presence or absence of bacterial infection, and also include extractable data relating to infants up to 90 days of age. Tests for chlamydial infection were included, although chlamydia is not strictly a bacterium. Diagnostic tests for viral infection were excluded, as were tests using amniotic fluid or cord blood. The abstract and, if necessary, the complete manuscript of each article identified by the search were assessed independently by the authors. One hundred and ninety four articles met the inclusion criteria. Agreement as to which articles to include or exclude was excellent (Cohen’s κ statistic = 0.85), and any disagreement was resolved by discussion.

The quality (validity) of each article included in the review was assessed independently by the two authors. Three primary criteria were used: (1) Was there an independent blind comparison with a reference (gold) standard? For the purposes of the review, an acceptable diagnostic gold standard was regarded as a diagnosis of infection based on pure growth of an organism from blood, cerebrospinal fluid, urine, or deep tissue culture, or chest x ray changes supported by bacteriological growth from endotracheal tube aspirate; (2) Did the population studied include an appropriate spectrum of babies to whom the test would be applied in clinical practice? (3) Were the results reported in such a manner that they could be expressed as likelihood ratios? Two other secondary criteria were also assessed: was the test described in sufficient detail to allow duplication, and was there any reference to the reliability of the test?

Data from the studies retained after this process were extracted twice by PWF. Where possible, 2 × 2 tables, or equivalent, were created. For tests with only a positive or negative result, the accuracy of each “test” was then reported as sensitivity and specificity, positive and negative predictive values, and the likelihood ratios associated with positive and negative results. In the case of multilevel tests, the results were reported as likelihood ratios associated with each level of test result. Confidence intervals are given only for the likelihood ratios to avoid reporting too many data. The precise definitions of these measures, and how to use the likelihood ratio, are described in the appendix.

Results
Of the 194 studies accepted for inclusion, the authors agreed that 73 (38%) reported an independent, blind comparison with an acceptable reference standard (agreement between authors, κ = 0.30), 148 (76%) studied a population that included an appropriate spectrum of babies to whom the test could be applied in clinical practice (κ = 0.82), and 58 (30%) reported results that could be expressed as a likelihood ratio (κ = 0.56). Initial agreement by the two authors on the secondary criteria suggested that in only 93 studies (51%) was the actual test described in sufficient detail such that it could be repeated; and in only six studies (3%) was there any mention of the reliability of the tests.
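The chance-corrected agreement statistic (Cohen’s κ) quoted above can be computed as follows. This is an illustrative sketch, not code from the study; the function name and the example decisions are ours:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: proportion of items on which the raters agree.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal proportions.
    p_expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical include/exclude decisions for four articles:
reviewer_1 = ["include", "include", "include", "exclude"]
reviewer_2 = ["include", "include", "exclude", "exclude"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # 0.5 for this example
```

A κ of 1 indicates perfect agreement, 0 indicates agreement no better than chance; values around 0.85 (as for inclusion decisions here) are considered excellent.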

In order to minimise the inclusion of studies reporting potentially biased results, it was decided to examine further only those articles that met all three of the primary “quality” criteria. Initially it was agreed that 57 articles appeared to do this; however, during data extraction it became apparent that a further five of these articles did not, in fact, meet all three methodological criteria. After discussion it was decided to drop these five articles, leaving 52 papers for further assessment.9-60

We assessed 155 individual tests in these 52 papers. We present the results of the review of individual tests based on (1) haematological indices, (2) C reactive protein evaluation, and (3) surface swab assessment. These particular data are taken from 23 studies9 11 12 18 21-23 31 33 35 36 38-40 42 45 49 50 52 55 56 58 59 and were chosen for reporting here because they relate to tests that are commonly used in clinical practice. Data on diagnostic accuracy when combinations of these tests are used are also presented; these were reported in seven studies.18 31 34 35 40 47 52 Data from the remaining 27 papers assessing the other tests (tests specific for group B streptococcal infection, tests to diagnose neonatal conjunctivitis, tests using acute phase proteins other than C reactive protein, combinations of other diagnostic tests, clinical signs used to diagnose infection, and a variety of miscellaneous tests) are available from the authors.

Individual study details are listed in table 1. The accuracy of each of the tests, or combination of tests, reviewed is shown in tables 2-5.

Table 1

Characteristics of studies included in the review

Table 2

Accuracy of haematological variables

Table 3

Accuracy of C reactive protein assay as a diagnostic test for bacterial infection

Table 4

Accuracy of surface swabbing (including gastric aspiration) for diagnosing bacterial infection

Table 5

Accuracy of combinations of tests

Discussion
The methodological difficulties in carrying out systematic reviews and meta-analyses evaluating diagnostic tests have been reported62 and we have addressed many of these in this study.

We used a search strategy designed to be as sensitive as possible8 and we believe that by using this strategy we will have found most of the appropriate studies. However, we did not search any database other than Medline, we did not specifically consult individual authors, and we did not attempt to identify possible unpublished studies. As with all systematic reviews, therefore, it is possible that some data have been missed.

The level of agreement between the investigators on which articles were relevant was exceptionally good. However, when the review process progressed to assessing the quality of individual studies, agreement fell, though it remained acceptable.63 Both authors are experienced at assessing study design, and a significant factor contributing to the disagreement was that authors reporting this type of study are often not explicit enough in describing their methods.

The quality of many of the published studies is poor, in keeping with the findings of another review on this subject.64 This is well illustrated by the fact that only 52 of a total of 572 reports met all three of the basic criteria designed to minimise the possibility of bias and improve the usefulness of any results. There were various common methodological flaws. Studies not infrequently repeated the test on a number of occasions in individual babies and then reported each result as a unique “event.” Thus, if babies in whom the test was repeated were prone to a particular test result, this could bias the overall assessment, although the direction of the bias could not necessarily be determined. The test and the criterion standard were often not independent of one another, again potentially introducing bias if only a given test result led to the gold standard being determined. Many studies also either used obviously septic infants or included a “control group” of perfectly well babies—populations in which the test would be unlikely to be used in clinical practice. Investigators conducting this type of research must consider these methodological issues when designing their studies, and also take care to report the methodology used as accurately and explicitly as possible.

We chose to report various measures assessing the accuracy of these tests, including the likelihood ratio. The interpretation of sensitivity and specificity is not intuitive to all clinicians, and although positive and negative predictive values are perhaps of more value, they are only applicable in similar populations—their values vary depending on the prevalence of the outcome under consideration. To overcome this problem, the likelihood ratio allows clinicians to calculate the post-test probability of the outcome as long as some idea of the prevalence of the condition (pretest probability) is known. Less than one third (58/194) of the studies initially included in the review reported data in a way that allowed likelihood ratios to be calculated, thus limiting the clinical value of the information presented.

We have not carried out a formal meta-analysis on any of these results and have not therefore provided any pooled estimates. After the initial systematic review was complete and the individual studies were available for scrutiny, it was felt that there was too much heterogeneity to justify any meta-analysis. The populations studied all varied in age, gestation, and selection criteria; there were numerous different criterion standards used; and very few of the tests were sufficiently similar, with a variety of cut off values being used. These differences are partly reflected in the heterogeneity of the result of similar tests reported in different studies.

Our choice of gold standard has often been used by others65-67 but does have some theoretical problems. If the number of genuinely infected infants is underrepresented by the gold standard—that is, some infected infants are not identified—then the positive predictive value of the test will appear lower than it truly is. However, this does not explain the poor negative predictive values that we frequently identified. For clinicians, this feature of the tests makes it very difficult to suggest either not starting treatment or stopping it on the basis of a negative result. Indeed, if some true infection were not picked up by culture, the true negative predictive value of these tests would be even lower than we report. On the other hand, in a small number of cases, bacterial isolates will actually represent poor aseptic technique rather than true infection, and under these circumstances the accuracy of the results will be biased in the opposite direction.

Will the use of any of these investigations allow clinicians to alter their management? It has been suggested that a likelihood ratio between 0.1 and 10 is of limited use for predicting the presence or absence of a disease, since it will not substantially alter the pretest probability.7 Apart from a few exceptions, the likelihood ratios calculated from the studies included in this review lie within this indeterminate range and so they appear to be of limited value, either as individual tests or in combination. In our experience, when an infant presents with possible serious bacterial infection, clinicians understandably tend to act conservatively by performing some form of criterion standard (blood culture, urine culture, lumbar puncture, chest x ray, or a combination of these) and often start the infant on antibiotics, at least until the results of the criterion standard are available, when the situation is reviewed. Used singly, the diagnostic tests reported here are unlikely to change the pretest probability of a given child either being infected or not being infected, and so will not be much use in deciding whether to start or stop treatment. Our assessment of combinations of tests—a common clinical practice at present—showed equally disappointing results, although others have suggested this approach may be more promising.4 64 It is important, however, to recognise that regardless of the characteristics of any diagnostic test, the impact of different management strategies on any particular outcome can only truly be assessed by conducting appropriate randomised trials.

The quality of existing studies examining the accuracy of tests used to diagnose infection in the first three months of life is often poor and future studies must be more rigorous. Valid data from existing studies suggest that tests are of limited value in the diagnosis of infection in this population.

Appendix
Sensitivity = a / (a + c), that is, the proportion of cases who are infected and have a positive test.

Specificity = d / (d + b), that is, the proportion of cases who are not infected and have a negative test.

Positive predictive value = a / (a + b), that is, the proportion of cases with a positive test who are infected.

Negative predictive value = d / (d + c), that is, the proportion of cases with a negative test who are not infected.

Likelihood ratio for a positive test = [a / (a + c)] / [b / (b + d)]; post-test odds of infection = pretest odds × likelihood ratio (odds = probability / (1 − probability)).*

Likelihood ratio for a negative test = [d / (d + b)] / [c / (a + c)]; post-test odds of no infection = pretest odds × likelihood ratio (odds = probability / (1 − probability)).*
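Expressed as code, and assuming the conventional 2 × 2 layout (a = infected, test positive; b = not infected, test positive; c = infected, test negative; d = not infected, test negative), the definitions above amount to the following sketch (the function name is ours, not from the article):

```python
def diagnostic_accuracy(a, b, c, d):
    """Accuracy measures for a 2 x 2 diagnostic table, per the appendix."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (d + b),
        "ppv": a / (a + b),  # positive predictive value
        "npv": d / (d + c),  # negative predictive value
        # Likelihood ratio for a positive test (predicting infection).
        "lr_positive": (a / (a + c)) / (b / (b + d)),
        # Likelihood ratio for a negative test (predicting no infection).
        "lr_negative": (d / (d + b)) / (c / (a + c)),
    }


# Hypothetical counts: 90 true positives, 10 false positives,
# 10 false negatives, 90 true negatives.
measures = diagnostic_accuracy(90, 10, 10, 90)
```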

*The mathematics can be avoided by using a nomogram for applying likelihood ratios (Fagan TJ. Nomogram for Bayes’s theorem. N Engl J Med 1975;293:257). A straight line is drawn through the estimated pretest probability that the baby will experience the outcome of interest and the likelihood ratio associated with the given test result. The probability that the baby will now experience the outcome, given that particular test result (post-test probability), can simply be read off the nomogram.
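For readers who prefer the arithmetic to the nomogram, the same calculation can be sketched directly (function name ours; the example values are hypothetical):

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a post-test probability via a likelihood ratio."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# E.g. a pretest probability of 20% with a likelihood ratio of 9:
# odds of 0.25 become 2.25, a post-test probability of about 69%.
probability = post_test_probability(0.2, 9)
```

Note that a likelihood ratio of 1 leaves the pretest probability unchanged, which is why ratios in the indeterminate range near 1 are of limited clinical use.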