After initial hesitancy, owing to fears that it might lead to "cookbook medicine", among other concerns, evidence-based medicine (EBM) is now an accepted principle in all fields of medicine including psychiatry. Many treatment guidelines distil the essence of the evidence to inform clinicians in their daily practice. One issue that is not entirely resolved, however, is which study or evidence-synthesis design should be considered the highest level of evidence. Early statements from McMaster University in Canada [5] (together with the Cochrane Collaboration, the "cradle" of EBM) suggested that systematic reviews with meta-analysis provide the most robust and reliable evidence, but not all guideline producers agree. This is a timely debate, fuelled by the increasing publication of network meta-analyses, a novel approach which takes the assumptions of meta-analysis one step further [3].

Conventional meta-analyses only average the randomised trials comparing two treatments directly (so-called direct evidence). The major criticism has been that meta-analysis compares "apples and oranges": are trials sufficiently similar to be summarised, or are they "heterogeneous"? Network meta-analysis (also called multiple-treatments meta-analysis) additionally uses "indirect evidence". For example, if in schizophrenia there were trials that compared olanzapine with quetiapine and trials that compared olanzapine with aripiprazole, but no trials comparing quetiapine with aripiprazole directly, we can estimate quetiapine versus aripiprazole indirectly from the two direct comparisons (see Fig. 1).

There are several strengths and added values of this approach: (a) the indirect evidence can fill gaps in the evidence matrix, which makes it possible to derive hierarchies of which drug is probably the best, second best, third best and so on. This information is urgently needed by guidelines, but cannot really be provided by conventional meta-analysis (now sometimes also called "pairwise meta-analysis"). (b) Network meta-analysis can use all kinds of comparisons simultaneously: single antipsychotics versus placebo [11], head-to-head comparisons of new versus old antipsychotics [13], and head-to-head comparisons of new drugs [15]. These separate types of comparisons could heretofore only be summarised in separate meta-analyses and viewed "impressionistically" together afterwards [14]. When the network is well connected and provides both direct (e.g. quetiapine vs. aripiprazole head-to-head) and indirect (e.g. quetiapine vs. aripiprazole via olanzapine) comparisons, they can be pooled into so-called mixed evidence, thus increasing statistical power and the precision of the estimates [3]. This use of the entire available information also allows for more timely recommendations than conventional pairwise meta-analyses [16]. The underlying assumption of network meta-analysis is that the indirect evidence validly estimates the differences between treatments. This assumption is examined in several ways, including statistical tests that compare the direct and indirect evidence for all comparisons where both are available [17].

Fig. 1

Principle of the use of indirect evidence in network meta-analysis
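
For readers unfamiliar with the arithmetic behind Fig. 1, the following minimal sketch (in Python, with purely hypothetical effect sizes and standard errors that are not taken from any real trial) illustrates how a simple Bucher-type indirect estimate, a pooled "mixed" estimate and a basic consistency check can be obtained on the log odds ratio scale.

```python
import math

# Hypothetical direct estimates on the log odds ratio scale
# (illustrative values only, not taken from any real trial).
d_OQ, se_OQ = 0.20, 0.10   # olanzapine vs. quetiapine
d_OA, se_OA = 0.05, 0.12   # olanzapine vs. aripiprazole

# Indirect comparison quetiapine vs. aripiprazole via the common
# comparator olanzapine: the effects are subtracted, the variances add up.
d_ind = d_OA - d_OQ
se_ind = math.sqrt(se_OQ**2 + se_OA**2)

# If head-to-head trials of quetiapine vs. aripiprazole also existed
# (again a hypothetical estimate), direct and indirect evidence can be
# pooled by inverse-variance weighting into the "mixed" estimate ...
d_dir, se_dir = -0.10, 0.15
w_dir, w_ind = 1 / se_dir**2, 1 / se_ind**2
d_mix = (w_dir * d_dir + w_ind * d_ind) / (w_dir + w_ind)
se_mix = math.sqrt(1 / (w_dir + w_ind))

# ... and the agreement of direct and indirect evidence ("consistency")
# can be checked with a simple z-test.
z = (d_dir - d_ind) / math.sqrt(se_dir**2 + se_ind**2)

print(f"indirect estimate: {d_ind:.2f} (SE {se_ind:.2f})")
print(f"mixed estimate:    {d_mix:.2f} (SE {se_mix:.2f})")
print(f"inconsistency z = {z:.2f}")
```

Note that the standard error of the indirect estimate is necessarily larger than that of either direct comparison from which it is built, whereas the mixed estimate is more precise than both sources; this is the gain in statistical power and precision referred to above. A full network meta-analysis handles many treatments and multi-arm trials simultaneously within one model rather than one loop at a time, but the basic logic is the same.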

Nevertheless, we feel that there are at least two major arguments why network meta-analyses, as well as conventional pairwise meta-analyses, should generally be considered the highest level of evidence (Fig. 2).

Fig. 2

Proposed evidence hierarchy

  1.

    The first one is a simple, pragmatic argument: nowadays there are so many trials available that it is simply impossible for a guideline team to read them all and come up with an objective evaluation. For example, the latest network meta-analysis on antipsychotic drugs for schizophrenia comprised 212 blinded trials [12], and the latest network meta-analysis on antidepressants for major depressive disorder comprised 117 randomised-controlled trials [2]. Nobody can read all these articles and objectively "synthesise" them narratively. We have shown that abstracts of industry-sponsored trials are often biased, so reading only the abstracts is not sufficient [9]. The avalanche of evidence has by now even become a problem for meta-analyses themselves. In 2010, 11 meta-analyses were published per day, as many as the number of randomised-controlled trials published per day three decades earlier [1]. There are often several meta-analyses on the same or similar topics, but their authors do not always come to the same conclusions, and it is often unclear whether the reason is slightly different research questions or a different interpretation of the results [8]. We have therefore demanded that a review of the existing systematic reviews be made mandatory [8].

  2.

    The second argument is that science always has to start out from the ideal situation. Imagine 10 identical studies. There is no doubt that the pooled estimate of these 10 studies is better evidence than any of the single studies alone, for the simple reason that a bigger sample size increases the precision of the estimate, meaning that we can be more confident about the result. Consider this thought experiment: there is a trial with 10 participants of which seven responded to treatment, and an identical trial with 1000 participants of which 700 responded. In both cases the response rate is 70 %, but obviously we would trust the large trial more (a short calculation after this list makes the difference in precision explicit). The same holds true for network meta-analysis. If all trials are identically designed, and the direct and the indirect evidence are "consistent", then there is no reason not to use the indirect evidence to complement the direct evidence. Therefore, nothing generally speaks against considering network meta-analysis the highest level of evidence. The problem is rather that the world is often not ideal. For example, it is well known from many medical fields that small trials tend to exaggerate treatment effects [4]. Therefore, the results of a meta-analysis based on several trials can be completely reversed by a single large randomised-controlled trial published later [10, 19]. In mental health, for example, it has been shown that the results of meta-analyses only become stable once approximately 1000 participants have been included in them [19]. In our experience, the results of network meta-analyses can also be distorted by small trials or by differences in other trial characteristics, such as differing study conditions, patient inclusion criteria, etc. But these potential limitations (which may or may not apply in a specific case) should not be used to exclude network meta-analysis a priori from the top of the evidence hierarchy. In medicine, we should always start from the theoretically best method, and all methods have limitations. For example, similar problems occur in randomised-controlled trials, which are preferred by some guideline producers. Can we really assume that the patients included in them are similar enough that we can average the effects in both groups and compare them? The inclusion criteria usually leave a lot of room for variability, leading to large standard deviations in psychiatric studies.

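The difference in precision in the thought experiment above can be made explicit with a short calculation. The minimal sketch below follows the numbers given in the text (7 of 10 versus 700 of 1000 responders); the simple Wald interval is used purely for illustration and is admittedly crude for a trial of 10 participants.

```python
import math

# Two hypothetical trials with the same observed response rate of 70 %
# but very different sample sizes (numbers from the thought experiment).
trials = {"small trial": (7, 10), "large trial": (700, 1000)}

for name, (responders, n) in trials.items():
    p = responders / n
    # Standard error of a proportion and the usual Wald 95 % confidence
    # interval; sample size is what drives the width of the interval.
    se = math.sqrt(p * (1 - p) / n)
    lower, upper = p - 1.96 * se, p + 1.96 * se
    print(f"{name}: 70 % response, 95 % CI {lower:.0%} to {upper:.0%}")
```

The same response rate of 70 % comes with a confidence interval of roughly 42 % to 98 % in the small trial, but of only about 67 % to 73 % in the large trial, which is exactly why pooling identical studies, whether directly or through a network, provides better evidence than any single study alone.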
Therefore, in our opinion, systematic reviews based on network meta-analyses should generally constitute the highest level of evidence in treatment guidelines, but they need to be assessed carefully, and in certain situations (for example, when a meta-analysis is mainly composed of small trials) a subsequently published, well-designed, large randomised-controlled trial may indeed be preferred [6].

In the real world, nothing comes in complete black or white, but in shades of grey. It is therefore imperative for evidence users to critically appraise each piece of evidence, be it a network meta-analysis, a pairwise meta-analysis or a randomised-controlled trial. One general problem is that publications on levels of evidence often omit the term "systematic review" before "meta-analysis", probably only because the term would otherwise become very long, but a systematic review process must always be implied, because without it any meta-analysis can be useless and should be disregarded. Checklists to assess the quality of systematic reviews, such as the AMSTAR instrument, exist, but they only check the methodological quality of a systematic review, for example whether there was a systematic literature search or whether publication bias was investigated [18]. They do not examine the quality and content of the included studies, which should be assessed with the risk of bias tool (bearing in mind the risk of "garbage in, garbage out"). It would be laudable if guideline authors could reassess the included studies themselves, but this requires a lot of expert knowledge, it is time consuming, and it opens the door to selection bias. We would therefore favour the general application of the GRADE approach [7], which should ideally already be applied by the original systematic review authors, and for which extensions to network meta-analysis have been developed and should be endorsed [14].