Rationale and overview of the screening process
The process of eligibility screening aims to ensure that the eligibility criteria are applied consistently and impartially, so as to reduce the risk of introducing errors or bias into an evidence synthesis. Articles identified in searches typically have a title, an abstract (or summary), and/or a ‘full text’ version such as an academic journal paper, agency report, or internet page. Eligibility screening can be applied at these different levels of reading, imposing a series of filters of increasing rigor, and screening is therefore normally a stepwise process. The exact approach is a matter of preference, although CEE guidelines recommend that at least two filters are applied: (a) a first reading of titles and abstracts to efficiently remove articles which are clearly irrelevant; and (b) assessment of the full-text version of the article [1].
Depending on the nature of the evidence synthesis question and the number of articles requiring screening, titles and abstracts may be screened separately or together. If only a small number of articles can be excluded on title alone (e.g. as found in a systematic review of the environmental impacts of property rights regimes by Ojanen et al. 2017 [24]), then combining the title and abstract screening in a single step may be more efficient. In cases where insufficient information is available in the title or abstract to enable an eligibility decision to be made, or if the abstract is missing, then the full-text version should be obtained and examined. An overview of the eligibility screening process is shown in Fig. 1.
As shown in Fig. 1, the screening process begins with individual articles but final eligibility decisions are made at the level of studies, taking into account any linked articles that refer to the same study (see “Identifying linked articles” below). The evidence selection decision process is conservative at each step, so that only articles which clearly do not meet the inclusion criteria are excluded [1]; in any cases of doubt, articles proceed to the next step for further scrutiny. If after full-text screening the eligibility of a study remains unclear, further information should be sought, if feasible (e.g. by contacting the authors), to enable the study to be included or excluded. Any studies whose eligibility still remains unclear after this process should be listed in an appendix to the systematic review or systematic map report. In systematic reviews, an option could be to include studies of unclear relevance in a sensitivity analysis. The approach for handling unclear studies should be considered during protocol development and specified in the systematic review or systematic map protocol.
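To make this conservative decision rule explicit, the following minimal sketch (in Python, with hypothetical function and argument names) illustrates the logic applied to each record at each screening step; it is intended only as an illustration of the rule described above, not as a prescribed implementation.

```python
# Minimal sketch of the conservative decision rule applied at each screening
# step: only records that clearly fail the eligibility criteria are excluded;
# records that meet the criteria, or whose eligibility is unclear, are retained
# for further scrutiny. Function and argument names are hypothetical.
def screen_record(meets_criteria):
    """meets_criteria: True, False, or None (unclear from the text available)."""
    if meets_criteria is False:
        return "exclude"
    return "retain for further scrutiny"


print(screen_record(False))  # "exclude"
print(screen_record(None))   # unclear, so the record is retained
```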
A single set of eligibility criteria can be used to screen titles, abstracts and full-text articles (e.g. Rodriguez et al. [23] used the eligibility criteria shown in Box 1 for screening titles and abstracts and then applied the same criteria to full-text articles). However, if the information reported in titles and abstracts is limited it may be efficient to use a smaller subset of the eligibility criteria to screen the titles and/or abstracts, and apply the more detailed full set of eligibility criteria for the screening of full-text articles. Whichever approach is used, the eligibility criteria applied at each step should be clearly stated in the protocol.
Identifying linked articles
If the same data are included more than once in an evidence synthesis this can introduce bias [21, 25, 26]. Therefore, the unit of analysis of interest in a systematic review or map is usually individual primary research studies (e.g. observational studies, surveys, or experiments), rather than individual articles.
Investigators often report the same study in more than one article (e.g. the same study could be reported in different formats such as conference abstracts, reports or journal papers, or in several different journal papers [27]), and we refer to these as ‘linked articles’. Although there is often a single article for each study, it should never be assumed that this is the case [3]. Linked articles may range from being duplicates (i.e. they fully overlap and do not contribute any new information) to having very little overlap. Articles which are true duplicates should be removed to avoid double-counting of data. The remaining linked articles which refer to a study should be grouped together and screened for eligibility as a single unit so that all available data pertinent to the study can be considered when making eligibility decisions.
It may be difficult to determine whether articles are linked, as related articles do not always cite each other [28, 29] or share common authors [30]. Some ‘detective’ work (e.g. checking whether the same data appear in more than one article, or contacting authors) may therefore be needed by the review team. Although it would be ideal to identify linked articles that refer to the same study early in the screening process, it may only become clear at the full-text screening stage that articles are linked. Once the links between articles and studies have been identified, a clear record will need to be kept of all articles which relate to each study. This may be done using a separate document or spreadsheet, or using grouping or cross-referencing functions available in bibliographic reference management tools.
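As an illustration, the links between articles and studies could be tracked in a simple mapping from each study to its reporting articles, as sketched below; the study and article identifiers are hypothetical, and the grouping features of reference management tools can achieve the same result.

```python
# Minimal sketch: recording which articles report which study, so that linked
# articles can be screened together and duplicates are not double-counted.
# Study and article identifiers are hypothetical.
from collections import defaultdict

study_articles = defaultdict(list)
study_articles["STUDY-001"].extend(["smith2015_conference_abstract", "smith2016_journal_paper"])
study_articles["STUDY-002"].append("jones2014_agency_report")

def check_unique_assignment(study_articles):
    """Warn if an article has been linked to more than one study, which would
    risk the same data being counted twice in the synthesis."""
    seen = {}
    for study, articles in study_articles.items():
        for article in articles:
            if article in seen:
                print(f"WARNING: {article} is linked to both {seen[article]} and {study}")
            seen[article] = study

check_unique_assignment(study_articles)
```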
Number and expertise of screeners
Eligibility decisions involve judgement and it is possible that errors or bias could be introduced during eligibility screening if the process is not conducted carefully.
Possible problems that could arise at the eligibility screening step are:
- Some articles might be misclassified due to the way members of the review team interpret the information given in them in relation to the eligibility criteria;
- One or more articles might be missed altogether, due to human error;
- Review team members may (knowingly or not) introduce bias into the selection process, since human beings are susceptible to implicit bias and experts in a particular topic often have pre-formed opinions about the relevance and validity of articles [3, 31].
Appropriate allocation of the review team to the eligibility screening task, in terms of the number and expertise of those involved, is important to ensure efficiency [10] and can help to minimise the risk of errors or bias. If any members of the review team are authors of articles identified in the searches then the allocation of screening tasks should ensure that members of the review team do not influence decisions regarding the eligibility of their own articles.
Number of screeners
It has been estimated that when eligibility screening is done by one person, on average 8% of eligible studies would be missed, whereas no studies would be missed when eligibility screening is done by two people working independently [32]. The same authors also suggested that use of two reviewers to screen eligibility increased the number of randomised studies identified by 9%. To ensure reliability of the eligibility screening process, articles providing guidance on conducting systematic reviews in environmental research [2, 11, 12] and health research [14, 15, 18] recommend that eligibility screening should be performed where possible by at least two people. The screeners need not necessarily be the same two people for all articles or for all screening steps. Options could be for one person to screen the articles and the second person to then check the first screener’s decisions; or both screeners may independently perform the selection process and then compare their decisions. Independent screening is preferable since it avoids the possibility that the second screener could be influenced by the first screener’s decision.
The current CEE Guidelines for Systematic Reviews in Environmental Management (version 4.2, March 2013) [1] do not provide recommendations for the number of people who should conduct eligibility screening, although the Guidelines implicitly suggest that a single screener may be acceptable provided that an assessment of screener reliability is conducted. According to the latest CEE evidence synthesis protocols published in Environmental Evidence journal (January–July 2017), screening by a single person, subject to a check of screener reliability using a subset of articles, is the currently practised approach in most cases.
A potential problem with eligibility screening being conducted by a single screener is that any errors in the classification of articles, or any articles missed from classification altogether, may go undetected if a second screener does not check an adequate number of articles. This is why the use of a minimum of two screeners is now considered mandatory in some health research systematic reviews [16, 17]. Reliability checking can be done (e.g. using screener agreement statistics) but has limitations which should be taken into consideration, as we explain below (see “Assessing screener agreement”).
Eligibility screening can be a time-consuming process, typically taking an hour or more for a screener to assess 200 titles or 20 abstracts [10]. If the evidence base is extensive such that large numbers (e.g. tens of thousands) of articles would need to be screened, it might not always be feasible for two or more screeners to work on all screening steps. Consideration may then need to be given as to whether the systematic review or systematic map question, or the eligibility criteria, should be refined (e.g. narrowing the scope) to make the evidence synthesis manageable within the available resources. Discussion with relevant stakeholders, e.g. research commissioners, may be helpful in resolving any difficulties if the level of rigor expected of eligibility screening will be difficult to achieve within the available resources. Employing a single screener at one or more steps of the eligibility screening process, subject to checking screener reliability, is a pragmatic approach which may be justifiable on a case-by-case basis depending on the nature of the topic and how critical it is to minimise the risk of selection bias [33], but should not be considered as being reflective of best practice (see “Assessing screener agreement” below).
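As a rough indication of the resource implications, the indicative rates quoted above (around 200 titles or 20 abstracts per hour) can be used to estimate the screening workload; the article counts in the sketch below are hypothetical.

```python
# Rough workload estimate using the indicative rates quoted above
# (~200 titles or ~20 abstracts per hour); the article counts are hypothetical.
n_titles = 20_000          # records retrieved by the searches
n_abstracts = 4_000        # records remaining after title screening
titles_per_hour = 200
abstracts_per_hour = 20

hours_single = n_titles / titles_per_hour + n_abstracts / abstracts_per_hour
hours_dual = 2 * hours_single  # two screeners working independently on every record

print(f"Single screener: ~{hours_single:.0f} h; dual screening: ~{hours_dual:.0f} h")
```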
It may be tempting to consider employing a single screener for titles, since the information available in a title is usually relatively limited and titles can often indicate that an article is irrelevant without the need to expend detailed effort in screening [10]. However, selection bias could arise at title screening (just as it could at abstract or full-text screening) if a screener is not impartial, and this could be especially important for evidence syntheses on contentious topics. Furthermore, in our experience it is not uncommon for a small proportion (~1%) of articles to be completely missed from screening by a single reviewer, due to human error (e.g. screener fatigue when assessing thousands of articles). For these reasons, good practice would be to employ a minimum of two screeners at the title screening as well as abstract and full-text screening steps.
For systematic maps the need to minimise selection bias may seem less critical than for systematic reviews, since the output and conclusions of systematic maps are often descriptive. Nevertheless, an underlying expectation of systematic maps is that the searching and eligibility screening steps should be conducted with the same rigor as for systematic reviews [5]. It is therefore good practice in all types of evidence synthesis that at least two people conduct eligibility screening of each article. We recommend that deviations from this should only be made as exceptions, where clear justification can be provided and agreed among all relevant stakeholders. This is important for maintaining the integrity of systematic evidence synthesis as a ‘gold standard’ or ‘benchmark’ approach for minimising the risk of introducing errors or bias, and to avoid creating confusion as to whether the methods employed in specific evidence syntheses truly constitute those of a systematic review or systematic map, rather than, for example, a traditional literature review or rapid evidence assessment [10].
If a pragmatic decision is made by the review team to proceed with a systematic review or systematic map involving a large number of articles to screen and to use only one screener for some of the articles then, for consistency with good practice as defined above, the following information should be provided in the protocol and final evidence synthesis report:
- a clear justification for using one screener to screen all articles and a second to screen only a sample, stating which steps of the screening process this will be applied to;
- evidence of the reliability of the approach (i.e. the reliability of the screener’s decisions should be tested and reported; see “Assessing screener agreement” below);
- acknowledgement that the use of one screener to screen all articles and a second to screen only a sample at one or more steps of eligibility screening is a limitation (this should be stated in the conclusions and in the critical reflection or limitations section and, if possible, also in the abstract).
Ultimately, it is the review team’s responsibility to ensure that, where possible, methods are used which minimise risks of introducing errors and bias, and that any limitations are justified and transparently reported.
Expertise of screeners
There is no firm ‘rule’ about how many of the screeners should be topic experts. Given the complexity of environmental topics it is important that the team has adequate expertise in evidence synthesis and the question topic to ensure that important factors relating to the evidence synthesis question are not missed [10]. However, topic experts may lack impartiality, as they are likely to be very familiar with the literature relevant to the evidence synthesis question, which may risk selective screening decisions being made [31]. A pragmatic approach to reduce the risks of any conflicts of interest within a review team could be to include screeners with different backgrounds and expertise, to ensure diversity of stakeholder perspectives.
Assessing screener agreement
An assessment of agreement between screeners during pilot-testing can help to ensure that the eligibility screening process is reproducible and reliable. If necessary, the eligibility criteria and/or screening process may be modified and re-tested to improve the agreement between screeners. Agreement can be assessed by: recording the observed proportions of articles where pairs of screeners agree or disagree on their eligibility decisions; calculating a reviewer agreement statistic; and/or descriptively tabulating and discussing any disagreements.
A widely used statistic for assessing screener agreement is Cohen’s kappa [34], which takes into account the level of agreement between screeners that would be expected to occur by chance. However, interpretation of kappa scores is subjective, since there is no consensus on which scores indicate ‘adequate’ agreement, and the concept of ‘adequate’ agreement is itself subjective. CEE’s Guidelines for Systematic Reviews in Environmental Management (version 4.2) [1] suggested that a minimum kappa value of 0.5 should be achieved, which was interpreted as indicating ‘substantial agreement’. When interpreting screener agreement it should also be borne in mind that potentially important discrepancies between screeners can occur even when agreement statistics indicate high overall rates of agreement (Box 2).
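For reference, Cohen’s kappa is defined in terms of the observed proportion of agreement, $p_o$, and the proportion of agreement expected by chance, $p_e$, which is derived from each screener’s marginal totals:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance.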
To assess screener agreement, a sample (as large as possible) of the articles identified in searches should be screened by at least two people and their agreement determined. The size of the sample should be justified by the review team and the articles comprising the subset should be selected randomly to avoid bias towards certain authors, topics, years or other factors.
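A minimal sketch of drawing such a random subset is shown below, assuming the search results are held as a simple list of record identifiers; the identifiers, sample fraction and random seed are illustrative only.

```python
# Minimal sketch: drawing a random subset of records for checking by a second
# screener. Record identifiers, the 10% fraction and the seed are illustrative.
import random

all_records = [f"record_{i:05d}" for i in range(1, 8001)]  # e.g. 8000 search results

random.seed(2017)  # fixing the seed makes the sample reproducible and reportable
sample_size = max(100, round(0.10 * len(all_records)))  # e.g. 10% or 100 records, whichever is larger
second_screener_sample = random.sample(all_records, sample_size)
print(f"{len(second_screener_sample)} records selected for checking by a second screener")
```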
Use of a kappa statistic to guide pilot-testing of eligibility screening where two or more people will screen each article is a pragmatic approach to optimise efficiency of the process, in which case the limitations of the agreement statistic and its somewhat arbitrary interpretation are not critical. However, recently published evidence syntheses and protocols indicate that the kappa statistic is increasingly being used for a different purpose: to demonstrate high reviewer agreement in support of employing only one screener to assess the majority of articles. The potential insensitivity of overall screener agreement measures to specific discrepancies between screeners (Box 2) suggests that a kappa statistic might not be adequate justification that a single screener’s decisions are sufficiently reliable to protect against the risk of introducing errors or selection bias.
According to the most recently published protocols, CEE evidence syntheses often assess screener agreement based on a subset of 10% of articles or 100 abstracts (whichever is the larger), although some have used 5% of articles or an unspecified ‘small proportion’ of articles. These subsets seem rather small, and it is questionable how a review team could be confident of minimising the risk of selection bias if as many as 90% of articles are not checked. We therefore recommend that as large a subset of articles as possible is screened by at least two reviewers; the ideal would be 100%.
As there is no consensus on what ‘adequate’ rates of agreement are (unless reaching 100%), the review team should justify the level of agreement reached and explain in the evidence synthesis report whether relying on a single screener may have led to any relevant studies being excluded. If so, an explanation should be given as to how this would affect interpretation of the evidence synthesis conclusions. Presentation of a decision matrix showing the combinations of screener agreements (e.g. as in Box 2) may be helpful to support any discussion and interpretation of screener reliability.
Box 2: Example of screener agreement interpretation
Screener agreement is illustrated for two screeners making three possible eligibility decisions (include, exclude or unclear) on 8000 articles. The data are hypothetical but reflect a typical evidence synthesis scenario in which the majority of articles identified in searches are excluded during screening. The overall observed agreement between screeners for these data is 99.4% and Cohen’s kappa is 0.62.
| Screener 1 | Screener 2: Include | Screener 2: Exclude | Screener 2: Unclear | Total |
|---|---|---|---|---|
| Include | 35 | 15 | 3 | 53 |
| Exclude | 18 | 7911 | 4 | 7933 |
| Unclear | 7 | 5 | 2 | 14 |
| Total | 60 | 7931 | 9 | 8000 |
The data illustrate that, despite good overall agreement as indicated by the observed agreement and the kappa score, discrepancies exist in the include/exclude decisions made by screener 1 and screener 2 which could be critical for a systematic review or systematic map (where the aim should be not to miss any relevant articles). In this example, screener 1 excluded 18 of the 60 articles which screener 2 included (30%), whilst screener 2 excluded 15 of the 53 articles which screener 1 included (28%). At these rates of agreement, employing either screener alone could result in different sets of articles being selected for inclusion.
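The quoted statistics can be reproduced directly from the decision matrix, for example with the minimal sketch below (unweighted Cohen’s kappa; small differences from the quoted values may arise from rounding).

```python
# Reproduce the Box 2 agreement statistics from the decision matrix.
# Rows: screener 1 (include, exclude, unclear); columns: screener 2.
table = [
    [35,   15, 3],   # screener 1: include
    [18, 7911, 4],   # screener 1: exclude
    [ 7,    5, 2],   # screener 1: unclear
]

n = sum(sum(row) for row in table)                        # 8000 articles
observed = sum(table[i][i] for i in range(3)) / n         # proportion of decisions agreeing
row_totals = [sum(row) for row in table]
col_totals = [sum(table[i][j] for i in range(3)) for j in range(3)]
expected = sum(row_totals[k] * col_totals[k] for k in range(3)) / n**2
kappa = (observed - expected) / (1 - expected)

print(f"Observed agreement: {observed:.2%}")  # 99.35%, quoted as 99.4% above
print(f"Cohen's kappa: {kappa:.3f}")          # ~0.61-0.62, quoted as 0.62 above
```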
Resolving disagreements
A process for resolving any disagreements between screeners should be agreed by the review team and, to ensure consistency, pre-specified in the protocol. An approach which appears to be commonly used [35], and which works efficiently in our experience, is for the screeners to meet to discuss their disagreements and reach a consensus; if consensus is not reached, a third opinion could then be sought from another member of the review team or the project advisory group. The exact approach is a matter of preference; for example, abstracts over which there is disagreement could be discussed by the screeners before proceeding to the full-text screening step (to avoid obtaining full-text articles unnecessarily), or the articles could be passed directly to the full-text screening step (to enable decisions to be based on all available information). Records of all screening decisions should be kept to ensure that, if necessary, the review team can justify their study selection. Screening decisions can often be recorded conveniently in user-definable fields in reference management tools. Pilot-testing the screening process, described below, can be helpful to identify whether some screeners differ systematically from others in the eligibility decisions they make.
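As an illustration, disagreements can be listed from recorded screening decisions for discussion, as in the minimal sketch below; the decision records and field names are hypothetical, and equivalent results can be obtained with spreadsheet filters or user-definable fields in reference management tools.

```python
# Minimal sketch: listing records on which two screeners disagree so that they
# can be discussed and resolved. Decision records and field names are hypothetical.
decisions = [
    {"id": "record_00001", "screener_1": "include", "screener_2": "include"},
    {"id": "record_00002", "screener_1": "exclude", "screener_2": "unclear"},
    {"id": "record_00003", "screener_1": "include", "screener_2": "exclude"},
]

disagreements = [d for d in decisions if d["screener_1"] != d["screener_2"]]
for d in disagreements:
    print(f"{d['id']}: screener 1 = {d['screener_1']}, screener 2 = {d['screener_2']}")
```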