Why databases are integral to evidence syntheses
One of the most important steps in the process of conducting a systematic review or map is data extraction: locating and abstracting information and study findings from within each manuscript and entering these data into a specifically designed database. Methodological guidance for systematic reviews and maps advocates that these databases should be predesigned and planned from the outset to maximise consistency, efficiency and objectivity in how data are extracted [1]. However, systematic review authors (referred to hereafter as evidence ‘synthesists’) typically design databases from scratch for each review project, and there has been no attempt to standardise systematic review data formats. Whilst systematic review management tools (e.g. CADIMA; [2]) each have their own standard format for storing and exporting data extracted within a systematic review or systematic map, there is no effort to ensure consistency across tools.
Furthermore, there is little adoption of easily machine-readable, readily reusable and adaptable databases. Providing digital versions of review or map does not mean that these data are easily understandable, or provided in a format that can be readily reused without considerable time and effort needed to ‘wrangle’ the data into a usable format. Review and map databases that are provided in a consistent and clear format that obeys certain standards and rules would be easier to translate into different formats: not only by review authors themselves (for example when producing narrative tables, visualisations and continuing to data analysis) but also by readers of the review/map if they wish to interrogate the data, replicate the analyses or reuse data for other purposes.
There are many ways to structure data, from graph databases [3], delimited text format, to json files [4]. As evidence synthesists, it is often not possible to select a database format that will be optimal for all use cases. Instead, we suggest that the structure the data takes should be carefully considered and planned. Our purpose here is to delineate between structured (i.e. readily machine readable) and visualised (i.e. human readable) data. We believe that both formats are important; and, in particular, stress that human readable tables are not sufficient for computational reproducibility and interoperability.
Efforts have been made to define best practice in computational work with data; a researcher may adhere to, for example, FAIR (findable, accessible, interoperable, and reusable) principles [5], or the TRUST (transparency, responsibility, user focus, sustainability, technology) principles for digital repositories [6], but precisely how to implement these principles with the tools available for a given discipline is left to the researcher.
As the drive towards Open Science and Open Synthesis picks up pace [7, 8], consistency in the way the Open Data (the immediate publication of full research data free-of-charge; [9]) are presented is important to maximise transparency, usability and legacy of data from evidence syntheses.
In evidence synthesis, there are very sensible reasons to structure a data table by study, so that a reader can easily read and summarise vertically, horizontally, between studies, and within sub-tables. This structure is optimised for the reader, but it is a significant barrier to the extensibility or reuse of the research, now considered a best practice in scientific computing [10]. Wilson et al. [11] suggest that we should aim for ‘good enough’ practices. As such, it is likely a researcher reading this manuscript would not be trained in databases, but, like the authors, would be an evidence synthesist, wanting to ensure they practice science diligently. In order to bridge this toolchain (i.e. a set of programming tools) gap between best practices in scientific computing and evidence synthesists’ practice, we provide some examples of data structures and tools. Our central focus is to underscore that the choice of data structure for publication should be a fundamental component of the review project planning—outlined in the review protocol—and authors should ensure that: (1) the data are provided in a machine-readable format with human-readable summaries; and (2) the structure is clearly documented (i.e. the data provided are described and explained in detailed meta-data).
Why we should care about good data curation and Open Data
The Open Science movement has been instrumental in raising awareness and interest in the benefits of Open Data across disciplines [12]. Open Science has recently been applied specifically to evidence synthesis via the concept of Open Synthesis [7]. Various benefits to Open Data have been cited, including: increasing opportunities for reuse and further analysis (reducing research waste; [13]); facilitating real-time sharing and use of data [14]; increasing research visibility, discovery, impact and recognition [15]; facilitating research validity through replication and verification [16]; decreasing the risk of research fraud through transparency [17]; use of real research in educational materials [17]; facilitating collaborative research and reducing redundancy and research waste across siloed groups [18]; enabling public understanding [18]; increased potential to impact policy [19]; promotion of citizen science [20]. These benefits are the same for Open Data in evidence synthesis (i.e. systematic reviews and maps). Indeed, synthesists are often plagued by incomplete reporting of data and meta-data in primary studies [21]: we should therefore be keenly aware of our mandate to ensure we comply with Open Data principles for the same reasons.
Systematic reviews and maps involve the management of large, complex datasets, often with multiple dependencies and nesting: for example, multiple outcomes for a single intervention, multiple interventions within a single study, multiple studies within a single manuscript, and multiple manuscripts on single studies. Added to this, synthesists must comply with strict demands for rigour in the way data extraction is planned, executed and reported (e.g. the MECCIR standards for conduct of systematic reviews; https://onlinelibrary.wiley.com/page/journal/18911803/homepage/author-guidelines). Careful data curation in evidence synthesis is vital to avoid errors that easily propagate and threaten the validity of these substantial projects: particularly where multiple synthesists are working on the same evidence base simultaneously and over long time periods that frequently involve staff turnover. Where used, systematic review management tools (e.g. SysRev.com) have come a long way to reduce unnecessary errors, but there remains no consistent approach to data extraction across platforms that hampers Open Synthesis [7].
Objectives
As described above, most systematic reviews and maps design bespoke data extraction tools. These databases obviously share similarities, for example citation information, study year, location, etc., and there are necessary differences depending on the focus of the review. However, each database uses different conventions related to column names, cell content formatting and structure. Because of the lack of templates across the evidence synthesis community, synthesists often learn the hard way that databases designed for data extraction are often not suitable for immediate visualisation or analysis. In addition, data cannot be readily combined across databases, meaning that overlapping reviews cannot share resources in data extraction.
We outline below several common problems with systematic review and map databases, and then go on to describe how adopting a standard template for database design can help synthesists to wrangle their data automatically for rapid visualisation and analysis, whilst also facilitating Open Data principles. We have the following objectives that sit within a broader theory of change related to improved data management in evidence syntheses:
-
to provide recommendations in data formatting and propose a methodology for designing evidence synthesis databases.
-
stimulate the production of tools to allow rapid reformatting of data between types suitable for reading and types suitable for analysis.
-
to support Open Synthesis as a community, facilitating synthesis and reuse of syntheses.
-
improve the legacy and impact of syntheses in the future.
Here, we provide one solution to the current problem of messy data: ‘structured’ databases, in which data are shared in tables wherein each row denotes an observation and each column a variable [33], and a series of context-agnostic tools to translate databases between different formats for specific uses. Although we focus here on systematic review and map databases, our tools and recommendations are equally applicable to any form of evidence synthesis and usable in any form of database.
There are other options, for example, relational databases and graph databases [22]. Many of these solutions will work well together (e.g. relational databases can be based on these principles), but what we provide here is an easily understandable concept that does not require specialist software or knowledge of databases. It is also based on principles of interoperability with the most common statistical and programming language, R [23]. By building a database using the principles described herein, users can effortlessly move from a data extraction phase into a data synthesis phase (particularly for quantitative synthesis) without the need for manual (time consuming) data manipulation. This solution also allows a single database to be used for multiple types of synthesis; e.g. visualisation in flow diagrams, heat maps, evidence atlases and diagnostic plots as well as data analysis in meta-analysis. Although our examples are provided in the R coding environment and therefore require some basic knowledge of coding, we also intend to provide user interfaces that allow translation of databases with no prior coding experience necessary.
We begin by introducing common formats for evidence synthesis databases, and then describe how databases have been designed and published in Environmental Evidence. We then outline how existing data science tools can support efficient and transparent database design, translation and use. Finally, we provide recommendations and methods for designing databases and introduce tools for rapidly converting between formats for different purposes.