Novel tools and methods for designing and wrangling multifunctional, machine-readable evidence synthesis databases

One of the most important steps in the process of conducting a systematic review or map is data extraction and the production of a database of coding, metadata and study data. There are many ways to structure these data, but to date, no guidelines or standards have been produced for the evidence synthesis community to support their production. Furthermore, there is little adoption of easily machine-readable, readily reusable and adaptable databases: these databases would be easier to translate into different formats by review authors, for example for tabulation, visualisation and analysis, and also by readers of the review/map. As a result, it is common for systematic review and map authors to produce bespoke, complex data structures that, although typically provided digitally, require considerable efforts to understand, verify and reuse. Here, we report on an analysis of systematic reviews and maps published by the Collaboration for Environmental Evidence, and discuss major issues that hamper machine readability and data reuse or verification. We highlight different justifications for the alternative data formats found: condensed databases; long databases; and wide databases. We describe these challenges in the context of data science principles that can support curation and publication of machine-readable, Open Data. We then go on to make recommendations to review and map authors on how to plan and structure their data, and we provide a suite of novel R-based functions to support efficient and reliable translation of databases between formats that are useful for presentation (condensed, human readable tables), filtering and visualisation (wide databases), and analysis (long databases). We hope that our recommendations for adoption of standard practices in database formatting, and the tools necessary to rapidly move between formats will provide a step-change in transparency and replicability of Open Data in evidence synthesis.


Why databases are integral to evidence syntheses
One of the most important steps in the process of conducting a systematic review or map is data extraction: locating and abstracting information and study findings from within each manuscript and entering these data into a specifically designed database. Methodological guidance for systematic reviews and maps advocates that these databases should be predesigned and planned from the outset to maximise consistency, efficiency and objectivity in how data are extracted [1]. However, systematic review authors (referred to hereafter as evidence 'synthesists') typically design databases from scratch for each review project, and there has been no attempt to standardise systematic review data formats. Whilst systematic review management tools (e.g. CADIMA; [2]) each have their own standard format for storing and exporting data extracted within a systematic review or systematic map, there is no effort to ensure consistency across tools.
Furthermore, there is little adoption of easily machinereadable, readily reusable and adaptable databases. Providing digital versions of review or map does not mean that these data are easily understandable, or provided in a format that can be readily reused without considerable time and effort needed to 'wrangle' the data into a usable format. Review and map databases that are provided in a consistent and clear format that obeys certain standards and rules would be easier to translate into different formats: not only by review authors themselves (for example when producing narrative tables, visualisations and continuing to data analysis) but also by readers of the review/ map if they wish to interrogate the data, replicate the analyses or reuse data for other purposes.
There are many ways to structure data, from graph databases [3], delimited text format, to json files [4]. As evidence synthesists, it is often not possible to select a database format that will be optimal for all use cases. Instead, we suggest that the structure the data takes should be carefully considered and planned. Our purpose here is to delineate between structured (i.e. readily machine readable) and visualised (i.e. human readable) data. We believe that both formats are important; and, in particular, stress that human readable tables are not sufficient for computational reproducibility and interoperability.
Efforts have been made to define best practice in computational work with data; a researcher may adhere to, for example, FAIR (findable, accessible, interoperable, and reusable) principles [5], or the TRUST (transparency, responsibility, user focus, sustainability, technology) principles for digital repositories [6], but precisely how to implement these principles with the tools available for a given discipline is left to the researcher.
As the drive towards Open Science and Open Synthesis picks up pace [7,8], consistency in the way the Open Data (the immediate publication of full research data free-of-charge; [9]) are presented is important to maximise transparency, usability and legacy of data from evidence syntheses.
In evidence synthesis, there are very sensible reasons to structure a data table by study, so that a reader can easily read and summarise vertically, horizontally, between studies, and within sub-tables. This structure is optimised for the reader, but it is a significant barrier to the extensibility or reuse of the research, now considered a best practice in scientific computing [10]. Wilson et al. [11] suggest that we should aim for 'good enough' practices. As such, it is likely a researcher reading this manuscript would not be trained in databases, but, like the authors, would be an evidence synthesist, wanting to ensure they practice science diligently. In order to bridge this toolchain (i.e. a set of programming tools) gap between best practices in scientific computing and evidence synthesists' practice, we provide some examples of data structures and tools. Our central focus is to underscore that the choice of data structure for publication should be a fundamental component of the review project planning-outlined in the review protocol-and authors should ensure that: (1) the data are provided in a machine-readable format with human-readable summaries; and (2) the structure is clearly documented (i.e. the data provided are described and explained in detailed meta-data).

Why we should care about good data curation and Open Data
The Open Science movement has been instrumental in raising awareness and interest in the benefits of Open Data across disciplines [12]. Open Science has recently been applied specifically to evidence synthesis via the concept of Open Synthesis [7]. Various benefits to Open Data have been cited, including: increasing opportunities for reuse and further analysis (reducing research waste; [13]); facilitating real-time sharing and use of data [14]; increasing research visibility, discovery, impact and recognition [15]; facilitating research validity through replication and verification [16]; decreasing the risk of research fraud through transparency [17]; use of real research in educational materials [17]; facilitating collaborative research and reducing redundancy and research waste across siloed groups [18]; enabling public understanding [18]; increased potential to impact policy [19]; promotion of citizen science [20]. These benefits are the same for Open Data in evidence synthesis (i.e. systematic reviews and maps). Indeed, synthesists are often plagued by incomplete reporting of data and meta-data in primary studies [21]: we should therefore be keenly aware of our mandate to ensure we comply with Open Data principles for the same reasons.
Systematic reviews and maps involve the management of large, complex datasets, often with multiple dependencies and nesting: for example, multiple outcomes for a single intervention, multiple interventions within a single study, multiple studies within a single manuscript, and multiple manuscripts on single studies. Added to this, synthesists must comply with strict demands for rigour in the way data extraction is planned, executed and reported (e.g. the MECCIR standards for conduct of systematic reviews; https ://onlin elibr ary.wiley .com/page/ journ al/18911 803/homep age/autho r-guide lines ). Careful data curation in evidence synthesis is vital to avoid errors that easily propagate and threaten the validity of these substantial projects: particularly where multiple synthesists are working on the same evidence base simultaneously and over long time periods that frequently involve staff turnover. Where used, systematic review management tools (e.g. SysRev.com) have come a long way to reduce unnecessary errors, but there remains no consistent approach to data extraction across platforms that hampers Open Synthesis [7].

Objectives
As described above, most systematic reviews and maps design bespoke data extraction tools. These databases obviously share similarities, for example citation information, study year, location, etc., and there are necessary differences depending on the focus of the review. However, each database uses different conventions related to column names, cell content formatting and structure. Because of the lack of templates across the evidence synthesis community, synthesists often learn the hard way that databases designed for data extraction are often not suitable for immediate visualisation or analysis. In addition, data cannot be readily combined across databases, meaning that overlapping reviews cannot share resources in data extraction.
We outline below several common problems with systematic review and map databases, and then go on to describe how adopting a standard template for database design can help synthesists to wrangle their data automatically for rapid visualisation and analysis, whilst also facilitating Open Data principles. We have the following objectives that sit within a broader theory of change related to improved data management in evidence syntheses: • to provide recommendations in data formatting and propose a methodology for designing evidence synthesis databases. • stimulate the production of tools to allow rapid reformatting of data between types suitable for reading and types suitable for analysis. • to support Open Synthesis as a community, facilitating synthesis and reuse of syntheses. • improve the legacy and impact of syntheses in the future.
Here, we provide one solution to the current problem of messy data: 'structured' databases, in which data are shared in tables wherein each row denotes an observation and each column a variable [33], and a series of context-agnostic tools to translate databases between different formats for specific uses. Although we focus here on systematic review and map databases, our tools and recommendations are equally applicable to any form of evidence synthesis and usable in any form of database.
There are other options, for example, relational databases and graph databases [22]. Many of these solutions will work well together (e.g. relational databases can be based on these principles), but what we provide here is an easily understandable concept that does not require specialist software or knowledge of databases. It is also based on principles of interoperability with the most common statistical and programming language, R [23]. By building a database using the principles described herein, users can effortlessly move from a data extraction phase into a data synthesis phase (particularly for quantitative synthesis) without the need for manual (time consuming) data manipulation. This solution also allows a single database to be used for multiple types of synthesis; e.g. visualisation in flow diagrams, heat maps, evidence atlases and diagnostic plots as well as data analysis in meta-analysis.
Although our examples are provided in the R coding environment and therefore require some basic knowledge of coding, we also intend to provide user interfaces that allow translation of databases with no prior coding experience necessary.
We begin by introducing common formats for evidence synthesis databases, and then describe how databases have been designed and published in Environmental Evidence. We then outline how existing data science tools can support efficient and transparent database design, translation and use. Finally, we provide recommendations and methods for designing databases and introduce tools for rapidly converting between formats for different purposes.

Evidence synthesis data formats
Broadly speaking, there are four formats for systematic review and map databases: 'condensed' formats, 'wide' formats, 'long' formats, and 'wide-and-long' formats (see Fig. 1).
Condensed formats are easy to comprehend, often with nested column headers to denote related variables. We have termed these 'condensed' because they are used in an attempt to both visualise the dataset in a tabular format for the reader and contain sufficient information to be a comprehensive dataset. They typically revolve around the principle of one-line-per-study, such that the study is implied to be the level of data independence. However, these databases are not readily machine readable (i.e. they cannot be immediately filtered, visualised or analysed) where multiple values exist in one cell, for example where studies contain multiple outcomes, and particularly where these outcomes were measured in different ways.
Where a database contains multiple columns each with multiple values per cell, the implicit linkages between cell values are assumed to apply for all values of all cells: in the example in Fig. 2, each of the three values in column A are associated with each of the three values in column B: Condensed formats are perhaps the most common in CEE reviews (see "The status quo in environmental evidence syntheses" section for an analysis of recent reviews).
Wide format databases also typically consist of each row as a study, but avoid multiple values in a single cell that hamper filtering and linking between variables by separating different levels of a variable across multiple columns. Where a database has 10 variables extracted for each study, and each variable has 5 possible values, the database would require at least 50 columns for these data. In order to clearly denote which columns related to the same variable, some form of nesting is required: visually this could be done by having a hierarchy of column headers, but this cannot be readily read by a machine. Better yet, column names should follow standard naming conventions to display this nesting: for example, the variable 'colour' could be split across three columns named 'colour.yellow' , 'colour.red' , and 'colour.blue' . These naming conventions help to identify nesting by a human and machine reader. Despite not having multiple values per cell, wide formats can still suffer from linkage issues where multiple independent data are coded within a single line.
Long format databases obey the principle of each row being an independent data point. This is typical where a study has only one independent set of data (e.g. one method, one outcome, one time period), but will often involve multiple rows per study. Long databases need to track the relationships between rows carefully, and typically do this by using identification codes to denote a particular article, study, location, etc. within which multiple data points are nested. Careful naming conventions are still important for these databases, but each column will contain all levels of a variable, with multiple levels split across different lines, so naming is simpler.
Wide-and-long formats are a combination of wide and long database formats: for example, they will have a single line per outcome but different factors of a variable will be split across multiple columns.
The various advantages and disadvantages of each database format are outlined in Table 1.

The status quo in environmental evidence syntheses
We undertook an analysis of all systematic reviews and maps published by the Collaboration for Environmental Evidence between May 2012 and May 2019 to investigate how synthesists have structured their databases and to what extent the information is reusable. We identified all reviews and maps by hand searching the journal Environmental Evidence (on 05/05/19). We then examined each review and extracted the meta-data outlined in Table 2. This information was checked by a second reviewer.
We identified a total of 29 systematic reviews and 18 systematic maps (see Additional file 1). A total of 44 of these 47 syntheses provided a digitised database of  were long, 2 was wide-and-long, 4 were relational formats, and 5 provided multiple formats (syntheses that used multiple formats are also counted in individual categories) (see Fig. 3). We noted several other issues that would cause problems in machine readability, most commonly that 11 syntheses used multiple different separators for cell values, for example combining commas and semicolons in the same database. Furthermore, 6 databases included multiple separators in the same column. In one case, coding was used inconsistently within the same column, with the same codes being described using different grammar and spelling [24]. Some authors avoided multiple cell values by using 'multiple' as a cell content, with inherent loss of information (e.g. [25,26]). One synthesis provided the database in a Microsoft Word format with its contents available as image files, and thus entirely non-machine readable [27].
Human readability is classified as both an advantage and disadvantage. Data structure for human interpretation has benefits for summary between and within studies, but it poses significant disadvantages for incorporating into future analyses and codeflows. Extracting embedded variables that have been summarised into one column or row presents an ongoing obstacle for evidence synthesists. The combined use of value separators makes it very challenging to extract single values from a column of a condensed database, particularly where responses are free text strings and may contain common value separators, like commas, semicolons or slashes.

A structured revolution for systematic review and map data
How to structure data in systematic reviews/maps Systematic review authors can easily integrate standard structured data principles when designing and publishing their databases. We provide the following advice on designing and using databases within systematic reviews and maps: • Systematic review authors should consider publishing databases of descriptive information about the included studies (similar to a systematic map database; a list of included studies along with their metadata-i.e. not just the study findings but all other extracted descriptive information alongside each study citation): this is rarely done but increases the utility of the review's contents, by allowing users to learn about the nature of the evidence base as well as its findings. • Provide detailed explanatory notes that describe the column titles, and the database structure and conventions in a data dictionary file or tab of the database worksheet. Within reason, this information should ideally be sufficient to understand the data without needing to carefully read the manuscript.

Standardised formats for database design and naming conventions
Whilst there are a range of naming conventions and options for database structures in systematic reviews, we propose the following practices to act as a standard across the community to maximise efficiency, reuse and training.

Differentiate between databases for visualisation and databases for analysis
The evidence synthesis community should delineate between making databases visually appealing for rapid understanding, versus databases that can be reused and immediately machine read. This should not require double the work: here, we advocate for a single, planned database designed for data extraction that can be wrangled into a variety of formats for different purposes using computational methods.
Synthesists should consider including both visual databases for human-readable tables and machine-readable data tables for analysis and replication: synthesists are already encouraged to provide condensed tables in a narrative synthesis (Collaboration for Environmental Evidence 2018). However, we believe that all tables should be provided in a digital, machine readable format as well as providing in-text tables to maximise legacy and impact of their work. Haddaway et al. Environ Evid (2021) 10:5

Avoid multiple values per cell
Along with human-readable tables, synthesists should also provide either a wide or long (or wide-and-long) format database as an Additional file 1. There should ideally be a move away from the preference for providing only a condensed format that compresses multiple values into a single cell. Despite being visually appealing and easy to populate, these are difficult to interact with.
snake_case naming convention for column/variable names 'Naming conventions' are consistent ways of naming variables in a database that make it easier to recognise and use variable names in software languages like R and Python. In essence these are sets of rules for describing variables that cannot be easily named using single unique words or short strings. They are necessary because spaces within a variable name cause problems for tools that automatically detect where one entity stops and another begins. For example, the variable "study location" would ideally be converted into a single string. This could be done using a range of separators to make it easier for the human coder to use: studloc, study.location, study_location, studyLocation, study-location, etc. Naming conventions also facilitate automatic wrangling of data and make it easier to produce Open Source software to switch between database formats.
There are many different naming conventions; for example, camelCase, snake_case, kebab-case, SCREAMINGCASE, and combinations thereof [28]. Our examples below focus on the statistical programming language R, in which there is a convention of separating words with a full stop: for example, read.csv. However, this creates problems in other languages; in Python, full stops have additional language-specific functionality and create problems when used as variable names. Thus, current best practice is to use snake_case-lower case and separate words with an underscore [29]-thus improving interoperability.
• Consistency in whichever approach is chosen Whatever database format and naming convention is chosen, synthesists should ensure that they use this consistently and do not use multiple formats within a single database (e.g. combining naming conventions or being inconsistent in the use of wide and long formats).

• Develop and use tools to translate between database formats
We have developed a series of tools to support database translation in line with the principles described herein. We also provide a number of 'toolchain walkthroughs' combining existing functions together that demonstrate how they can be applied in practice to existing databases (https ://softl oud.githu b.io/sysre vdata /). We describe the relative roles of each format and example code for converting between them here. These methods combine existing code to wrangle data between formats commonly seen in systematic reviews and maps. The code is freely accessible and adaptable (i.e. published under CC-BY license), and may also be a useful basis for the production of Open Source tools with user interfaces to maximise usability by those with limited coding skills.
In our toolchain walkthroughs, we provide examples of restructuring data to illustrate the nuances and challenges of effectively sharing data for evidence synthesis. Communicating data effectively with collaborators and other scientists is not easy, see, for example, a recent retraction (https ://lasko wskil ab.facul ty.ucdav is.edu/2020/01/29/retra ction s), time for the development of required research software must increasingly be factored into the research plan. That being said, best practice in scientific computing is so challenging that best practices was updated to the more realistic good enough practices [11]. We are thus not claiming that our solutions are perfect, but rather fit-for-purpose, and act as a vital starting point for further development and optimisation. No doubt, code may need adapting for specific contexts, particularly where data has not been structured efficiently from the start.
Here The following function makes use of the separate_ rows() function in the 'tidyr' package [30] to split rows where a specific column has multiple values per cell into different lines. The default separator used below is '|||' , as used in the free-to-use review management tool SysRev (http://sysre v.com). This solution is an example of the type of bespoke wrapper code that can be created using existing tools, such as the metapackages 'metaverse' and 'tidyverse' . Note that though brief, the syntax of each of the tools may take time to familiarise oneself with. This function accepts a dataframe, and splits multiple values (separated by '|||') within a given column into rows, duplicating all other information in the dataframe for that row. The backslashes are necessary 'escape characters' to tell R that '|' does not have a purpose other than being a character in the string. See Fig. 4a for a hypothetical run-through. Example 2 Pivot from one row per level of a variable to one column per level of a variable (long-to-wide) This function takes the long version of the database and converts each unique level of a variable into a column, and populates the resulting columns for rows with corresponding codes. It makes use of the pivot_wider() function from the 'tidyr' package [30]. The resultant database has one line per study, and each column is prefixed by the original column name for clarity. The to_snake_case() function from the 'snakecase' package [31] converts the column names into snake_case (i.e. words separated by underscores). See Fig. 4b for a hypothetical run-through. use by other researchers (and their future selves). They should document this structure with a data dictionary, by stating what each column in a table refers to, and what each row defines. Synthesists should not be afraid to provide multiple tables, that is, a collection of tables. They should provide details as to what information each table provides, and how they connect to each other (which variables are common or primary keys). As the evidence synthesis community is still relatively unaware of data structures, consider providing readers with at least one resource on data structures; for example, entry-level texts such as 'R Packages' [32] and the seminal 'Tidy Data' [33] may be appropriate.

Example 3
Compressing multiple columns into a condensed format (wide-to-condensed) The following function makes use of the 'tidyr' package [30] function unite() to take a wide database, consisting of one column per value of a variable, with different variables indicated by a column name prefix (e.g. 'column1_'), and produces a condensed table, with multiple values per cell, separated by ';' (demonstrating that the user can easily alter the delimiting separator within a translation process). The output retains the same number of rows as the wide dataframe. See Fig. 4c for a hypothetical run-through.

Documentation and resources (data dictionary)
Whatever choices evidence synthesists make in designing their databases, they should structure their data for

Conclusions
It is clear from the literature that Open Data and Open Synthesis bring a range of benefits to the research community and evidence users [7,[34][35][36][37][38][39]. The way in which we secure this Open Synthesis future, however, is up for debate. Our tools aim to facilitate a pathway to a clear and easy future for Open Syntheses by establishing standard practices in the design and publication of databases of evidence produced within systematic review and map projects. We summarise our key recommendations in Table 3.
We appreciate that changing practices across a community will not be an easy task, with a potentially steep learning curve, issues around a lack of awareness, and resistance to change. However, we believe that these obstacles can be easily overcome by providing templates and tools that allow users to design, use and publish databases with ease (https ://softl oud.githu b.io/sysre vdata /). Indeed, by embracing standard approaches to database design, we believe that the job of producing and using databases can be made far easier through automation tools (e.g. using EviAtlas; [40]). We eagerly anticipate future toolchain walkthroughs that provide subdiscipline-or tool-specific code for structuring data for evidence synthesis.
Future work is needed to develop resources to support awareness raising and adoption of these tidy data principles, and efforts are underway to produce these. Simultaneously, digital tools are needed to allow synthesists to immediately convert between human-and machine-readable formats, and between different machine-readable formats (i.e. long to wide) for different purposes. In addition, systematic review and map databases should be static and unchanged unless accompanied by clear methodological justification and reporting via a registered and peer-reviewed protocol and final report: these are the central principles of rigorous evidence synthesis. However, the advent of living systematic reviews [41] (and maps) implies the need for guidance and procedures on how databases should be updated transparently and reliably. Clear versioning may be the solution, but this is outside our intended scope and should be tackled by the living evidence synthesis community.
However, these tools will only be useful if the community of evidence synthesists embraces the need for consistency, the need for Open Data and Open Synthesis and the benefits for tidy systematic review and map databases. In doing so, we can maximise the efficiency of the conduct, updating and upgrading of evidence syntheses.

Table 3 Key recommendations for multifunctional, machine readable systematic review and map databases
Objective Description

Select the right database format
Carefully plan what format databases are the most appropriate for data extraction, visualisation, and analysis 2. Ensure design allows translation Select a database format and population method that allow translation between wide, long and condensed formats. Avoid condensing information where a loss of information or data linkage occurs 3. Choose an appropriate naming convention Particularly for wide formats, use machine readable column names for related variables to facilitate aggregation. Populate cells with actual data labels rather than using binary labels ('0' and '1') to facilitate aggregation 4. Publish the highest resolution database Provide the database with the highest level of resolution and information as a digital, machine readable additional file, from which condensed or other formats can be translated 5. Document your data Include a detailed data dictionary or meta-data file/worksheet is included in the database file that fully describes each variable/column and coding, that is self-sufficient without the need to read the manuscript methods to understand