Enabling Reusable and Reproducible Genomic Data Management and Analysis in R

Author(s): Qian Liu

Affiliation(s): Roswell Park Comprehensive Cancer Center

Social media: https://twitter.com/QianLiu28878838

Efficient management and analysis of genomic data is becoming increasingly challenging due to the growing volume and complexity of these data and of public resources, especially with the widespread adoption of the FAIR (findability, accessibility, interoperability, and reusability) data principles and organizational requirements for Data Management and Sharing Plans. Currently, data management and analysis rely heavily on ad hoc scripts run on-premises, which are often difficult to reproduce because they are tightly coupled to a specific computing environment. The lack of standardized software tracking and data annotation strategies can also lead to repeated computation and duplicate data files, impeding research productivity and hindering scientific collaboration. Workflow frameworks, such as the Common Workflow Language (CWL), connect command-line tools into portable and reproducible workflows that run across diverse platforms. However, implementing them requires a level of expertise that is often beyond the capabilities of genomic researchers and even skilled data analysts. In response to these challenges, we have developed Rcwl, a user-friendly R interface for CWL that enables easy development, use and maintenance of CWL pipelines within R. It converts command-line tools and functions into standardized, workflow-based tool recipes that are ready to be evaluated and submitted within the R environment. We have collected and pre-built more than 200 tools and pipelines that are frequently used in genomic research and made them available through the Bioconductor package RcwlPipelines, where researchers can easily query, use or customize them in their own analyses.
We have then reimagined the workflow concept behind Rcwl-based “tool recipes” in the context of genomic data management and developed the Bioconductor package ReUseData, which facilitates the conversion of ad hoc data processing scripts into workflow-based “data recipes”. With additional data annotation and tracking strategies, ReUseData allows for the reproducible generation of curated data sets and promotes data reuse across projects. This is particularly useful for essential genomic resources, such as variant annotation files from ClinVar and COSMIC, that are required to interpret experimental data in biological terms. The tool set we are introducing here adopts workflow infrastructure, containerization and Conda environments to enable users to implement reproducible and streamlined data analysis within a unified R environment. The standardized data management also makes the data more easily reusable, reproducible and interoperable with data analysis tools that are available as R/Bioconductor packages, command-line tools and analysis workflows based on CWL, WDL, Nextflow or Snakemake, across diverse computing environments such as personal computers, institutional clusters and cloud computing platforms.