Use R to Create and Execute Reproducible CWL Workflows for Genomic Research

Author(s): Qian Liu

Affiliation(s): Roswell Park Comprehensive Cancer Center

Social media: https://twitter.com/QianLiu28878838

The bioinformatics community increasingly relies on ‘workflow’ frameworks to manage the analysis of large and complex biomedical data. One solution that supports portable, reproducible and scalable workflows across platforms is the Common Workflow Language (CWL), which has been widely adopted by the community, including large biomedical projects such as The Cancer Genome Atlas and Galaxy, and cloud computing platforms such as the Cancer Genomics Cloud (CGC) and CAVATICA. However, because CWL is a domain-specific language, implementing it requires a level of expertise that is often beyond genomic researchers and even skilled data scientists. In addition, the impact of CWL pipelines is weakened by poor integration with downstream statistical analysis tools such as R and Bioconductor. Here, we introduce a Bioconductor toolchain, built on Rcwl and RcwlPipelines, for using and developing reproducible, workflow-based bioinformatics pipelines. Rcwl provides a familiar R interface to CWL and expands its scope. It helps users convert command-line tools, R packages and R functions into workflow-based tool recipes in R, and connect those recipes into reproducible analysis pipelines that can be evaluated and submitted from within the R environment. RcwlPipelines manages more than 200 pre-built tool and pipeline recipes for commonly used bioinformatics tools, such as BWA and STAR. Researchers can easily query, use and customize these recipes to fit their own analysis needs. This workshop will demonstrate these tools by applying them to RNA-seq data analysis.
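As a rough illustration of what wrapping a command-line tool as a recipe can look like, the sketch below follows the pattern used in the Rcwl and RcwlPipelines package vignettes. It is not taken from the workshop material. It assumes that both packages are installed from Bioconductor, that a CWL runner such as cwltool is available on the system, and that there is network access for the recipe search. The choice of `echo` as the wrapped tool and the input id `sth` are illustrative only.

```r
# A minimal sketch, assuming Rcwl and RcwlPipelines are installed and
# a CWL runner (for example cwltool) is available on the PATH.
library(Rcwl)
library(RcwlPipelines)

# Describe one input of the command-line tool; the id "sth" is an
# arbitrary, illustrative name.
input1 <- InputParam(id = "sth")

# Wrap the shell command `echo` as a CWL tool recipe.
echo <- cwlProcess(baseCommand = "echo",
                   inputs = InputParamList(input1))

# Assign a value to the input, then execute the recipe from R.
echo$sth <- "Hello World!"
res <- runCWL(echo, outdir = tempdir())

# Pre-built recipes: sync the recipe collection (needs network access),
# then search it by keyword. "bwa" is used here because the abstract
# names BWA as one of the available tools.
cwlUpdate()
hits <- cwlSearch("bwa")
hits
```

A recipe found this way can then be loaded with `cwlLoad()` by its recipe name and its inputs assigned in the same `$` style as above. The exact recipe names depend on the current state of the recipe collection, so none is hard-coded here.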