Connecting Bioconductor to other bioinformatics tools using Rcwl

Instructor(s) name(s) and contact information

Workshop Description

This workshop introduces the Bioconductor toolchain for usage and development of reproducible bioinformatics pipelines using packages of Rcwl and RcwlPipelines. The Common Workflow Language (CWL) is an open standard for the development of data analysis workflows that is portable and scalable across different tools and working environments. Rcwl provides a simple way to wrap command line tools and build CWL data analysis pipelines programmatically within R. It increases the ease of use, development, and maintenance of CWL pipelines, and offers a higher performance by intuitively supporting parallel work with high performance computing (HPC). More than a hundred of pre-built and tested CWL tool/pipeline recipes are included in RcwlPipelines package. These tools and pipelines are highly modularized for easy customization of complex bioinformatics analysis. Here we included a scRNAseq preprocessing pipeline that uses STARsolo for alignment and quantification, and DropletUtils for filtering raw gene-barcode matrix and removing empty droplets. This pipeline demonstrates the typical use case of our packages. More details for usage and examples are available on Rcwl website: https://hubentu.github.io/Rcwl/.

Pre-requisites

  • Basic knowledge of using R and Bioconductor packages
  • Basic familiarity with running command-line tools
  • No prior experience with CWL is necessary!

Workshop Participation

Participants will be able to try out all of the functionality described. Active user participation throughout the event is highly encouraged including but not limited to lecture material, hands-on sections and final discussion.

Some basic ideas about how CWL works. * https://www.commonwl.org/user_guide/

R / Bioconductor packages used

System dependencies * cwltool * docker

Time outline

Activity Time
Overview of bioinformatics pipelines and CWL (ppt) 10 min
Use core functions to update, search and load tools/pipelines 5 min
Use single cell indexing tool STARindex 5 min
Use single cell alignment tool STARsolo 5 min
Use single cell filtering tool DropletUtils 5 min
Use single cell preprocessing pipeline STARsoloDropletUtils 5 min
Wrap command line tools using Rcwl 0m (developers only)
Customize your pipelines using Rcwl 0m (developers only)

Workshop goals and objectives

Learning goals

  • Basic knowledge of Common Workflow Language (CWL)
  • Knowledge of R/Bioconductor interface of CWL
  • Usage of the pre-built bioinformatics tools and pipelines in R
  • Understand how to wrap command line tools with Rcwl (developers only)
  • Understand how to build bioinformatics pipelines with Rcwl (developers only)

Learning objectives

  • Run the scRNA-seq indexing tool STARindex in R
  • Run the scRNA-seq alignment tool STARsolo in R
  • Run the scRNA-seq QC tool DropletUtils in R
  • Run the scRNA-seq preprocessing pipeline which combines alignment and QC steps in R
  • Create a basic echo tool using Rcwl (developers only)
  • Build a basic DNAseq pipeline using RCWL (developers only)

Introduction to Rcwl & RcwlPipelines

Please see details in the slides.

single cell RNA sequencing data

Data resources

We have prepared all the files needed in this workshop. By initiating the docker image for this workshop, you should already have it available in the filepath (path) below. We have also mounted your local working directory to this rstudio session (outpath) to store the intermediate or result files from running the examples.

path <- "~/inst/testdata"  ## source data
outpath <- "~/outdir"      ## output files
dir.create(outpath, showWarnings = FALSE)

The single cell data resource we are using today is the 1k PBMCs from 10x genomics, which consists of 1000 Peripheral blood mononuclear cells (PBMCs) extracted from a healthy donor, where PBMCs are primary cells with relatively small amounts of RNA (~1pg RNA/cell).

The source material consists of 6 FASTQ files split into two sequencing lanes L001 and L002, each with three reads of R1 (barcodes), R2 (cDNA sequences), I1 (illumina lane infold). 10x Genomics has its own processing pipeline, Cell Ranger to process the scRNA-seq outputs it produces (and requires many configurations to run and is significantly slower than other mappers), which requires all three files to perform the demultiplexing and quantification, but STARsolo does not require the I1 lane file to perform the analysis. The source files can be found here.

For this tutorial, we will use datasets sub-sampled from the source files to contain only 15 cells instead of 1000. We have also further curated the data to only include reads on chromosome 21, so that the real execution of our CWL tools/pipelines in R can be done within 1~2 minutes for each step.

subset15_chr21_pbmc_1k_v3_S1_L001_R1_001.fastq.gz
subset15_chr21_pbmc_1k_v3_S1_L001_R2_001.fastq.gz
subset15_chr21_pbmc_1k_v3_S1_L002_R1_001.fastq.gz
subset15_chr21_pbmc_1k_v3_S1_L002_R2_001.fastq.gz

For the mapping, we need a “whitelist” of known cell barcodes. Here we used the 15 barcodes that we already subsetted. The original 737,000 barcodes can be freely extracted from the Cell Ranger pipeline.

subset15_demo_barcode.txt

The barcodes in the R1 FASTQ data are checked against these known cell barcodes in order to assign a specific read to a specific known cell. The barcodes are designed in such a manner that there is virtually no chance that they will align to a place in the reference genome.

In this tutorial, we will be using the hg19 (GRCh37) version of the human genome, and will therefore also need to use a hg19 GTF file to annotate our reads (on chromosome 21 only for this demo).

chr21.fa
Homo_sapiens.GRCh37.75.21.gtf

Rcwl version of scRNA-seq preprocessing steps

In this tutorial, we are basically trying to reproduce the GALAXY pipeline for scRNA-seq data pre-processing. We will use STARsolo to produce a count matrix from FASTQ, and DropletUtils to produce a high-quality count matrix with feature/cell annotation files saved in an R object of SingleCellExperiment. Before these 2 steps, we have also added a one-time indexing step in our workshop.

With Rcwl, we can easily write cwl tools and pipelines programmatically in R (the other vignette called “Bioc2020RCWL4devel” covers more of this topic for developers). We have pre-built and tested more than a hundred of tools/pipelines which can be loaded using RcwlPipelines (tools and pipelines developing scripts can be found here: https://github.com/hubentu/RcwlRecipes/tree/master/Rcwl).

For usage of the existing tools/pipelines, Three major steps are needed: 1) Search and load the tools/pipelines, 2) assign values for each of the defined parameters, 3) execute the tools/pipelines. All steps are done programmatically in R, and then we can get the results ready in the user-specified directory.

Loading the tools

Here we show the usage of 3 core functions: cwlUpdate, cwlSearch and cwlLoad for updating, searching, and loading the needed tools or pipelines in R.

  • cwlUpdate

The cwlUpdate syncs the current Rcwl recipes and return a BiocFileCache object which contains the most updated Rcwl recipes.

library(Rcwl)
library(RcwlPipelines)
atls <- cwlUpdate()  ## sync the tools/pipelines
atls
## class: BiocFileCache
## bfccache: /github/home/.cache/Rcwl
## bfccount: 130
## For more information see: bfcinfo() or bfcquery()

the bfcinfo() function returns a BiocFileCache tibble containing all information about each available tool or pipeline. Currently, we have integrated 103 command line tools and 26 pipelines.

bfcinfo(atls)
## # A tibble: 130 x 13
##    rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##    <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
##  1 BFC1  pl_a… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  2 BFC2  pl_A… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  3 BFC3  pl_B… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  4 BFC4  pl_b… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  5 BFC5  pl_b… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  6 BFC6  pl_b… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  7 BFC7  pl_C… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  8 BFC8  pl_G… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
##  9 BFC9  pl_G… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
## 10 BFC10 pl_h… 2020-10-15… 2020-10-15… /git… local /git…               NA <NA> 
## # … with 120 more rows, and 4 more variables: expires <dbl>, Type <chr>,
## #   Command <chr>, Container <chr>
table(bfcinfo(atls)$Type)
## 
## pipeline     tool 
##       26      104
  • cwlSearch

We can use (multiple) keywords to search for any specific tools/pipelines of interest, which internally search the keywords against the BiocFileCache tibble in columns of “rname”, “rpath”, “fpath”, “Command” and “Containers”.

tls <- cwlSearch(c("STAR", "index"))
data.frame(tls)
##      rid        rname         create_time         access_time
## 1  BFC18 pl_rnaseq_Sf 2020-10-15 18:51:17 2020-10-15 18:51:17
## 2 BFC111 tl_STARindex 2020-10-15 18:51:36 2020-10-15 18:51:36
##                                                             rpath rtype
## 1 /github/home/.cache/Rcwl/RcwlRecipes-master/Rcwl/pl_rnaseq_Sf.R local
## 2 /github/home/.cache/Rcwl/RcwlRecipes-master/Rcwl/tl_STARindex.R local
##                                                             fpath
## 1 /github/home/.cache/Rcwl/RcwlRecipes-master/Rcwl/pl_rnaseq_Sf.R
## 2 /github/home/.cache/Rcwl/RcwlRecipes-master/Rcwl/tl_STARindex.R
##   last_modified_time etag expires     Type
## 1                 NA <NA>      NA pipeline
## 2                 NA <NA>      NA     tool
##                                                                                                                                                Command
## 1 fastqc + STAR + sortBam + samtools_index + samtools_flagstat + featureCounts + RSeQC + gtfToGenePred + genePredToBed + read_distribution + gCoverage
## 2                                                                                                                                                 STAR
##                              Container
## 1                                 <NA>
## 2 quay.io/biocontainers/star:2.7.5a--0
  • cwlLoad

The last core function is cwlLoad which sources the Rcwl script for the interested tool/pipeline into the R working environment.

cwlLoad(tls$rname[2])  ## "tl_STARindex"
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: STAR 
## requirements:
## - class: DockerRequirement
##   dockerPull: quay.io/biocontainers/star:2.7.5a--0
## arguments: --runMode genomeGenerate 
## inputs:
##   genomeDir (string): --genomeDir STARindex
##   genomeFastaFiles (File): --genomeFastaFiles 
##   sjdbGTFfile (File): --sjdbGTFfile 
##   runThreadN (int): --runThreadN 4
## outputs:
## outIndex:
##   type: Directory
##   outputBinding:
##     glob: $(inputs.genomeDir)
cwlLoad(tls$fpath[2])  ## equivalent to the above. 
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: STAR 
## requirements:
## - class: DockerRequirement
##   dockerPull: quay.io/biocontainers/star:2.7.5a--0
## arguments: --runMode genomeGenerate 
## inputs:
##   genomeDir (string): --genomeDir STARindex
##   genomeFastaFiles (File): --genomeFastaFiles 
##   sjdbGTFfile (File): --sjdbGTFfile 
##   runThreadN (int): --runThreadN 4
## outputs:
## outIndex:
##   type: Directory
##   outputBinding:
##     glob: $(inputs.genomeDir)
STARindex
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: STAR 
## requirements:
## - class: DockerRequirement
##   dockerPull: quay.io/biocontainers/star:2.7.5a--0
## arguments: --runMode genomeGenerate 
## inputs:
##   genomeDir (string): --genomeDir STARindex
##   genomeFastaFiles (File): --genomeFastaFiles 
##   sjdbGTFfile (File): --sjdbGTFfile 
##   runThreadN (int): --runThreadN 4
## outputs:
## outIndex:
##   type: Directory
##   outputBinding:
##     glob: $(inputs.genomeDir)

Now the Rcwl version of STARindex is successfully loaded into R and ready to use.

Indexing

Before the alignment and QC in the aboved mentioned GALAXY pipeline, we will add an indexing step using the tool STARindex. The command line for indexing looks like this:

$ STAR --runMode genomeGenerate --runThreadN 4 --genomeDir STARindex 
--genomeFastaFiles chr21.fa --sjdbGTFfile Homo_sapiens.GRCh37.75.21.gtf

Using the Rcwl version of STARindex, we can equivalently do the same indexing within R which was internally passed as a cwl script. What we need to do is assign values for the input parameters, and execute the cwl script using runCWL function.

cwlVersion(STARindex)
## [1] "v1.0"
inputs(STARindex)
## inputs:
##   genomeDir (string): --genomeDir STARindex
##   genomeFastaFiles (File): --genomeFastaFiles 
##   sjdbGTFfile (File): --sjdbGTFfile 
##   runThreadN (int): --runThreadN 4
outputs(STARindex)
## outputs:
## outIndex:
##   type: Directory
##   outputBinding:
##     glob: $(inputs.genomeDir)
requirements(STARindex)
## [[1]]
## [[1]]$class
## [1] "DockerRequirement"
## 
## [[1]]$dockerPull
## [1] "quay.io/biocontainers/star:2.7.5a--0"

Now let’s assign values for these parameters.

inputs(STARindex)
STARindex$genomeFastaFiles <- file.path(path, "chr21.fa")
STARindex$sjdbGTFfile <- file.path(path, "Homo_sapiens.GRCh37.75.21.gtf")
STARindex  ## values (filepatt) are added
writeCWL(STARindex, prefix = file.path(outpath, "STARindex"))  ## 2 files .cwl/.yml 

The runCWL(docker = ) takes 3 values: - TRUE: on the HPC, runtime is docker, and will pull docker images for the required command line tools. - FALSE: on your local computer, with all command line tools pre-installed. - “singularity”: On the HPC, runtime is singularity.

system.time(
    runCWL(cwl = STARindex, outdir = file.path(outpath, "STARindex_output"),
           docker = FALSE, showLog = TRUE)
    ) ## elapsed: 74.795S
dir(file.path(outpath, "STARindex_output"), recursive = TRUE)

Now that the output files are generated in the folder of “STARindex_output/STARindex” under outpath. These output files are ready to pass as input to the next tool for single cell read alignment.

Alignment

To align single cell reads using STAR, it can be done directly in command line:

$ STAR --soloType CB_UMI_Simple --genomeDir STARindex_output/STARindex --soloCBwhitelist subset15_demo_barcode.txt --soloUMIlen 12 --readFilesIn subset15_chr21_pbmc_1k_v3_S1_L001_R2_001.fastq.gz,subset15_chr21_pbmc_1k_v3_S1_L002_R2_001.fastq.gz subset15_chr21_pbmc_1k_v3_S1_L001_R1_001.fastq.gz,subset15_chr21_pbmc_1k_v3_S1_L002_R1_001.fastq.gz --soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts --readFilesCommand gzcat --runThreadN 1

Now we can follow the previous example to load our Rcwl version of STARsolo and run the command in R.

STARsolo <- cwlLoad("tl_STARsolo")  ## "tl_STARsolo"
STARsolo
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: STAR 
## requirements:
## - class: DockerRequirement
##   dockerPull: quay.io/biocontainers/star:2.7.5a--0
## arguments: --readFilesCommand zcat --soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts 
## inputs:
##   readFilesIn_cdna (File[]): --readFilesIn 
##   readFilesIn_cb (File[]):  
##   genomeDir (Directory): --genomeDir 
##   whiteList (File): --soloCBwhitelist 
##   soloType (string): --soloType CB_UMI_Simple
##   soloUMIlen (string): --soloUMIlen 12
##   runThreadN (int): --runThreadN 1
## outputs:
## outAlign:
##   type: File
##   outputBinding:
##     glob: '*.sam'
## outLog:
##   type: File[]
##   outputBinding:
##     glob: Log*
## SJ:
##   type: File
##   outputBinding:
##     glob: SJ.out.tab
## Solo:
##   type: Directory
##   outputBinding:
##     glob: Solo.out
inputs(STARsolo)
## inputs:
##   readFilesIn_cdna (File[]): --readFilesIn 
##   readFilesIn_cb (File[]):  
##   genomeDir (Directory): --genomeDir 
##   whiteList (File): --soloCBwhitelist 
##   soloType (string): --soloType CB_UMI_Simple
##   soloUMIlen (string): --soloUMIlen 12
##   runThreadN (int): --runThreadN 1
cdna.fastq <- file.path(path, list.files(path, pattern = "_R2_"))
cb.fastq <- file.path(path, list.files(path, pattern = "_R1_"))
cblist <- file.path(path, "subset15_demo_barcode.txt")
genomeDir <- file.path(outpath, "STARindex_output/STARindex")

STARsolo$readFilesIn_cdna <- cdna.fastq
STARsolo$readFilesIn_cb <- cb.fastq
STARsolo$whiteList <- cblist
STARsolo$genomeDir <- genomeDir
STARsolo
writeCWL(STARsolo, prefix = file.path(outpath, "STARsolo"))
runCWL(STARsolo, outdir = file.path(outpath, "STARsolo_output"),
       docker = FALSE, showLog = TRUE)  ## approximately 2 mins
dir(file.path(outpath, "STARsolo_output"), recursive = TRUE)

Now we have all files generated in the “STARsolo_output” folder under outpath, which can be passed into the next tool.

DropletUtils

To get a high quality count matrix we must apply the DropletUtils tool, which will produce a filtered dataset that is more representative of the Cell Ranger pipeline.

Since CWL itself doesn’t support the integration of R packages or R function, this is a unique feature for Rcwl, where we can easily connect the upstream data preprocessing steps (previously based on command line tools) and the downstream data analysis steps (heavily done in R/Bioconductor).

The idea here is to put anything you want to do into an R function, with specified arguments for input and output files, then it’s ready to be wrapped as an Rcwl tools for execution.

For example, in using the package DropletUtils (https://github.com/hubentu/RcwlRecipes/blob/master/Rcwl/tl_DropletUtils.R), we mainly did 3 things: 1) use the read10xCounts function to read the output files from the alignment step and convert into a SingleCellExperiment object. 2) Calculate the barcode ranks and plotting. 3) calcualte the empty droplets and plotting.

cwlLoad("tl_DropletUtils")
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: R function 
## inputs:
##   dirname (Directory): dir.name= 
##   lower (int): lower= 100
##   df (int): df= 20
## outputs:
## plots:
##   type: File
##   outputBinding:
##     glob: '*.pdf'
## outsce:
##   type: File
##   outputBinding:
##     glob: '*.rds'
inputs(DropletUtils)
## inputs:
##   dirname (Directory): dir.name= 
##   lower (int): lower= 100
##   df (int): df= 20
DropletUtils$dirname <- file.path(outpath, "STARsolo_output/Solo.out")
DropletUtils$lower <- 100
DropletUtils$df <- 5
runCWL(DropletUtils, outdir = file.path(outpath, "dropletUtils_output"), showLog = TRUE)
dir(file.path(outpath, "dropletUtils_output"))

Now that we get 2 output files:

  • The pdf file with 2 diagnostic figures: Barcode ranks, and empty droplets. Details for interpretation of each diagnostic figure please refer to the DropletUtils vignette
  • The SingleCellExperiment object which has filtered out unqualified cells and analysis ready.
#> Biobase::openPDF(file.path(outpath, "dropletUtils_output/diagnostics.pdf"))
sce <- readRDS(file.path(outpath, "dropletUtils_output/sce_filtered.rds"))
sce

scRNA-seq preprocessing pipeline

Alternatively and more easily, we can connect these tools and make a pipeline, which is already available in RcwlPipelines called pl_STARsoloDropletUtils. It has integrated the STARsolo and DropletUtils for a streamlined pre-processing analysis within R. These pipelines are ready for customization for your own research.

In a pipeline, we only need to assign input values for the whole pipeline, not individual tools involved. The input and output between each step are pre-defined in the pipeline to ensure a smooth passing.

cwlLoad("pl_STARsoloDropletUtils")
## class: cwlStepParam 
##  cwlClass: Workflow 
##  cwlVersion: v1.0 
## inputs:
##   fastq_cdna (File[]):  
##   fastq_cb (File[]):  
##   genomeDir (Directory):  
##   whiteList (File):  
##   runThreadN (int):  
## outputs:
## sam:
##   type: File
##   outputSource: STARsolo/outAlign
## Solo:
##   type: Directory
##   outputSource: STARsolo/Solo
## sce:
##   type: File
##   outputSource: DropletUtils/outsce
## plots:
##   type: File
##   outputSource: DropletUtils/plots
## steps:
##   STARsolo:
##     run: STARsolo.cwl
##     readFilesIn_cdna: fastq_cdna
##     readFilesIn_cb: fastq_cb
##     genomeDir: genomeDir
##     whiteList: whiteList
##     runThreadN: runThreadN
##     out: outAlign outLog SJ Solo
##   DropletUtils:
##     run: DropletUtils.cwl
##     dirname: STARsolo/Solo
##     out: plots outsce
plotCWL(STARsoloDropletUtils)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
inputs(STARsoloDropletUtils)
STARsoloDropletUtils$fastq_cdna <- cdna.fastq
STARsoloDropletUtils$fastq_cb <- cb.fastq
STARsoloDropletUtils$genomeDir <- file.path(outpath, "STARindex_output/STARindex")
STARsoloDropletUtils$whiteList <- cblist
STARsoloDropletUtils$runThreadN <- 1
runCWL(STARsoloDropletUtils, outdir = file.path(outpath, "scpipeline_output"),
       docker = FALSE, showLog = TRUE)

The overall output of the pipeline was pre-defined to glob the important files from separate steps.

dir(file.path(outpath, "scpipeline_output"), recursive = TRUE)
outputs(STARsoloDropletUtils)

Summary and some additional comments

There are some other cool functionalities that we haven’t included in this workshop. For example, the runCWLBatch() function is designed for execution of CWL pipelines in high performance computing with support of different job submitting systems, such as slurm, SGE, etcs. The cwlShiny opens a user-friendly shiny interface for any Rcwl tools.

As a summary, this workshop we have introduced the usage of two packages: Rcwl and RcwlPipelines in constructing and executing the CWL tools/pipelines with R for the previously command line tools as well as customized R functions. The pre-built tools and pipelines are highly modularized and optimized for easy customization for specific data analysis needs. These packages are under active development, and we welcome any question for the functionalities, feature requests (support site, email), and issue reports (https://github.com/hubentu/Rcwl/issues, https://github.com/hubentu/RcwlPipelines/issues).

Most importantly, we are trying to make this project as a community effort for developing and sharing of specific sets of tools and pipelines in their bioinformatics domains. We look forward to any collaborations in developing the pipelines and please feel free to make your pull requests for your recipes here.

Acknowledgement

This work was supported by Clinical and Translational Science Award [UL1TR001412 or KL2TR001413] to the University at Buffalo from the NIH National Center for Advancing Translational Sciences.