Bioc2020RCWL4devel.Rmd
knitr::opts_chunk$set(message = FALSE)
Rcwl
The workshop is intended for developers only, where detailed instructions are included for how to use Rcwl
to wrap command line tools, and build/customize data analysis pipelines. Topics of executing tools/pipeline in parallel are included. The Shiny app is also introduced here, which will go through some improvements later. A user manual for package usage and examples is available on this website
Activity | Time |
---|---|
Wrap command line tools using Rcwl
|
15 min |
Customize your pipelines using Rcwl
|
15 min |
The Rcwl
package is aimed at a simple and user-friendly way to manage command line tools and build data analysis pipelines in R using Common Workflow Language (CWL). The Rcwl
and RcwlPipelines
packages are available in Bioc 3.11 and R >= 3.6. You can install them by the BiocManager
package.
if (!requireNamespace("RcwlPipelines", quietly = TRUE)) BiocManager::install(c("Rcwl", "RcwlPipelines")) library(Rcwl) library(RcwlPipelines)
In addition to the R packages, the following tools are required to be installed to run the examples in this document.
The cwltool
is the reference implementation of the Common Workflow Language, which is used to run the CWL scripts. The nodejs
is required when the CWL scripts use the JavaScript language. The Docker containers simplify software installation and management. A CWL runner can pull the required Docker containers automatically and adjust the paths of input files.
You can find instructions to install these tools here:
The main class and constructor function is cwlParam
, which wraps a command line tool and its parameters in a cwlParam
object. Let’s start with a simple example, echo hello world
.
First, we load the package and then define the input parameter for “echo”, a string without a prefix. Just an id
option is required.
input1 <- InputParam(id = "sth")
Second, we create a cwlParam
object with baseCommand
for the command to execute and InputParamList
for the input parameters.
echo <- cwlParam(baseCommand = "echo", inputs = InputParamList(input1))
Now we have a command object to run. Let’s send a string “Hello World!” to the object. Without defining the outputs, it will stream standard output to a temporary file by default.
echo$sth <- "Hello World!" echo
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: echo
## inputs:
## sth (string): Hello World!
## outputs:
## output:
## type: stdout
We also defined some convenient accessor function for the cwlParam
object.
cwlVersion(echo) ## "v1.0"
## [1] "v1.0"
cwlClass(echo) ## "CommandLineTool", can be ""
## [1] "CommandLineTool"
baseCommand(echo) ## "echo"
## [1] "echo"
inputs(echo) ## command line arguments and values
## inputs:
## sth (string): Hello World!
outputs(echo) ## the name of output results/files to be saved out
## outputs:
## output:
## type: stdout
## to pass to later steps in pipeline arguments(echo) ## command line arguments that pass by default, no
## list()
## need to assign values.
The function runCWL
is used to run the CWL object by invoking the python tool cwltool
. The returned value is a list with the shell command executed, temporary output and logs. The output directory is the current folder by default, but it can be changed by setting outdir
argument. All standard out and standard error stream can also be printed by setting stderr = ""
.
## List of length 3
## names(3): command output logs
First Let’s check the output file to make sure the command was executed successfully.
r1$output
## [1] "/tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543"
readLines(r1$output)
## [1] "Hello World!"
Here is the actual command line code (genearted from running runCWL
) that was executed by cwltools
.
r1$command
## [1] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] /tmp/vopg2q67$ echo \\"
## [2] " 'Hello World!' > /tmp/vopg2q67/962b12fad463694ac70325deb4a59990b82ba543"
The log shows the details of how the cwltool
works with CWL scripts.
r1$log
## [1] "\033[1;30mINFO\033[0m /usr/local/bin/cwltool 3.0.20200807132242"
## [2] "\033[1;30mINFO\033[0m Resolved '/tmp/RtmpZjf3Pi/file4cffa6f1c0.cwl' to 'file:///tmp/RtmpZjf3Pi/file4cffa6f1c0.cwl'"
## [3] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] /tmp/vopg2q67$ echo \\"
## [4] " 'Hello World!' > /tmp/vopg2q67/962b12fad463694ac70325deb4a59990b82ba543"
## [5] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] completed success"
## [6] "{"
## [7] " \"output\": {"
## [8] " \"location\": \"file:///tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543\","
## [9] " \"basename\": \"962b12fad463694ac70325deb4a59990b82ba543\","
## [10] " \"class\": \"File\","
## [11] " \"checksum\": \"sha1$a0b65939670bc2c010f4d5d6a0b3e4e4590fb92b\","
## [12] " \"size\": 13,"
## [13] " \"path\": \"/tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543\""
## [14] " }"
## [15] "}"
## [16] "\033[1;30mINFO\033[0m Final process status is success"
Internally, the runCWL
generates two scripts: the “.cwl” configuration file and the “.yml” file containing input parameter values. The location and prefix name of these two files are defined with the argument prefix
, with the default value as tempfile()
.
The cwltool
parses the two scripts and translates them into shell command shown in r1$command
. Since the output prameters are not defined in the echo
object, so the output was returned to stdout by default with the filepath in r1$output
.
When defining the input parameters, three arguments are usually needed: id, type, and prefix. The id is a name for the input parameter which could be used later to assign values for that parameter, The type can be string, int, long, float, and double, etc. The prefix is the command line prefix that can be in varied formats based on users’ habit. More detail can be found at: <htps://www.commonwl.org/v1.0/CommandLineTool.html#CWLType>.
Here is an example from the CWL user guide (http://www.commonwl.org/user_guide/03-input/). We defined the echo
with different type of input parameters by InputParam
, and the stdout
option can be used to capture the standard output stream into a file:
e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f") e2 <- InputParam(id = "string", type = "string", prefix = "-s") e3 <- InputParam(id = "int", type = "int", prefix = "-i") e4 <- InputParam(id = "file", type = "File", prefix = "--file=", separate = FALSE) echoA <- cwlParam(baseCommand = "echo", inputs = InputParamList(e1, e2, e3, e4), stdout = "output.txt") echoA
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: echo
## inputs:
## flag (boolean): -f
## string (string): -s
## int (int): -i
## file (File): --file=
## outputs:
## output:
## type: stdout
## stdout: output.txt
Then let’s assign values to the input parameters.
echoA$flag <- TRUE echoA$string <- "Hello" echoA$int <- 1 tmpfile <- tempfile() write("World", tmpfile) echoA$file <- tmpfile r2 <- runCWL(echoA, outdir = tempdir()) r2$command
## [1] "\033[1;30mINFO\033[0m [job file4cf7985c3fe.cwl] /tmp/zkhynmwn$ echo \\"
## [2] " --file=/tmp/tmpnd1bwl_n/stg2fa095a3-f489-4959-9b86-484720dab8cd/file4cf511526d0 \\"
## [3] " -f \\"
## [4] " -i \\"
## [5] " 1 \\"
## [6] " -s \\"
## [7] " Hello > /tmp/zkhynmwn/output.txt"
When one command line parameter takes multiple values, there are three different ways to define an array input.
a1 <- InputParam(id = "A", type = "string[]", prefix = "-A") ## most common, -A a b c a2 <- InputParam(id = "B", ## -B=d, -B=e, -B=f type = InputArrayParam(items = "string", prefix="-B=", separate = FALSE)) a3 <- InputParam(id = "C", type = "string[]", prefix = "-C=", ## -C=g,h,i itemSeparator = ",", separate = FALSE) echoB <- cwlParam(baseCommand = "echo", inputs = InputParamList(a1, a2, a3))
We then set values for the three inputs and see how each of them print in the command line:
echoB$A <- echoB$B <- echoB$C <- letters[1:3] echoB
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: echo
## inputs:
## A (string[]): -A a b c
## B:
## type: array
## prefix: -B= a b c
## C (string[]): -C= a b c
## outputs:
## output:
## type: stdout
Now we can check whether the command behaves as we expected.
## [1] "\033[1;30mINFO\033[0m [job file4cf443da4fe.cwl] /tmp/xuajzthq$ echo \\"
## [2] " -A \\"
## [3] " a \\"
## [4] " b \\"
## [5] " c \\"
## [6] " -B=a \\"
## [7] " -B=b \\"
## [8] " -B=c \\"
## [9] " -C=a,b,c > /tmp/xuajzthq/57bdf92ebd35aaf8fb67b10f2cc69b021f19dc9e"
readLines(r3$output)
## [1] "-A a b c -B=a -B=b -B=c -C=a,b,c"
The outputs, similar to the inputs, is a list of output parameters. Three options, id, type and glob, are usually required to be defined. The id and type are similar to InputParam
, and the glob option is used to define a pattern to find files relative to the output directory.
Here is an example to unzip a compressed gz
file. First, we generate a compressed R script file:
zzfil <- file.path(tempdir(), "sample.R.gz") zz <- gzfile(zzfil, "w") ## open for writing in text mode. Can use "open=rt" to "readLines()" cat("sample(1:10, 5)", file = zz, sep = "\n") ## a string close(zz)
The command line to do this job looks like:
$ gzip -d -c sample.R.gz
We then define a cwlParam
object to use “gzip” to uncompress an input file:
z1 <- InputParam(id = "decomp", type = "boolean", prefix = "-d") z2 <- InputParam(id = "stdout", type = "boolean", prefix = "-c") z3 <- InputParam(id = "zfile", type = "File") ofile <- "sample.R" o1 <- OutputParam(id = "rfile", type = "File", glob = ofile) gz <- cwlParam(baseCommand = "gzip", inputs = InputParamList(z1, z2, z3), outputs = OutputParamList(o1), ## this is to glob the stdout file into "output" folder. stdout = ofile) ## "stdout" is to specify the stdout file.
Now the gz
object can be used to decompress the previously generated compressed file:
gz$decomp <- TRUE ## $ gzip -d gz$stdout <- TRUE ## $ gzip -d -c gz$zfile <- zzfil ## gzip -d -c sample.R.gz r4 <- runCWL(gz, outdir = tempdir()) readLines(r4$output)
## [1] "sample(1:10, 5)"
## print(r4$command)
When building the cwl tool using runCWL
, We can use arguments
argument to include some parameters with default values so that when users use this tool, they don’t need to assign values for all (especially some trivial) parameters.
z1 <- InputParam(id = "zfile", type = "File") o1 <- OutputParam(id = "rfile", type = "File", glob = ofile) gz1 <- cwlParam(baseCommand = "gzip", arguments = list("-d", "-c"), inputs = InputParamList(z1), outputs = OutputParamList(o1), stdout = ofile) gz1
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: gzip
## arguments: -d -c
## inputs:
## zfile (File):
## outputs:
## rfile:
## type: File
## outputBinding:
## glob: sample.R
## stdout: sample.R
## [1] "/tmp/RtmpZjf3Pi/sample.R"
readLines(r4a$output)
## [1] "sample(1:10, 5)"
To make it for general usage, we can define a pattern with javascript to glob the output, which requires node
to be installed in your system PATH:
pfile <- "$(inputs.zfile.path.split('/').slice(-1)[0].split('.').slice(0,-1).join('.'))"
Or we can directly use the CWL built-in file property, nameroot
:
pfile <- "$(inputs.zfile.nameroot)" o2 <- OutputParam(id = "rfile", type = "File", glob = pfile) req1 <- list(class = "InlineJavascriptRequirement") gz2 <- cwlParam(baseCommand = c("gzip", "-d", "-c"), requirements = list(), ## assign list(req1) if node installed. inputs = InputParamList(z1), outputs = OutputParamList(o2), stdout = pfile) gz2$zfile <- zzfil r4b <- runCWL(gz2, outdir = tempdir())
We can also capture multiple output files by defining a pattern in glob
. In the following example, we are only extracting text files to our designated output folder in runCWL(outdir="")
.
a <- InputParam(id = "a", type = InputArrayParam(items = "string")) b <- OutputParam(id = "b", type = OutputArrayParam(items = "File"), glob = "*.txt") touch <- cwlParam(baseCommand = "touch", inputs = InputParamList(a), outputs = OutputParamList(b)) touch$a <- c("a.txt", "b.gz", "c.txt") r5 <- runCWL(touch, outdir = tempdir()) r5$output
## [1] "/tmp/RtmpZjf3Pi/a.txt" "/tmp/RtmpZjf3Pi/c.txt"
The CWL can also work in high performance clusters with batch-queuing system, such as SGE, PBS, SLURM and so on, using the Bioconductor package BiocParallel
. Here is an example to submit jobs with “Multicore” and “SGE”, separately:
library(BiocParallel) demolist <- as.list(LETTERS[1:6]) names(demolist) <- LETTERS[1:6] ## submit with mutlicore res1 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(content = demolist), BPPARAM = MulticoreParam(4)) readLines(res1[[1]]$output) ## submit with SGE res2 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(content = demolist), BPPARAM = BatchtoolsParam(workers = 4, cluster = "sge", resources = list(queue = "all.q")))
Here we build a tool with different types of input parameters, and then use echo
to print them out:
e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f", doc = "boolean flag") e2 <- InputParam(id = "string", type = "string", prefix = "-s") e3 <- InputParam(id = "option", type = "string", prefix = "-o") e4 <- InputParam(id = "int", type = "int", prefix = "-i", default = 123) e5 <- InputParam(id = "file", type = "File", prefix = "--file=", separate = FALSE) e6 <- InputParam(id = "array", type = "string[]", prefix = "-A", doc = "separated by comma") ## "doc" webEcho <- cwlParam(baseCommand = "echo", id = "webEcho", label = "Test parameter types", inputs = InputParamList(e1, e2, e3, e4, e5, e6), stdout = "output.txt") webEcho
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: echo
## inputs:
## flag (boolean): -f
## string (string): -s
## option (string): -o
## int (int): -i 123
## file (File): --file=
## array (string[]): -A
## outputs:
## output:
## type: stdout
## stdout: output.txt
Some input parameters can be predefined in a list, which will be converted to selected options in the webapp. An upload
parameter can be used to generate an upload interface for the file type option. If set asFALSE
, the upload field will be text input (file path) instead of file input.
library(shiny) inputs(webEcho) ## define 2 options for the input parameter "option" inputList <- list(option = c("option1", "option2")) app <- cwlShiny(webEcho, inputList, upload = TRUE) runApp(app)
webEcho
.
Rcwl
is also designed to wrap any R function into a cwl tool and then use directly into a pipeline.
anyfun <- function(x) x rtool <- cwlParam(baseCommand = anyfun, inputs = InputParamList(InputParam(id="anyinput", type = "string"))) rtool$anyinput <- "abcd" ## assign values for input parameter rtoolres <- runCWL(rtool) ## execute the CWL script readLines(rtoolres$output) ## check results
## [1] "[1] \"abcd\""
We have collected all recipes (R scripts)that build CWL tools and pipelines in a GitHub repository now (https://github.com/hubentu/RcwlRecipes), This repository is intended to be a community effort for bioinformatics tools and pipelines built with Rcwl and CWL directly.
Three core functions are defined for convenient usage of these recipes:
cwlUpdate
: sync the recipes.cwlSearch
: search tools or pipelines using keywords.cwlInstall
: source the R object of CWL tool or pipeline into R working environment.The cwlUpdate
function updates the recipe scripts from the GitHub repository and collects meta data to a local cache internally using the BiocFileCache
package. By default the local cache will be created under your home directory. Here we use temporary directory as an example.
## class: BiocFileCache
## bfccache: /tmp/RtmpZjf3Pi/file4cf26d2d8cb
## bfccount: 130
## For more information see: bfcinfo() or bfcquery()
bfcinfo(tools) ## show a tibble list with all available tools. Use
## # A tibble: 130 x 13
## rid rname create_time access_time rpath rtype fpath last_modified_t… etag
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 BFC1 pl_a… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 2 BFC2 pl_A… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 3 BFC3 pl_B… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 4 BFC4 pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 5 BFC5 pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 6 BFC6 pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 7 BFC7 pl_C… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 8 BFC8 pl_G… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 9 BFC9 pl_G… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## 10 BFC10 pl_h… 2020-10-15… 2020-10-15… /tmp… local /tmp… NA <NA>
## # … with 120 more rows, and 4 more variables: expires <dbl>, Type <chr>,
## # Command <chr>, Container <chr>
## "methods(class="BiocFileCache")" to see all ## available functions for an "BiocFileCache" object. table(bfcinfo(tools)$Type)
##
## pipeline tool
## 26 104
data.frame(bfcinfo(tools)[1,])
## rid rname create_time access_time
## 1 BFC1 pl_alignMerge 2020-10-15 18:55:31 2020-10-15 18:55:31
## rpath rtype
## 1 /tmp/RtmpZjf3Pi/file4cf26d2d8cb/RcwlRecipes-master/Rcwl/pl_alignMerge.R local
## fpath
## 1 /tmp/RtmpZjf3Pi/file4cf26d2d8cb/RcwlRecipes-master/Rcwl/pl_alignMerge.R
## last_modified_time etag expires Type Command Container
## 1 NA <NA> NA pipeline bwaAlign + mergeBamDup <NA>
The function cwlSearch
helps to search indexed recipes with keywords. In the following example, let’s find the alignment tool using keywords “bwa” and “mem”.
## [1] "tl_bwa"
The function cwlInstall
“installs” the tools or pipelines to current environment using the name or the file path to the script. a
cwlInstall(tl$rname) cwlInstall(tl$fpath) ## equivalent bwa
## class: cwlParam
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: bwa mem
## requirements:
## - class: DockerRequirement
## dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
## inputs:
## threads (int): -t
## RG (string): -R
## Ref (File):
## FQ1 (File):
## FQ2 (File?):
## outputs:
## sam:
## type: File
## outputBinding:
## glob: '*.sam'
## stdout: bwaOutput.sam
Piplines can be easily built using the available tools. Here we use the tools for mapping and marking duplicates to build a simple alignment pipeline. First, we need to check whether the required tools (bwa, samtools and picard markduplicates) are available in our repository.
tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup", tools) %>% dplyr::filter(Type == "tool") %>% dplyr::select(rname, rpath, Command, Container) tls
## # A tibble: 6 x 4
## rname rpath Command Container
## <chr> <chr> <chr> <chr>
## 1 tl_bwa_in… /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… bwa index biocontainers/bwa:v…
## 2 tl_bwa /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… bwa mem biocontainers/bwa:v…
## 3 tl_markdup /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… picard Mark… quay.io/biocontaine…
## 4 tl_sam2bam /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools vi… biocontainers/samto…
## 5 tl_samtoo… /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools in… biocontainers/samto…
## 6 tl_sortBam /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools so… biocontainers/samto…
Let’s load all the tools into R environment.
Next, we can define the input parameters for the whole pjipeline.
p1 <- InputParam(id = "threads", type = "int") p2 <- InputParam(id = "RG", type = "string") p3 <- InputParam(id = "Ref", type = "File", secondaryFiles = c(".amb", ".ann", ".bwt", ".pac", ".sa")) p4 <- InputParam(id = "FQ1", type = "File") p5 <- InputParam(id = "FQ2", type = "File?")
Then we define the pipeline steps using the function Step
. This function returns stepParam
objects, which contain the internal relation with input and output between steps. In the following example, it defines the pipeline steps to process raw fastqs files into duplicates marked alignments.
## bwa s1 <- Step(id = "bwa", run = bwa, In = list(threads = "threads", RG = "RG", Ref = "Ref", FQ1 = "FQ1", FQ2 = "FQ2")) ## sam to bam s2 <- Step(id = "sam2bam", run = sam2bam, In = list(sam = "bwa/sam")) ## the input is the output of "bwa" ## sort bam s3 <- Step(id = "sortBam", run = sortBam, In = list(bam = "sam2bam/bam")) ## mark duplicates ## s4 <- Step(id = "markdup", run = markdup, ## In = list(ibam = "sortBam/sbam", ## obam = list( ## valueFrom="$(inputs.ibam.nameroot).mdup.bam"), ## matrix = list( ## valueFrom="$(inputs.ibam.nameroot).markdup.txt"))) ## index bam s4 <- Step(id = "idxBam", run = samtools_index, In = list(bam = "sortBam/sbam"))
Last, we define the outputs for the whole pipeline and connect all steps.
## req1 <- list(class = "StepInputExpressionRequirement") ## req2 <- list(class = "InlineJavascriptRequirement") ## outputs ## o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam") o1 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx") ## stepParam Align <- cwlStepParam(requirements = list(), inputs = InputParamList(p1, p2, p3, p4, p5), outputs = OutputParamList(o1)) ## build pipeline Align <- Align + s1 + s2 + s3 + s4
The pipeline is ready for use. We can plot the pipeline with plotCWL
from the Rcwl
package.
plotCWL(Align)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Align
.
Now let’s test the pipeline with some real data.
sysdir <- system.file(package="Bioc2020RCWL") ids <- c("normal", "tumor") fq1 <- list.files(file.path(sysdir, "vignettes", "extdata"), pattern="1.fq.gz", full.names = TRUE) fq2 <- list.files(file.path(sysdir, "vignettes", "extdata"), pattern="2.fq.gz", full.names = TRUE) fq1 <- as.list(fq1) fq2 <- as.list(fq2) rg <- as.list(paste("@RG", paste0("ID:", ids), paste0("SM:", ids), sep = "\\t")) names(fq1) <- names(fq2) <- names(rg) <- ids inputList <- list(RG = rg, FQ1 = fq1, FQ2 = fq2) paramList <- list(threads = 2, Ref = file.path(sysdir, "vignettes", "extdata", "ref.fa")) result <- runCWLBatch(cwl = Align, outdir = tempdir(), inputList, paramList, BPPARAM = MulticoreParam(2), stderr = "", cwlTemp=TRUE, docker = FALSE)
Check the results: