knitr::opts_chunk$set(message = FALSE)

Connecting Bioconductor to other bioinformatics tools using `Rcwl`

Instructor(s) name(s) and contact information

Workshop Description

The workshop is intended for developers only, where detailed instructions are included for how to use Rcwl to wrap command line tools, and build/customize data analysis pipelines. Topics of executing tools/pipeline in parallel are included. The Shiny app is also introduced here, which will go through some improvements later. A user manual for package usage and examples is available on this website

R / Bioconductor packages used

System dependencies * cwltool * docker

Time outline

Activity	Time
Wrap command line tools using `Rcwl`	15 min
Customize your pipelines using `Rcwl`	15 min

Workshop goals and objectives

Learning goals

Understand how to wrap command line tools with Rcwl
Understand how to build bioinformatics pipelines with Rcwl

Learning objectives

Create a basic echo tool using Rcwl
Build a DNAseq pipeline using tools of bwa, sam2bam, sortBam and idxBam.

Rcwl

Introduction to Rcwl

Rcwl setup

The Rcwl package is aimed at a simple and user-friendly way to manage command line tools and build data analysis pipelines in R using Common Workflow Language (CWL). The Rcwl and RcwlPipelines packages are available in Bioc 3.11 and R >= 3.6. You can install them by the BiocManager package.

if (!requireNamespace("RcwlPipelines", quietly = TRUE))
    BiocManager::install(c("Rcwl", "RcwlPipelines"))
library(Rcwl)
library(RcwlPipelines)

System requirements

In addition to the R packages, the following tools are required to be installed to run the examples in this document.

cwltool (>= 1.0.2018)
nodejs
Docker (optional)

The cwltool is the reference implementation of the Common Workflow Language, which is used to run the CWL scripts. The nodejs is required when the CWL scripts use the JavaScript language. The Docker containers simplify software installation and management. A CWL runner can pull the required Docker containers automatically and adjust the paths of input files.

You can find instructions to install these tools here:

First example

Hello world

The main class and constructor function is cwlParam, which wraps a command line tool and its parameters in a cwlParam object. Let’s start with a simple example, echo hello world.

First, we load the package and then define the input parameter for “echo”, a string without a prefix. Just an id option is required.

input1 <- InputParam(id = "sth")

Second, we create a cwlParam object with baseCommand for the command to execute and InputParamList for the input parameters.

echo <- cwlParam(baseCommand = "echo", inputs = InputParamList(input1))

Now we have a command object to run. Let’s send a string “Hello World!” to the object. Without defining the outputs, it will stream standard output to a temporary file by default.

echo$sth <- "Hello World!"
echo

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   sth (string):  Hello World!
## outputs:
## output:
##   type: stdout

We also defined some convenient accessor function for the cwlParam object.

cwlVersion(echo)  ## "v1.0"

## [1] "v1.0"

cwlClass(echo)    ## "CommandLineTool", can be ""

## [1] "CommandLineTool"

baseCommand(echo) ## "echo"

## [1] "echo"

inputs(echo)      ## command line arguments and values

## inputs:
##   sth (string):  Hello World!

outputs(echo)     ## the name of output results/files to be saved out

## outputs:
## output:
##   type: stdout

                  ## to pass to later steps in pipeline
arguments(echo)   ## command line arguments that pass by default, no

## list()

                  ## need to assign values.

Test run

The function runCWL is used to run the CWL object by invoking the python tool cwltool. The returned value is a list with the shell command executed, temporary output and logs. The output directory is the current folder by default, but it can be changed by setting outdir argument. All standard out and standard error stream can also be printed by setting stderr = "".

r1 <- runCWL(echo, outdir = tempdir())
r1

## List of length 3
## names(3): command output logs

First Let’s check the output file to make sure the command was executed successfully.

r1$output

## [1] "/tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543"

readLines(r1$output)

## [1] "Hello World!"

Here is the actual command line code (genearted from running runCWL) that was executed by cwltools.

r1$command

## [1] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] /tmp/vopg2q67$ echo \\"      
## [2] "    'Hello World!' > /tmp/vopg2q67/962b12fad463694ac70325deb4a59990b82ba543"

The log shows the details of how the cwltool works with CWL scripts.

r1$log

##  [1] "\033[1;30mINFO\033[0m /usr/local/bin/cwltool 3.0.20200807132242"                                                   
##  [2] "\033[1;30mINFO\033[0m Resolved '/tmp/RtmpZjf3Pi/file4cffa6f1c0.cwl' to 'file:///tmp/RtmpZjf3Pi/file4cffa6f1c0.cwl'"
##  [3] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] /tmp/vopg2q67$ echo \\"                                             
##  [4] "    'Hello World!' > /tmp/vopg2q67/962b12fad463694ac70325deb4a59990b82ba543"                                       
##  [5] "\033[1;30mINFO\033[0m [job file4cffa6f1c0.cwl] completed success"                                                  
##  [6] "{"                                                                                                                 
##  [7] "    \"output\": {"                                                                                                 
##  [8] "        \"location\": \"file:///tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543\","                        
##  [9] "        \"basename\": \"962b12fad463694ac70325deb4a59990b82ba543\","                                               
## [10] "        \"class\": \"File\","                                                                                      
## [11] "        \"checksum\": \"sha1$a0b65939670bc2c010f4d5d6a0b3e4e4590fb92b\","                                          
## [12] "        \"size\": 13,"                                                                                             
## [13] "        \"path\": \"/tmp/RtmpZjf3Pi/962b12fad463694ac70325deb4a59990b82ba543\""                                    
## [14] "    }"                                                                                                             
## [15] "}"                                                                                                                 
## [16] "\033[1;30mINFO\033[0m Final process status is success"

Internally, the runCWL generates two scripts: the “.cwl” configuration file and the “.yml” file containing input parameter values. The location and prefix name of these two files are defined with the argument prefix, with the default value as tempfile().

The cwltool parses the two scripts and translates them into shell command shown in r1$command. Since the output prameters are not defined in the echo object, so the output was returned to stdout by default with the filepath in r1$output.

Components

Input Parameters

Essential Input parameters

When defining the input parameters, three arguments are usually needed: id, type, and prefix. The id is a name for the input parameter which could be used later to assign values for that parameter, The type can be string, int, long, float, and double, etc. The prefix is the command line prefix that can be in varied formats based on users’ habit. More detail can be found at: <htps://www.commonwl.org/v1.0/CommandLineTool.html#CWLType>.

Here is an example from the CWL user guide (http://www.commonwl.org/user_guide/03-input/). We defined the echo with different type of input parameters by InputParam, and the stdout option can be used to capture the standard output stream into a file:

e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f")
e2 <- InputParam(id = "string", type = "string", prefix = "-s")
e3 <- InputParam(id = "int", type = "int", prefix = "-i")
e4 <- InputParam(id = "file", type = "File",
                 prefix = "--file=", separate = FALSE)
echoA <- cwlParam(baseCommand = "echo",
                  inputs = InputParamList(e1, e2, e3, e4),
                  stdout = "output.txt")
echoA

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   flag (boolean): -f 
##   string (string): -s 
##   int (int): -i 
##   file (File): --file= 
## outputs:
## output:
##   type: stdout
## stdout: output.txt

Then let’s assign values to the input parameters.

echoA$flag <- TRUE
echoA$string <- "Hello"
echoA$int <- 1
tmpfile <- tempfile()
write("World", tmpfile)
echoA$file <- tmpfile
r2 <- runCWL(echoA, outdir = tempdir())
r2$command

## [1] "\033[1;30mINFO\033[0m [job file4cf7985c3fe.cwl] /tmp/zkhynmwn$ echo \\"                
## [2] "    --file=/tmp/tmpnd1bwl_n/stg2fa095a3-f489-4959-9b86-484720dab8cd/file4cf511526d0 \\"
## [3] "    -f \\"                                                                             
## [4] "    -i \\"                                                                             
## [5] "    1 \\"                                                                              
## [6] "    -s \\"                                                                             
## [7] "    Hello > /tmp/zkhynmwn/output.txt"

Array Inputs

When one command line parameter takes multiple values, there are three different ways to define an array input.

a1 <- InputParam(id = "A", type = "string[]", prefix = "-A")  ## most common, -A a b c
a2 <- InputParam(id = "B",                                    ## -B=d, -B=e, -B=f
                 type = InputArrayParam(items = "string",
                                        prefix="-B=", separate = FALSE))
a3 <- InputParam(id = "C", type = "string[]", prefix = "-C=", ## -C=g,h,i
                 itemSeparator = ",", separate = FALSE) 
echoB <- cwlParam(baseCommand = "echo",
                 inputs = InputParamList(a1, a2, a3))

We then set values for the three inputs and see how each of them print in the command line:

echoB$A <- echoB$B <- echoB$C <- letters[1:3]
echoB

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   A (string[]): -A a b c
##   B:
##     type: array
##     prefix: -B= a b c
##   C (string[]): -C= a b c
## outputs:
## output:
##   type: stdout

Now we can check whether the command behaves as we expected.

r3 <- runCWL(echoB, outdir = tempdir())
r3$command

## [1] "\033[1;30mINFO\033[0m [job file4cf443da4fe.cwl] /tmp/xuajzthq$ echo \\"
## [2] "    -A \\"                                                             
## [3] "    a \\"                                                              
## [4] "    b \\"                                                              
## [5] "    c \\"                                                              
## [6] "    -B=a \\"                                                           
## [7] "    -B=b \\"                                                           
## [8] "    -B=c \\"                                                           
## [9] "    -C=a,b,c > /tmp/xuajzthq/57bdf92ebd35aaf8fb67b10f2cc69b021f19dc9e"

readLines(r3$output)

## [1] "-A a b c -B=a -B=b -B=c -C=a,b,c"

Output Parameters

Capturing Output for passing in pipeline

The outputs, similar to the inputs, is a list of output parameters. Three options, id, type and glob, are usually required to be defined. The id and type are similar to InputParam, and the glob option is used to define a pattern to find files relative to the output directory.

Here is an example to unzip a compressed gz file. First, we generate a compressed R script file:

zzfil <- file.path(tempdir(), "sample.R.gz")
zz <- gzfile(zzfil, "w")  ## open for writing in text mode. Can use "open=rt" to "readLines()"
cat("sample(1:10, 5)", file = zz, sep = "\n")    ## a string 
close(zz)

The command line to do this job looks like:

$ gzip -d -c sample.R.gz

We then define a cwlParam object to use “gzip” to uncompress an input file:

z1 <- InputParam(id = "decomp", type = "boolean", prefix = "-d")
z2 <- InputParam(id = "stdout", type = "boolean", prefix = "-c")
z3 <- InputParam(id = "zfile", type = "File")
ofile <- "sample.R"
o1 <- OutputParam(id = "rfile", type = "File", glob = ofile)
gz <- cwlParam(baseCommand = "gzip",
               inputs = InputParamList(z1, z2, z3),
               outputs = OutputParamList(o1),  ## this is to glob the stdout file into "output" folder. 
                stdout = ofile)  ## "stdout" is to specify the stdout file.

Now the gz object can be used to decompress the previously generated compressed file:

gz$decomp <- TRUE   ## $ gzip -d
gz$stdout <- TRUE   ## $ gzip -d -c
gz$zfile <- zzfil   ## gzip -d -c sample.R.gz
r4 <- runCWL(gz, outdir = tempdir())
readLines(r4$output)

## [1] "sample(1:10, 5)"

## print(r4$command)

When building the cwl tool using runCWL, We can use arguments argument to include some parameters with default values so that when users use this tool, they don’t need to assign values for all (especially some trivial) parameters.

z1 <- InputParam(id = "zfile", type = "File")
o1 <- OutputParam(id = "rfile", type = "File", glob = ofile)
gz1 <- cwlParam(baseCommand = "gzip",
               arguments = list("-d", "-c"),
               inputs = InputParamList(z1),
               outputs = OutputParamList(o1),
               stdout = ofile)
gz1

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: gzip 
## arguments: -d -c 
## inputs:
##   zfile (File):  
## outputs:
## rfile:
##   type: File
##   outputBinding:
##     glob: sample.R
## stdout: sample.R

gz1$zfile <- zzfil
r4a <- runCWL(gz1, outdir = tempdir())
r4a$output

## [1] "/tmp/RtmpZjf3Pi/sample.R"

readLines(r4a$output)

## [1] "sample(1:10, 5)"

To make it for general usage, we can define a pattern with javascript to glob the output, which requires node to be installed in your system PATH:

pfile <- "$(inputs.zfile.path.split('/').slice(-1)[0].split('.').slice(0,-1).join('.'))"

Or we can directly use the CWL built-in file property, nameroot:

pfile <- "$(inputs.zfile.nameroot)"
o2 <- OutputParam(id = "rfile", type = "File", glob = pfile)
req1 <- list(class = "InlineJavascriptRequirement")
gz2 <- cwlParam(baseCommand = c("gzip", "-d", "-c"),
               requirements = list(), ## assign list(req1) if node installed.
               inputs = InputParamList(z1),
               outputs = OutputParamList(o2),
               stdout = pfile)
gz2$zfile <- zzfil
r4b <- runCWL(gz2, outdir = tempdir())

Array Outputs

We can also capture multiple output files by defining a pattern in glob. In the following example, we are only extracting text files to our designated output folder in runCWL(outdir="").

a <- InputParam(id = "a", type = InputArrayParam(items = "string"))
b <- OutputParam(id = "b", type = OutputArrayParam(items = "File"),
                 glob = "*.txt")
touch <- cwlParam(baseCommand = "touch",
                  inputs = InputParamList(a),
                  outputs = OutputParamList(b))
touch$a <- c("a.txt", "b.gz", "c.txt")
r5 <- runCWL(touch, outdir = tempdir())
r5$output

## [1] "/tmp/RtmpZjf3Pi/a.txt" "/tmp/RtmpZjf3Pi/c.txt"

Run approaches

Running Tools in parallel

The CWL can also work in high performance clusters with batch-queuing system, such as SGE, PBS, SLURM and so on, using the Bioconductor package BiocParallel. Here is an example to submit jobs with “Multicore” and “SGE”, separately:

library(BiocParallel)
demolist <- as.list(LETTERS[1:6])
names(demolist) <- LETTERS[1:6]
## submit with mutlicore
res1 <- runCWLBatch(cwl = echo, outdir = tempdir(),
                    inputList = list(content = demolist),
                    BPPARAM = MulticoreParam(4))
readLines(res1[[1]]$output)
## submit with SGE
res2 <- runCWLBatch(cwl = echo, outdir = tempdir(),
                    inputList = list(content = demolist),
                    BPPARAM = BatchtoolsParam(workers = 4, cluster = "sge",
                                              resources = list(queue = "all.q")))

Web Application

Here we build a tool with different types of input parameters, and then use echo to print them out:

e1 <- InputParam(id = "flag", type = "boolean",
                 prefix = "-f", doc = "boolean flag")
e2 <- InputParam(id = "string", type = "string", prefix = "-s")
e3 <- InputParam(id = "option", type = "string", prefix = "-o")
e4 <- InputParam(id = "int", type = "int", prefix = "-i", default = 123)
e5 <- InputParam(id = "file", type = "File",
                 prefix = "--file=", separate = FALSE)
e6 <- InputParam(id = "array", type = "string[]", prefix = "-A",
                 doc = "separated by comma")  ## "doc" 
webEcho <- cwlParam(baseCommand = "echo", id = "webEcho",
                 label = "Test parameter types",
                 inputs = InputParamList(e1, e2, e3, e4, e5, e6),
                 stdout = "output.txt")
webEcho

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   flag (boolean): -f 
##   string (string): -s 
##   option (string): -o 
##   int (int): -i 123
##   file (File): --file= 
##   array (string[]): -A 
## outputs:
## output:
##   type: stdout
## stdout: output.txt

Some input parameters can be predefined in a list, which will be converted to selected options in the webapp. An upload parameter can be used to generate an upload interface for the file type option. If set asFALSE, the upload field will be text input (file path) instead of file input.

library(shiny)
inputs(webEcho)  ## define 2 options for the input parameter "option"
inputList <- list(option = c("option1", "option2"))
app <- cwlShiny(webEcho, inputList, upload = TRUE)
runApp(app)

Figure 1: Shiny webApp for webEcho.

Wrap R functions

Rcwl is also designed to wrap any R function into a cwl tool and then use directly into a pipeline.

anyfun <- function(x) x
rtool <- cwlParam(baseCommand = anyfun,
                  inputs = InputParamList(InputParam(id="anyinput", type = "string")))
rtool$anyinput <- "abcd"  ## assign values for input parameter
rtoolres <- runCWL(rtool) ## execute the CWL script
readLines(rtoolres$output)  ## check results

## [1] "[1] \"abcd\""

Build a simple DNASeq pipeline

RcwlPipelines package

We have collected all recipes (R scripts)that build CWL tools and pipelines in a GitHub repository now (https://github.com/hubentu/RcwlRecipes), This repository is intended to be a community effort for bioinformatics tools and pipelines built with Rcwl and CWL directly.

Three core functions are defined for convenient usage of these recipes:

cwlUpdate: sync the recipes.
cwlSearch: search tools or pipelines using keywords.
cwlInstall: source the R object of CWL tool or pipeline into R working environment.

Indexing recipe scripts

The cwlUpdate function updates the recipe scripts from the GitHub repository and collects meta data to a local cache internally using the BiocFileCache package. By default the local cache will be created under your home directory. Here we use temporary directory as an example.

tools <- cwlUpdate(cachePath = tempfile())
tools

## class: BiocFileCache
## bfccache: /tmp/RtmpZjf3Pi/file4cf26d2d8cb
## bfccount: 130
## For more information see: bfcinfo() or bfcquery()

bfcinfo(tools)  ## show a tibble list with all available tools. Use

## # A tibble: 130 x 13
##    rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##    <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
##  1 BFC1  pl_a… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  2 BFC2  pl_A… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  3 BFC3  pl_B… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  4 BFC4  pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  5 BFC5  pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  6 BFC6  pl_b… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  7 BFC7  pl_C… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  8 BFC8  pl_G… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
##  9 BFC9  pl_G… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
## 10 BFC10 pl_h… 2020-10-15… 2020-10-15… /tmp… local /tmp…               NA <NA> 
## # … with 120 more rows, and 4 more variables: expires <dbl>, Type <chr>,
## #   Command <chr>, Container <chr>

                ## "methods(class="BiocFileCache")" to see all
                ## available functions for an "BiocFileCache" object.
table(bfcinfo(tools)$Type)

## 
## pipeline     tool 
##       26      104

data.frame(bfcinfo(tools)[1,])

##    rid         rname         create_time         access_time
## 1 BFC1 pl_alignMerge 2020-10-15 18:55:31 2020-10-15 18:55:31
##                                                                     rpath rtype
## 1 /tmp/RtmpZjf3Pi/file4cf26d2d8cb/RcwlRecipes-master/Rcwl/pl_alignMerge.R local
##                                                                     fpath
## 1 /tmp/RtmpZjf3Pi/file4cf26d2d8cb/RcwlRecipes-master/Rcwl/pl_alignMerge.R
##   last_modified_time etag expires     Type                Command Container
## 1                 NA <NA>      NA pipeline bwaAlign + mergeBamDup      <NA>

Search by keyword

The function cwlSearch helps to search indexed recipes with keywords. In the following example, let’s find the alignment tool using keywords “bwa” and “mem”.

tl <- cwlSearch(c("bwa", "mem"), tools)
tl$rname

## [1] "tl_bwa"

Load tools and pipelines

The function cwlInstall “installs” the tools or pipelines to current environment using the name or the file path to the script. a

cwlInstall(tl$rname)
cwlInstall(tl$fpath)  ## equivalent
bwa

## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: bwa mem 
## requirements:
## - class: DockerRequirement
##   dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
## inputs:
##   threads (int): -t 
##   RG (string): -R 
##   Ref (File):  
##   FQ1 (File):  
##   FQ2 (File?):  
## outputs:
## sam:
##   type: File
##   outputBinding:
##     glob: '*.sam'
## stdout: bwaOutput.sam

Build a pipeline

Piplines can be easily built using the available tools. Here we use the tools for mapping and marking duplicates to build a simple alignment pipeline. First, we need to check whether the required tools (bwa, samtools and picard markduplicates) are available in our repository.

tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup", tools) %>%
    dplyr::filter(Type == "tool") %>%
    dplyr::select(rname, rpath, Command, Container)
tls

## # A tibble: 6 x 4
##   rname      rpath                             Command      Container           
##   <chr>      <chr>                             <chr>        <chr>               
## 1 tl_bwa_in… /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… bwa index    biocontainers/bwa:v…
## 2 tl_bwa     /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… bwa mem      biocontainers/bwa:v…
## 3 tl_markdup /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… picard Mark… quay.io/biocontaine…
## 4 tl_sam2bam /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools vi… biocontainers/samto…
## 5 tl_samtoo… /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools in… biocontainers/samto…
## 6 tl_sortBam /tmp/RtmpZjf3Pi/file4cf26d2d8cb/… samtools so… biocontainers/samto…

Let’s load all the tools into R environment.

invisible(sapply(tls$rpath, cwlInstall))

Next, we can define the input parameters for the whole pjipeline.

p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "File",
                 secondaryFiles =
                     c(".amb", ".ann", ".bwt", ".pac", ".sa"))
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")

Then we define the pipeline steps using the function Step. This function returns stepParam objects, which contain the internal relation with input and output between steps. In the following example, it defines the pipeline steps to process raw fastqs files into duplicates marked alignments.

## bwa
s1 <- Step(id = "bwa", run = bwa,
           In = list(threads = "threads",
                     RG = "RG",
                     Ref = "Ref",
                     FQ1 = "FQ1",
                     FQ2 = "FQ2"))
## sam to bam
s2 <- Step(id = "sam2bam", run = sam2bam,
           In = list(sam = "bwa/sam"))  ## the input is the output of "bwa"
## sort bam
s3 <- Step(id = "sortBam", run = sortBam,
           In = list(bam = "sam2bam/bam"))
## mark duplicates
## s4 <- Step(id = "markdup", run = markdup,
##            In = list(ibam = "sortBam/sbam",
##                      obam = list(
##                          valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
##                      matrix = list(
##                          valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s4 <- Step(id = "idxBam", run = samtools_index,
           In = list(bam = "sortBam/sbam"))

Last, we define the outputs for the whole pipeline and connect all steps.

## req1 <- list(class = "StepInputExpressionRequirement")
## req2 <- list(class = "InlineJavascriptRequirement")
## outputs
## o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o1 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## stepParam
Align <- cwlStepParam(requirements = list(),
                      inputs = InputParamList(p1, p2, p3, p4, p5),
                      outputs = OutputParamList(o1))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4

The pipeline is ready for use. We can plot the pipeline with plotCWL from the Rcwl package.

plotCWL(Align)

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Figure 7: Visualization for the pipeline Align.

Now let’s test the pipeline with some real data.

sysdir <- system.file(package="Bioc2020RCWL")
ids <- c("normal", "tumor")
fq1 <- list.files(file.path(sysdir, "vignettes", "extdata"), pattern="1.fq.gz", full.names = TRUE)
fq2 <- list.files(file.path(sysdir, "vignettes", "extdata"), pattern="2.fq.gz", full.names = TRUE)
fq1 <- as.list(fq1)
fq2 <- as.list(fq2)
rg <- as.list(paste("@RG",
                    paste0("ID:", ids),
                    paste0("SM:", ids), sep = "\\t"))
names(fq1) <- names(fq2) <- names(rg) <- ids
inputList <- list(RG = rg,
                  FQ1 = fq1,
                  FQ2 = fq2)
paramList <- list(threads = 2,
                  Ref = file.path(sysdir, "vignettes", "extdata", "ref.fa"))
result <- runCWLBatch(cwl = Align, outdir = tempdir(), inputList,
                      paramList, BPPARAM = MulticoreParam(2),
                      stderr = "", cwlTemp=TRUE, docker = FALSE)

Check the results:

dir(file.path(tempdir(), "normal"))

Rcwl tutorial

https://hubentu.github.io/Rcwl

Connecting Bioconductor to other bioinformatics tools using Rcwl