Workflows are ubiquitous in the data science ecosystem. The ability to automate repetitive tasks, build complex pipelines, and schedule and distribute tasks to cloud infrastructures has popularized the use of workflow engines and helps reduce the risk of errors associated with human operator fatigue. Workflow engines such as Galaxy [2], Snakemake [5], Cromwell [7], Knime [3], Apache Airflow [1], and Toil [6], to name a few offerings, are widely used in life science computational applications. This variety can, however, become a source of difficulty when one commits to a particular platform and then tries to exchange information with other platforms or to migrate away from the initial choice. Hence, a community of experts has dedicated efforts to defining open specifications for the description of workflows, as well as supporting tools such as converters.
Using an example based on a Next Generation Sequencing (NGS) application, this recipe shows the reader how to make workflows more interoperable and reusable thanks to existing, off-the-shelf tools.
1. CWL: Common Workflow Language - A brief overview
CWL, short for Common Workflow Language, is an open standard developed by a consortium of experts, including workflow engine developers, data scientists, data analysts and bioinformaticians.
CWL specifications are available from:
CWL uses YAML syntax to describe workflow steps, tools, inputs, outputs and parameters.
CWL is meant to provide platform-independent workflow descriptions, meaning that people should ideally describe a workflow once and then be able to execute it on any CWL-aware workflow engine.
CWL is currently implemented by an increasing number of platforms, which are listed here
The CWL user guide is available here:
See a CWL example
cwlVersion: v1.0
class: Workflow
- class: SubworkflowFeatureRequirement
- class: ScatterFeatureRequirement
- class: StepInputExpressionRequirement
- class: InlineJavascriptRequirement
- class: MultipleInputFeatureRequirement
type: Directory
label: "BOWTIE indices folder"
doc: "Path to BOWTIE generated indices folder"
type: File
label: "Annotation file"
format: ""
doc: "Tab-separated input annotation file"
type: string
label: "Effective genome size"
doc: "MACS2 effective genome size: hs, mm, ce, dm or number, for example 2.7e9"
type: File
label: "Chromosome length file"
format: ""
doc: "Chromosome length file"
type: File?
default: null
label: "Control BAM file"
format: ""
doc: "Control BAM file for MACS2 peak calling"
type: File
label: "FASTQ input file"
format: ""
doc: "Reads data in a FASTQ format, received after single end sequencing"
type: boolean?
default: false
label: "Remove duplicates"
doc: "Calls samtools rmdup to remove duplicates from sorted BAM file"
type: int?
default: 2
doc: "Number of threads for those steps that support multithreading"
label: "Number of threads"
type: File
format: ""
label: "BigWig file"
doc: "Generated BigWig file"
outputSource: bam_to_bigwig/bigwig_file
type: File
label: "FASTQ statistics"
format: ""
doc: "fastx_quality_stats generated FASTQ file quality statistics file"
outputSource: fastx_quality_stats/statistics_file
type: File
label: "BOWTIE alignment log"
format: ""
doc: "BOWTIE generated alignment log"
outputSource: bowtie_aligner/log_file
run: ./tools/extract-fastq.cwl
compressed_file: fastq_file
out: [fastq_file]
run: ./tools/fastx-quality-stats.cwl
input_file: extract_fastq/fastq_file
out: [statistics_file]
2. Conditional Workflow and the CWL when
When describing a protocol, it is often desirable to specify what to do if a specific situation arises. Computational workflows are no different, and it is in fact quite common to need a specific set of steps to run only if a threshold or condition is met. Therefore, the Common Workflow Language (since version 1.2) contains a dedicated keyword, when, to represent such situations. The following block shows how it can be used with an example:
class: Workflow
cwlVersion: v1.2
inputs:
  val: int

steps:
  step1:
    in:
      in1: val
      a_new_var: val
    run: foo.cwl
    when: $(inputs.in1 < 1)
    out: [out1]

  step2:
    in:
      in1: val
      a_new_var: val
    run: foo.cwl
    when: $(inputs.a_new_var > 2)
    out: [out1]

outputs:
  out1:
    type: string
    outputSource:
      - step1/out1
      - step2/out1
    pickValue: first_non_null

requirements:
  InlineJavascriptRequirement: {}
  MultipleInputFeatureRequirement: {}
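The behaviour of the example above can be mimicked in a few lines of Python. This is only a sketch of the semantics, not how a CWL runner is implemented: a step whose when expression is false is skipped and contributes a null output, and pickValue: first_non_null then selects the first output that is not null. The function foo is a hypothetical stand-in for foo.cwl, whose content is not shown in this recipe.

```python
from typing import Any, Callable, Optional


def run_step(when: Callable[[dict], bool], inputs: dict,
             tool: Callable[[dict], Any]) -> Optional[Any]:
    """Run the tool only if the condition holds; a skipped step yields None (CWL null)."""
    return tool(inputs) if when(inputs) else None


def first_non_null(values: list) -> Any:
    """Return the first value that is not None; fail if every source is null."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("all sources are null")


def foo(inputs: dict) -> str:
    # Hypothetical stand-in for foo.cwl: returns a string built from in1.
    return f"foo {inputs['in1']}"


def workflow(val: int) -> str:
    # Both steps receive the same workflow input, as in the CWL example.
    step_inputs = {"in1": val, "a_new_var": val}
    step1 = run_step(lambda i: i["in1"] < 1, step_inputs, foo)
    step2 = run_step(lambda i: i["a_new_var"] > 2, step_inputs, foo)
    return first_non_null([step1, step2])
```

With val = 0 only step1 runs, with val = 3 only step2 runs, and with val = 1 or 2 neither condition is met, so the selection fails because there is no non-null value to pick.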
3. Semantic Markup of CWL workflows
CWL documents can be annotated with schema.org or EDAM vocabulary elements to support findability.
The blocks of code below show how this is done with two examples.
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
label: An example tool demonstrating metadata.
doc: Note that this is an example and the metadata is not necessarily consistent.
coresMin: 4
type: File
label: Aligned sequences in BAM format
format: edam:format_2572
position: 1
baseCommand: [ wc, -l ]
stdout: output.txt
type: stdout
format: edam:format_1964
label: A text file that contains a line count
- class: s:Person
s:name: Denis Yuen
- class: s:Person
s:name: Brian O'Connor
s:dateCreated: "2016-12-13"
s:keywords: edam:topic_0091 , edam:topic_0622
s:programmingLanguage: C
View another example
s:name: "biowardrobe_chipseq_se"
class: s:CreativeWork
s:name: Common Workflow Language
- class: s:Organization
s:legalName: "Cincinnati Children's Hospital Medical Center"
- class: s:PostalAddress
s:addressCountry: "USA"
s:addressLocality: "Cincinnati"
s:addressRegion: "OH"
s:postalCode: "45229"
s:streetAddress: "3333 Burnet Ave"
s:telephone: "+1(513)636-4200"
s:logo: ""
- class: s:Organization
s:legalName: "Allergy and Immunology"
- class: s:Organization
s:legalName: "Barski Research Lab"
- class: s:Person
s:name: Michael Kotliar
- id:
doc: |
The workflow is used to run CHIP-Seq basic analysis with single-end input FASTQ file.
In outputs it returns coordinate sorted BAM file alongside with index BAI file, quality
statistics of the input FASTQ file, reads coverage in a form of bigWig file, peaks calling
data in a form of narrowPeak or broadPeak files.
s:about: |
The workflow is a CWL version of a Python pipeline from BioWardrobe (Kartashov and Barski, 2015).
It starts by extracting input FASTQ file (in case it was compressed). Next step runs BowTie
(Langmead et al., 2009) to perform alignment to a reference genome, resulting in an unsorted SAM file.
The SAM file is then sorted and indexed with samtools (Li et al., 2009) to obtain a BAM file and a BAI index.
Next MACS2 (Zhang et al., 2008) is used to call peaks and to estimate fragment size. In the last few steps,
the coverage by estimated fragments is calculated from the BAM file and is reported in bigWig format. The pipeline
also reports statistics, such as read quality, peak number and base frequency, and other troubleshooting information
using tools such as fastx-toolkit and bamtools.
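Prefixed terms such as edam:format_2572 or s:Person are shorthand for full IRIs. The sketch below shows how such compact identifiers could be expanded. The prefix-to-IRI mapping is an assumption, since the namespace declarations do not appear in the listings above; the values used here follow the convention commonly seen in CWL documents and should be checked against the actual workflow file.

```python
# Assumed prefix map, as it would typically be declared in a CWL document.
NAMESPACES = {
    "s": "https://schema.org/",
    "edam": "http://edamontology.org/",
}


def expand_curie(curie: str, namespaces: dict = NAMESPACES) -> str:
    """Expand a prefixed term such as 'edam:format_2572' into a full IRI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in namespaces:
        raise KeyError(f"unknown prefix in {curie!r}")
    return namespaces[prefix] + local


def expand_keywords(raw: str) -> list:
    """Split a comma-separated keyword string and expand each term."""
    return [expand_curie(term.strip()) for term in raw.split(",") if term.strip()]
```

For instance, expand_keywords applied to the s:keywords value of the first example yields the two EDAM topic IRIs that a search index could use to find the tool.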
4. Publishing Workflows as CWL in WorkflowHub
Workflows are digital objects which can and should be preserved.
A number of repositories exist and may be used to deposit workflows.
One may use a generic repository such as Zenodo to do so (see recipe Depositing to generic repositories - Zenodo use case).
Preferably, one should use a specialized repository such as WorkflowHub, which is presented below.

Fig. 48 The European WorkflowHub website 1.

Fig. 49 The European WorkflowHub website 2.
5. Tools: Apache Airflow playing with CWL
Apache Airflow is "a platform created by the community to programmatically author, schedule and monitor workflows", to quote the project’s site. It is well established in industry settings and has broad uptake.
Apache Airflow represents workflows as Directed Acyclic Graphs (DAGs) and allows these to be serialized as JSON documents.
A defining feature of Apache Airflow is that workflows are generated from Python code. For more information, refer to this tutorial:
CWL-Airflow, a tool developed by Michael Kotliar, Andrey V. Kartashov and Artem Barski, brings CWL support to the Apache Airflow framework, meaning that workflows expressed in CWL can now be executed on the platform [4].
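To make the notion of a DAG and its JSON serialization more concrete, here is a minimal sketch using only the Python standard library. It does not use Airflow and does not reproduce Airflow's own serialization format. The task names are borrowed from the ChIP-Seq example earlier in this recipe, but the dependencies between them are assumed for illustration.

```python
import json
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (dependencies are assumed).
dag = {
    "extract_fastq": set(),
    "fastx_quality_stats": {"extract_fastq"},
    "bowtie_aligner": {"extract_fastq"},
    "bam_to_bigwig": {"bowtie_aligner"},
}


def execution_order(graph: dict) -> list:
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


def to_json(graph: dict) -> str:
    """Serialize the graph as JSON (sets become sorted lists)."""
    return json.dumps({task: sorted(deps) for task, deps in graph.items()}, indent=2)
```

The acyclic constraint is what guarantees that an execution order exists: every task can be scheduled once all the tasks it depends on have finished.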

Fig. 50 The CWL-Airflow component.
A key step in this linkage is the conversion of a workflow expressed in CWL into an Apache Airflow DAG, which can then be executed.
With this example, we aim to raise awareness of the value of platform-independent workflow descriptions.
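The idea behind such a conversion can be illustrated with a small sketch: in CWL, a step input whose source has the form step/output implies a dependency on that step, and these dependencies are the edges of the DAG. The code below is not the CWL-Airflow implementation; it works on a Python dictionary that stands for an already parsed CWL document, reusing the two steps visible in the first example of this recipe. The surrounding inputs and steps keys are standard CWL fields, assumed here because the listing above does not show them.

```python
# A parsed CWL workflow fragment, written directly as a Python dict.
workflow = {
    "inputs": {"fastq_file": "File"},
    "steps": {
        "extract_fastq": {
            "run": "./tools/extract-fastq.cwl",
            "in": {"compressed_file": "fastq_file"},
            "out": ["fastq_file"],
        },
        "fastx_quality_stats": {
            "run": "./tools/fastx-quality-stats.cwl",
            "in": {"input_file": "extract_fastq/fastq_file"},
            "out": ["statistics_file"],
        },
    },
}


def step_dependencies(wf: dict) -> dict:
    """Map each step to the set of upstream steps whose outputs it consumes."""
    steps = wf["steps"]
    deps = {name: set() for name in steps}
    for name, step in steps.items():
        for source in step["in"].values():
            # A source may be a single string or a list of strings.
            sources = source if isinstance(source, list) else [source]
            for src in sources:
                upstream, sep, _port = src.partition("/")
                # Sources without a slash refer to workflow-level inputs.
                if sep and upstream in steps:
                    deps[name].add(upstream)
    return deps
```

Here fastx_quality_stats depends on extract_fastq, while extract_fastq only reads a workflow input and has no upstream step.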
6. BioCompute Object format, an IEEE specification suited for use in regulatory applications
If computational analyses on sequence data are performed in the context of clinical trials, for instance to demonstrate the transcriptomic response to a drug or to show the safety of a compound in populations of distinct genetic backgrounds using genotyping information, it is a regulatory requirement of the US FDA to submit the computational workflows when seeking approval. The availability of such information is a prerequisite for FDA auditors to examine the data.
The IEEE 2791-2020 standard, also known as BCO (BioCompute Object), is a specification designed for this purpose.
This has been made possible thanks to the fast-track submission of a new data format specifically tailored to ensure reproducibility and unambiguous description of key workflow descriptors.

Fig. 51 Cloud-based tools supporting the BCO specification
What are the main features of a BioCompute Object?
A BioCompute Object is serialized as a JSON document. A typical BCO looks like this:
A BioCompute Object may reference a workflow expressed in CWL, thus increasing interoperability.
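As a rough illustration of these two points, the sketch below builds a skeleton BCO-like document in Python and serializes it as JSON. It is not a valid or complete BioCompute Object: the top-level key names and the nesting used to point to a CWL file are given as recalled from the IEEE 2791-2020 schema and should be verified against the published specification, and every value is a hypothetical placeholder.

```python
import json

# Top-level keys expected in a BCO (assumption: verify against IEEE 2791-2020).
REQUIRED_KEYS = (
    "object_id", "spec_version", "etag",
    "provenance_domain", "usability_domain", "description_domain",
    "execution_domain", "io_domain",
)


def missing_keys(bco: dict) -> list:
    """Return the expected top-level keys that are absent from a BCO-like dict."""
    return [key for key in REQUIRED_KEYS if key not in bco]


# All values below are hypothetical placeholders, not a real BCO.
bco = {
    "object_id": "https://example.org/BCO_000001",
    "spec_version": "<IRI of the IEEE 2791 schema version>",
    "etag": "<checksum of the object>",
    "provenance_domain": {"name": "ChIP-Seq single-end analysis"},
    "usability_domain": ["Basic ChIP-Seq analysis of single-end FASTQ input"],
    "description_domain": {"pipeline_steps": []},
    # Reference to the CWL description of the workflow (hypothetical path).
    "execution_domain": {"script": [{"uri": {"uri": "workflow.cwl"}}]},
    "io_domain": {"input_subdomain": [], "output_subdomain": []},
}

# Serialize the object as a JSON document.
bco_json = json.dumps(bco, indent=2)
```

A check such as missing_keys only tests for the presence of keys; actual conformance would require validation against the official JSON schema of the standard.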