This WILDS WDL workflow performs RNA-seq analysis using STAR's two-pass alignment methodology and DESeq2 differential expression analysis. It is intended to be a straightforward demonstration of an RNA sequencing pipeline within the context of the WILDS ecosystem.
The workflow performs the following key steps:
- Optional automatic reference genome download (if not provided)
- STAR index building
- STAR two-pass alignment for each sample
- RNA-SeQC quality control analysis
- Combining gene count matrices
- DESeq2 differential expression analysis with visualization

Key features:
- Two-pass STAR alignment for improved splice junction detection
- Automatic reference genome download (optional)
- Quality control with RNA-SeQC
- Differential expression analysis with DESeq2
- Visualization of results (PCA, volcano plots, heatmaps)
- Compatible with the WILDS workflow ecosystem

Requirements:
- Cromwell, miniWDL, or another WDL-compatible workflow executor
- Docker/Apptainer (the workflow uses WILDS Docker containers)

To run the workflow:
- Create an inputs JSON file with your sample information:
{
  "star_deseq2.samples": [
    {
      "name": "sample1",
      "r1": "/path/to/sample1_1.fastq.gz",
      "r2": "/path/to/sample1_2.fastq.gz",
      "condition": "treatment"
    },
    {
      "name": "sample2",
      "r1": "/path/to/sample2_1.fastq.gz",
      "r2": "/path/to/sample2_2.fastq.gz",
      "condition": "control"
    }
  ],
  "star_deseq2.reference_level": "control",
  "star_deseq2.contrast": "condition,treatment,control"
}
- Run the workflow using your preferred WDL executor:
# Cromwell
java -jar cromwell.jar run ww-star-deseq2.wdl --inputs ww-star-deseq2-inputs.json --options ww-star-deseq2-options.json
# miniWDL
miniwdl run ww-star-deseq2.wdl -i ww-star-deseq2-inputs.json
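
For more than a handful of samples, the inputs file from the first step can be generated rather than written by hand. The following is a minimal sketch, not part of the workflow itself: `build_inputs` and `sample_sheet` are hypothetical names introduced here, and the key names simply mirror the example inputs JSON above.

```python
import json


def build_inputs(samples, reference_level, contrast):
    """Return a dict shaped like the example star_deseq2 inputs JSON.

    `samples` is a list of dicts with the keys name, r1, r2, and condition.
    This helper is hypothetical and only mirrors the example in this README.
    """
    required = ("name", "r1", "r2", "condition")
    for sample in samples:
        missing = [key for key in required if key not in sample]
        if missing:
            raise ValueError(f"sample {sample.get('name', '?')} is missing {missing}")
    return {
        "star_deseq2.samples": [{key: s[key] for key in required} for s in samples],
        "star_deseq2.reference_level": reference_level,
        "star_deseq2.contrast": contrast,
    }


# Hypothetical sample sheet; replace the placeholder paths with your own.
sample_sheet = [
    {"name": "sample1", "r1": "/path/to/sample1_1.fastq.gz",
     "r2": "/path/to/sample1_2.fastq.gz", "condition": "treatment"},
    {"name": "sample2", "r1": "/path/to/sample2_1.fastq.gz",
     "r2": "/path/to/sample2_2.fastq.gz", "condition": "control"},
]

inputs = build_inputs(sample_sheet, "control", "condition,treatment,control")
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)  # redirect to ww-star-deseq2-inputs.json to use with the commands above
```

Printing instead of writing a file keeps the sketch free of side effects; redirect the output wherever your executor expects the inputs file.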
The workflow pairs well with the ww-sra workflow for downloading data from NCBI's Sequence Read Archive.
The workflow accepts the following inputs:
Parameter | Description | Type | Required? | Default |
---|---|---|---|---|
samples | Array of sample information objects | Array[SampleInfo] | Yes | - |
reference_genome | Reference genome information | RefGenome | No | GRCh38.p14 (auto-downloaded) |
reference_level | Reference level for DESeq2 | String | No | "" |
contrast | Contrast string for DESeq2 | String | No | "" |
Each entry in the samples array should contain:
- name: Sample identifier
- r1: Path to R1 FASTQ file
- r2: Path to R2 FASTQ file
- condition: Group/condition for differential expression
If provided, the reference_genome should contain:
- name: Reference genome name
- fasta: Path to reference FASTA file
- gtf: Path to reference GTF file
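
Before submitting, it can help to check that reference_level and contrast agree with the conditions listed in samples. The sketch below is illustrative only. It assumes the contrast string has the form `factor,numerator,denominator`, which matches the example value `condition,treatment,control` and the shape of a DESeq2 contrast vector; how the workflow actually parses the string is not documented here. `check_design` is a hypothetical helper.

```python
def check_design(samples, reference_level="", contrast=""):
    """Return a list of problems found; an empty list means none were detected.

    Assumption: contrast is 'factor,numerator,denominator'. Empty strings
    (the documented defaults) are skipped rather than flagged.
    """
    problems = []
    conditions = {sample["condition"] for sample in samples}
    if len(conditions) < 2:
        problems.append("need at least two distinct conditions")
    if reference_level and reference_level not in conditions:
        problems.append(f"reference_level {reference_level!r} is not a sample condition")
    if contrast:
        parts = [part.strip() for part in contrast.split(",")]
        if len(parts) != 3:
            problems.append("contrast should look like 'factor,numerator,denominator'")
        else:
            _, numerator, denominator = parts
            for level in (numerator, denominator):
                if level not in conditions:
                    problems.append(f"contrast level {level!r} is not a sample condition")
    return problems


# Conditions taken from the example inputs; file paths are irrelevant to the check.
example_samples = [
    {"name": "sample1", "condition": "treatment"},
    {"name": "sample2", "condition": "control"},
]
print(check_design(example_samples, "control", "condition,treatment,control"))
```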
The workflow produces the following outputs:
Output | Description |
---|---|
star_bam | Aligned BAM files for each sample |
star_bai | BAM indexes for each sample |
star_gene_counts | Raw gene counts for each sample |
star_log_final | STAR final logs |
star_log_progress | STAR progress logs |
star_log | STAR main logs |
star_sj | STAR splice junction files |
rnaseqc_metrics | RNA-SeQC quality metrics |
combined_counts_matrix | Combined gene counts matrix |
sample_metadata | Sample metadata table |
deseq2_all_results | Complete DESeq2 results |
deseq2_significant_results | Filtered significant DESeq2 results |
deseq2_normalized_counts | DESeq2 normalized counts |
deseq2_pca_plot | PCA plot of samples |
deseq2_volcano_plot | Volcano plot of differential expression |
deseq2_heatmap | Heatmap of differentially expressed genes |
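
To illustrate how per-sample counts relate to a combined matrix, here is a rough sketch of merging STAR gene count tables. It assumes the files follow STAR's `ReadsPerGene.out.tab` layout (four `N_*` summary rows, then one row per gene with an unstranded column followed by two strand-specific columns). Which column the workflow's own combining step uses, and the exact format of combined_counts_matrix, are not stated here, so `read_star_counts`, `combine_counts`, and the `column=1` default are illustrative choices only.

```python
import csv
import io

# Summary rows that STAR writes before the per-gene rows.
STAR_SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def read_star_counts(handle, column=1):
    """Read one STAR gene count table into {gene_id: count}.

    column=1 selects the unstranded counts (an assumption; pick 2 or 3
    for stranded libraries).
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] in STAR_SUMMARY_ROWS:
            continue
        counts[row[0]] = int(row[column])
    return counts


def combine_counts(per_sample):
    """Merge {sample: {gene: count}} into a header and gene-by-sample rows."""
    names = list(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    header = ["gene_id"] + names
    # Genes absent from a sample's table are filled with 0.
    rows = [[gene] + [per_sample[name].get(gene, 0) for name in names] for gene in genes]
    return header, rows


# Tiny made-up tables standing in for real STAR output.
summary = "N_unmapped\t10\t10\t10\nN_multimapping\t5\t5\t5\nN_noFeature\t3\t20\t4\nN_ambiguous\t1\t0\t1\n"
sample1_text = summary + "GENE1\t7\t0\t7\nGENE2\t2\t1\t1\n"
sample2_text = summary + "GENE1\t4\t4\t0\nGENE3\t9\t9\t0\n"

per_sample = {
    "sample1": read_star_counts(io.StringIO(sample1_text)),
    "sample2": read_star_counts(io.StringIO(sample2_text)),
}
header, rows = combine_counts(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```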
For Fred Hutch users, we recommend using PROOF to submit this workflow directly to the on-premise HPC cluster. To do this:
- Start by either cloning or downloading a copy of this repository to your local machine.
  - Cloning: git clone https://github.com/getwilds/ww-star-deseq2.git
  - Downloading: Click the green "Code" button in the top right corner, then click "Download ZIP".
- Update ww-star-deseq2-inputs.json with your sample names, FASTQ file paths, and conditions.
- Update ww-star-deseq2-options.json with the location where output data should be saved (final_workflow_outputs_dir).
- Submit the WDL file along with your custom JSON files to the Fred Hutch cluster via PROOF by following our SciWiki documentation.
Additional Notes:
- Keep in mind that all file paths in the JSON files must be visible to the Fred Hutch cluster, e.g. /fh/fast/ or an AWS S3 bucket. Input file paths on your local machine won't work in PROOF.
- Specific reference genome files can be provided as inputs, but if none are provided, the workflow will automatically download a GRCh38 reference genome and use that. For a first run, we recommend starting with the default reference files.
- To avoid duplication of reference genome data, we highly recommend executing this workflow with call caching enabled in the options JSON (write_to_cache, read_from_cache, already set to true here).
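
As a rough illustration of the options mentioned in these notes, the sketch below assembles an options dictionary with the three keys named above, which are standard Cromwell workflow options. The repository's own ww-star-deseq2-options.json may contain additional settings; `build_options` and the output path are hypothetical placeholders.

```python
import json


def build_options(output_dir, use_call_cache=True):
    """Return a minimal Cromwell options dict (hypothetical helper)."""
    return {
        # Where Cromwell copies final workflow outputs.
        "final_workflow_outputs_dir": output_dir,
        # Call caching: store results and reuse them on later runs.
        "write_to_cache": use_call_cache,
        "read_from_cache": use_call_cache,
    }


# Placeholder path; use a location visible to the cluster.
options = build_options("/path/to/outputs")
options_json = json.dumps(options, indent=2)
print(options_json)
```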
This workflow uses the following Docker containers from the WILDS Docker Library:
- getwilds/star:2.7.6a - For STAR alignment
- getwilds/rnaseqc:2.4.2 - For RNA-SeQC quality control
- getwilds/gtf-smash:v8 - For reference genome download
- getwilds/combine-counts:0.1.0 - For combining count matrices
- getwilds/deseq2:1.40.2 - For differential expression analysis
All containers are available on both DockerHub and GitHub Container Registry.
For questions, bugs, and/or feature requests, reach out to the Fred Hutch Data Science Lab (DaSL) at wilds@fredhutch.org, or open an issue on our issue tracker.
If you would like to contribute to this WILDS WDL workflow, see our contribution guidelines as well as our WILDS Contributor Guide for more details.
Distributed under the MIT License. See LICENSE for details.