Pipeline setup and configuration¶
There are a number of main files that governs how the pipeline is executed listed below:
- Snakefile
- common.smk
- config.yaml
- resources.yaml
- profile/uppsala/config.yaml
- samples.tsv and units.tsv
There is more general information about the content of these files in hydra-genetics documentation in code standards, config and Snakefile.
Snakefile¶
The Snakefile is located in workflow/ and imports hydra-genetics modules and rules as well as modifies these rules when needed. It also imports pipeline specific rules and define rule orders. Finally, this is where the rule all is defined.
common.smk¶
The common.smk is located under workflow/rules/. This is a general rule taking care of any actions that are not directly connected with running a specific program. It includes version checks, import of config, resources, tsv-files and validations using schemas. Functions used by pipeline specific rules are also defined here as well as the output files using the function compile_output_list which programmatically generates a list of all necessary output files for the module to be targeted in the all rule defined in the Snakemake file. See further Result files.
config.yaml¶
The config.yaml is located under config/. The file ties all file and other dependencies as well as parameters for different rules together.
See further pipeline configuration.
Expand to view current config.yaml
resources: "config/resources.yaml"
samples: "samples.tsv"
units: "units.tsv"
#hydra_local_path: "PATH_TO_REPO"
modules:
alignment: "v0.7.0"
annotation: "v1.4.0"
biomarker: "v0.8.0"
cnv_sv: "v3.0.0"
filtering: "v0.3.0"
fusions: "v0.3.1"
prealignment: "v1.3.0"
qc: "v0.7.0"
reports: "v1.1.1"
snv_indels: "v2.1.0"
output: "config/output_files_FFPE.yaml"
default_container: "docker://hydragenetics/common:3.1.1.1"
trimmer_software: "fastp_pe"
subsample: "None"
arriba:
container: "docker://hydragenetics/arriba:2.3.0"
arriba_draw_fusion:
container: "docker://hydragenetics/arriba:2.3.0"
add_multi_snv_in_codon:
af_limit: 0.00
artifact_limit: 10000
bcbio_variation_recall_ensemble:
container: "docker://hydragenetics/bcbio-vc:0.2.6"
callers:
- vardict
- gatk_mutect2
bcftools_annotate:
output_type: "z"
annotation_string: "-m DB"
bwa_mem:
container: "docker://hydragenetics/bwa_mem:0.7.17"
bwa_mem_merge:
extra: "-c -p"
bwa_mem_realign_consensus_reads:
container: "docker://hydragenetics/fgbio:2.1.0"
call_small_cnv_deletions:
window_size: 4
region_max_size: 30
min_nr_stdev_diff: 2.5
min_log_odds_diff: 0.3
call_small_cnv_amplifications:
window_size: 3
region_max_size: 15
min_nr_stdev_diff: 8
min_log_odds_diff: 0.4
cnvkit_batch:
container: "docker://hydragenetics/cnvkit:0.9.11"
extra: "--drop-low-coverage"
method: "hybrid"
cnvkit_batch_hrd:
container: "docker://hydragenetics/cnvkit:0.9.11"
method: "hybrid"
cnvkit_call:
container: "docker://hydragenetics/cnvkit:0.9.11"
cnvkit_diagram:
container: "docker://hydragenetics/cnvkit:0.9.11"
cnvkit_export_seg:
container: "docker://hydragenetics/cnvkit:0.9.11"
cnvkit_scatter:
container: "docker://hydragenetics/cnvkit:0.9.11"
cnv_html_report:
show_table: true
cytobands: true
extra_tables:
- name: Small CNVs and 1p19q
description: >
Additional small amplifications and deletions as well as 1p19q co-deletions called by Twist Solid
in-house scripts. Can have overlaps with called regions from other callers.
path: "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.cnv_loh_genes_all.cnv_additional_variants_only.tsv"
- name: Large chromosomal aberrations
description: >
Large chromosomal aberrations in the form of deletions, duplications and copy neutral loss of heterozygosity.
Also warnings of baseline skewness and detection of polyploidy in the sample.
path: "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.cnv_loh_genes_all.cnv_chromosome_arms.tsv"
cnv_tsv_report:
amp_cn_limit: 6.0
baseline_fraction_limit: 0.2
del_1p19q_cn_limit: 1.4
del_1p19q_chr_arm_fraction: 0.3
chr_arm_fraction: 0.3
del_chr_arm_cn_limit: 1.4
amp_chr_arm_cn_limit: 2.6
normal_baf_lower_limit: 0.3
normal_baf_upper_limit: 0.7
normal_cn_lower_limit: 1.7
normal_cn_upper_limit: 2.25
polyploidy_fraction_limit: 0.2
max_cnv_fp_size: 15000000
ctat_splicing_call:
container: "docker://hydragenetics/ctat-splicing:0.0.3"
estimate_ctdna_fraction:
gnomAD_AF_limit: 0.00001
max_somatic_af: 0.4
min_germline_af: 0.1
min_nr_SNPs_per_segment: 35
min_segment_length: 10000000
vaf_baseline: 0.48
fastp_pe:
container: "docker://hydragenetics/fastp:0.20.1"
# Default enabled trimming parameters for fastp. Specified for clarity.
extra: "--trim_poly_g --qualified_quality_phred 15 --unqualified_percent_limit 40 --n_base_limit 5 --length_required 15"
fastp_pe_arriba:
container: "docker://hydragenetics/fastp:0.20.1"
extra: "--max_len1 100"
fastqc:
container: "docker://hydragenetics/fastqc:0.11.9"
fgbio_call_and_filter_consensus_reads:
container: "docker://hydragenetics/fgbio:2.1.0"
max_base_error_rate: "0.2"
min_reads_call: "1 0 0"
min_reads_filter: "1 0 0"
min_input_base_quality_call: 20
min_input_base_quality_filter: 30
fgbio_copy_umi_from_read_name:
container: "docker://hydragenetics/fgbio:2.1.0"
fgbio_group_reads_by_umi:
container: "docker://hydragenetics/fgbio:2.1.0"
umi_strategy: paired
filter_vcf:
snv_soft_filter: "config/filters/config_soft_filter_uppsala_vep105.yaml"
snv_soft_filter_umi: "config/filters/config_soft_filter_umi_vep105.yaml"
snv_hard_filter: "config/filters/config_hard_filter_uppsala_vep105.yaml"
snv_hard_filter_umi: "config/filters/config_hard_filter_umi_vep105.yaml"
snv_hard_filter_purecn: "config/filters/config_hard_filter_purecn.yaml"
cnv_hard_filter_amp: "config/filters/config_hard_filter_cnv_amp.yaml"
cnv_hard_filter_loh: "config/filters/config_hard_filter_cnv_loh.yaml"
germline: "config/filters/config_hard_filter_germline_vep105.yaml"
itd_hard_filter: "config/filters/config_hard_filter_scanitd.yaml"
filter_fuseq_wes:
min_support: 50
filter_on_fusiondb: True
filter_fuseq_wes_umi:
min_support: 15
filter_on_fusiondb: True
finaletoolkit_end_motifs:
container: "docker://hydragenetics/finaletoolkit:0.10.7"
finaletoolkit_frag_length_bins:
container: "docker://hydragenetics/finaletoolkit:0.10.7"
finaletoolkit_interval_end_motifs:
container: "docker://hydragenetics/finaletoolkit:0.10.7"
finaletoolkit_interval_mds:
container: "docker://hydragenetics/finaletoolkit:0.10.7"
finaletoolkit_mds:
container: "docker://hydragenetics/finaletoolkit:0.10.7"
fragmentomics_fragment_length_patient_score:
container: "docker://hydragenetics/fragmentomics:0.1.0"
fuseq_wes:
container: "docker://hydragenetics/fuseq_wes:1.0.1"
fusioncatcher:
container: "docker://hydragenetics/fusioncatcher:1.33"
extra: ""
gatk_calculate_contamination:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_call_copy_ratio_segments:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_collect_allelic_counts:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_collect_read_counts:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_denoise_read_counts:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_model_segments:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_get_pileup_summaries:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_mutect2:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_mutect2_filter:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_mutect2_gvcf:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gatk_mutect2_merge_stats:
container: "docker://hydragenetics/gatk4:4.1.9.0"
gene_fuse:
container: "docker://hydragenetics/genefuse:0.6.1"
general_html_report:
final_directory_depth: 3
multiqc_config: "config/reports/multiqc_config_dna.yaml"
juli_annotate:
container: "docker://hydragenetics/juli:0.1.6.2"
juli_call:
container: "docker://hydragenetics/juli:0.1.6.2"
jumble_cnvkit_call:
container: "docker://hydragenetics/cnvkit:0.9.9"
jumble_run:
container: "docker://hydragenetics/jumble:260310"
jumble_vcf:
dup_limit: 2.5
het_del_limit: 1.5
hom_del_limit: 0.5
hotspot_report:
report_config: "config/reports/hotspot_report.yaml"
levels:
- [200, "ok", "yes"]
- [30, "low", "yes"]
- [0, "low", "not analyzable"]
manta_config_t:
container: "docker://hydragenetics/manta:1.6.0"
manta_run_workflow_t:
container: "docker://hydragenetics/manta:1.6.0"
merge_af_complex_variants:
merge_method: "sum"
merge_cnv_json:
cancer_genes: "config/reports/cancer_genes.csv"
filtered_cnv_vcfs:
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.filter.cnv_hard_filter_amp.fp_tag.vcf
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_loh_genes_all.filter.cnv_hard_filter_loh.fp_tag.annotate_fp.vcf
germline_vcf: snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.exclude.blacklist.vcf.gz
unfiltered_cnv_vcfs:
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.fp_tag.vcf.gz
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_loh_genes_all.fp_tag.vcf.gz
unfiltered_cnv_vcfs_tbi:
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.fp_tag.vcf.gz.tbi
- cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_loh_genes_all.fp_tag.vcf.gz.tbi
mosdepth:
container: "docker://hydragenetics/mosdepth:0.3.2"
extra: "--no-per-base --fast-mode"
mosdepth_bed:
container: "docker://hydragenetics/mosdepth:0.3.2"
msisensor_pro:
container: "docker://hydragenetics/msisensor_pro:1.1.a"
extra: "-c 50"
multiqc:
container: "docker://hydragenetics/multiqc:1.21"
reports:
DNA:
config: "config/reports/multiqc_config_dna.yaml"
included_unit_types: ["N", "T"]
deduplication: ["mark_duplicates"]
qc_files:
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
- "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
- "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
- "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
- "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
- "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
- "qc/gatk_calculate_contamination/{sample}_{type}.contamination.table"
DNA_umi:
config: "config/reports/multiqc_config_dna.yaml"
included_unit_types: ["N", "T"]
deduplication: ["umi"]
qc_files:
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
- "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
- "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
- "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
- "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
- "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
- "qc/gatk_calculate_contamination/{sample}_{type}.contamination.table"
- "alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.histo.tsv"
RNA:
config: "config/reports/multiqc_config_rna.yaml"
included_unit_types: ["R"]
qc_files:
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
- "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
- "qc/mosdepth/{sample}_{type}.mosdepth.global.dist.txt"
- "qc/mosdepth/{sample}_{type}.mosdepth.region.dist.txt"
- "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
- "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
- "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
- "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
- "qc/mosdepth/{sample}_{type}.regions.bed.gz"
optitype:
#container: "docker://hydragenetics/optitype:1.3.5"
container: "docker://fred2/optitype:release-v1.3.1"
sample_type: "-d"
enumeration: 4
picard_collect_alignment_summary_metrics:
container: "docker://hydragenetics/picard:2.25.0"
picard_collect_hs_metrics:
container: "docker://hydragenetics/picard:2.25.0"
extra: "COVERAGE_CAP=50000"
picard_collect_duplication_metrics:
container: "docker://hydragenetics/picard:2.25.0"
picard_collect_insert_size_metrics:
container: "docker://hydragenetics/picard:2.25.0"
picard_mark_duplicates:
container: "docker://hydragenetics/picard:2.25.0"
purecn:
container: "docker://hydragenetics/purecn:2.2.0"
genome: "hg19"
interval_padding: 100
segmentation_method: "internal"
fun_segmentation: "PSCBS"
purecn_coverage:
container: "docker://hydragenetics/purecn:2.2.0"
purecn_purity_file:
min_purity: 0.01
report_fusions:
fusioncatcher_flag_low_support: 15
fusioncatcher_low_support: 3
fusioncatcher_low_support_inframe: 6
star_fusion_flag_low_support: 15
star_fusion_low_support: 2
star_fusion_low_support_inframe: 6
report_gene_fuse:
min_unique_reads: 6
sample_mixup_check:
match_cutoff: 0.7
samtools_merge_bam:
extra: "-c -p"
samtools_merge_bam_umi:
extra: "-c -p"
scarhrd:
container: "docker://hydragenetics/scarhrd:20200825"
seqz: FALSE
scanitd:
container: "docker://hydragenetics/scanitd:0.9.2"
seqtk_subsample:
container: "docker://hydragenetics/seqtk:1.4"
nr_reads: 100000000
nr_reads_rna: 100000000
somalier_ungrouped_extract:
container: "docker://hydragenetics/somalier:0.2.18"
env: SOMALIER_SAMPLE_NAME={sample}_{type}
somalier_ungrouped_relate:
container: "docker://hydragenetics/somalier:0.2.18"
somalier_best_match_report:
match_cutoff: 0.7
star:
container: "docker://hydragenetics/star:2.7.10a"
star_fusion:
container: "docker://hydragenetics/star-fusion:1.10.1"
extra: "--examine_coding_effect"
svdb_merge:
container: "docker://hydragenetics/svdb:2.6.0"
tc_method:
- name: pathology_purecn
cnv_caller:
- cnvkit
- gatk
- jumble
priority: "cnvkit,gatk,jumble"
- name: purecn
cnv_caller:
- cnvkit
- gatk
- jumble
priority: "cnvkit,gatk,jumble"
- name: pathology
cnv_caller:
- cnvkit
- gatk
- jumble
priority: "cnvkit,gatk,jumble"
overlap: 1 #Just merge the two vcf-files without merging regions
extra: "--pass_only" #Just merge the two vcf-files without merging regions
svdb_query:
container: "docker://hydragenetics/svdb:2.6.0"
tmb:
af_lower_limit: 0.05
af_upper_limit: 0.95
af_germline_lower_limit: 0.47
af_germline_upper_limit: 0.53
artifacts: ""
background_panel: ""
db1000g_limit: 0.0001
dp_limit: 100
gnomad_limit: 0.0001
vd_limit: 10
nr_avg_germline_snvs: 2.0
nssnv_tmb_correction: 0.84
variant_type_list: ["missense_variant", "stop_gained", "stop_lost"]
tmb_umi:
af_lower_limit: 0.003
af_upper_limit: 0.997
af_germline_lower_limit: 0.40
af_germline_upper_limit: 0.60
background_panel: ""
db1000g_limit: 0.0001
dp_limit: 100
gnomad_limit: 0.0001
vd_limit: 10
nr_avg_germline_snvs: 0.0
nssnv_tmb_correction: 1.09
filter_nr_observations: 3
variant_type_list: ["missense_variant", "stop_gained", "stop_lost", "synonymous_variant", "stop_retained_variant"]
vardict:
container: "docker://hydragenetics/vardict:1.8.3"
allele_frequency_threshold: "0.01"
allele_frequency_threshold_umi: "0.001"
bed_columns: "-c 1 -S 2 -E 3 -g 4"
extra: "-Q 1"
vep:
container: "docker://hydragenetics/vep:105"
mode: --offline --cache --refseq
vep_wo_pick:
mode: --offline --cache --refseq
vt_decompose:
container: "docker://hydragenetics/vt:2015.11.10"
vt_normalize:
container: "docker://hydragenetics/vt:2015.11.10"
whatshap_phase:
container: "docker://hydragenetics/whatshap:2.8"
resources.yaml¶
The resources.yaml is located under config/. The file declares default resources used by rules as well as resources for specific rules that needs more resources than allocated by default. See further pipeline configuration.
# ex, default resources
default_resources:
threads: 1
time: "4:00:00"
mem_mb: 6144
mem_per_cpu: 6144
partition: "low"
# ex, rule override
vardict:
time: "48:00:00"
Expand to view current resources.yaml
default_resources:
threads: 1
time: "6:00:00"
mem_mb: 6144
mem_per_cpu: 6144
partition: "low"
arriba:
threads: 5
time: "8:00:00"
mem_mb: 30720
mem_per_cpu: 6144
bwa_mem:
mem_mb: 61440
mem_per_cpu: 6144
threads: 10
time: "24:00:00"
bwa_mem_realign_consensus_reads:
mem_mb: 61440
mem_per_cpu: 6144
threads: 10
time: "24:00:00"
fastp_pe:
threads: 5
mem_mb: 30720
mem_per_cpu: 6144
fastp_pe_arriba:
threads: 5
mem_mb: 30720
mem_per_cpu: 6144
fgbio_call_and_filter_consensus_reads:
threads: 3
mem_mb: 18432
mem_per_cpu: 6144
time: "24:00:00"
fgbio_group_reads_by_umi:
time: "24:00:00"
fuseq_wes:
threads: 2
mem_mb: 12288
time: "48:00:00"
fusioncatcher:
threads: 10
time: "16:00:00"
mem_mb: 61440
mem_per_cpu: 6144
fastqc:
threads: 2
mem_mb: 12288
mem_per_cpu: 6144
juli_call:
threads: 10
mem_mb: 61440
mem_per_cpu: 6144
time: "48:00:00"
jumble_run:
threads: 10
manta_run_workflow_t:
threads: 4
time: "8:00:00"
mem_mb: 24576
mem_per_cpu: 6144
gatk_mutect2_gvcf:
time: "8:00:00"
gatk_mutect2:
time: "8:00:00"
gene_fuse:
threads: 6
time: "8:00:00"
mem_mb: 36864
mem_per_cpu: 6144
optitype:
threads: 6
time: "4:00:00"
mem_mb: 36864
mem_per_cpu: 6144
star:
threads: 8
time: "8:00:00"
mem_mb: 49152
mem_per_cpu: 6144
star_fusion:
threads: 8
time: "16:00:00"
mem_mb: 49152
mem_per_cpu: 6144
vardict:
time: "48:00:00"
vep:
threads: 5
time: "6:00:00"
mem_mb: 30720
mem_per_cpu: 6144
profile yaml¶
Profiles are saved in yaml files and used to control how snakemake will be executed, if jobs will be submitted to a cluster, use singularity, restart on failure and so forth. It also forward requested resources to drmaa using a drmaa variable.
# ex, snakemake settings
jobs: 100
keep-going: True
restart-times: 2
rerun-incomplete: True
use-singularity: True
configfile: "config/config.yaml"
singularity-args: "-e --cleanenv -B /projects -B /data -B /beegfs
# ex, drmaa settings
drmaa: " -A wp1 -N 1-1 -t {resources.time} -n {resources.threads} --mem={resources.mem_mb} --mem-per-cpu={resources.mem_per_cpu} --mem-per-cpu={resources.mem_per_cpu} --partition={resources.partition} -J {rule} -e slurm_out/{rule}_%j.err -o slurm_out/{rule}_%j.out"
drmaa-log-dir: "slurm_out"
default-resources: [threads=1, time="04:00:00", partition="low", mem_mb="3074", mem_per_cpu="3074"]
singularity-prefix: "/path/to/singularity_cache/"
wrapper-prefix: "https://github.com/hydra-genetics/snakemake-wrappers/raw/"
samples.tsv and units.tsv¶
The samples.tsv and units.tsv are input files that must be generated before running the pipeline and should in general be located in the base folder of the analysis folder, can be changed in the config.yaml. See further running the pipeline and create input files.
Example samples.tsv¶
| sample | tumor_content |
|---|---|
| NA12878 | 0.5 |
| NA12879 | 0 |
| NA12880 | 1 |
Example units.tsv¶
| sample | type | machine | platform | flowcell | lane | barcode | fastq1 | fastq2 | adapter |
|---|---|---|---|---|---|---|---|---|---|
| NA12878 | T | NDX550407_RUO | NextSeq | HKTG2BGXG | L001 | ACGGAACA+ACGAGAAC | fastq/NA12878_fastq1.fastq.gz | fastq/NA12878_fastq2.fastq.gz | ACGT,ACGT |
| NA12879 | N | NDX550407_RUO | NextSeq | HKTG2BGXG | L001 | TCGGAACT+TCGAGAAT | fastq/NA12879_fastq1.fastq.gz | fastq/NA12879_fastq2.fastq.gz | ACGT,ACGT |
| NA12880 | R | NDX550407_RUO | NextSeq | HKTG2BGXG | L002 | GCGGAACG+GCGAGAAG | fastq/NA12880_fastq1.fastq.gz | fastq/NA12880_fastq2.fastq.gz | ACGT,ACGT |