Rule specific to the Twist Solid pipeline that are not defined in hydra

bcftools_id_snps

Trim .fastq files by removing adapter sequences and other unwanted sequences. Adapter sequences are specified in units.tsv under the adapter column. See further ID-SNPs info.

🐍 Rule

rule bcftools_id_snps:
    input:
        bam="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam",
        bai="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai",
        bed=config.get("bcftools_id_snps", {}).get("snps_bed", "missing bcftools_id_snps snps_bed file"),
        ref=config.get("reference", {}).get("fasta_rna", "missing reference fasta_rna"),
    output:
        vcf=temp("snv_indels/bcftools_id_snps/{sample}_{type}.id_snps.vcf"),
    wildcard_constraints:
        type="[R]",
    log:
        "snv_indels/bcftools_id_snps/{sample}_{type}.id_snps.vcf.log",
    benchmark:
        repeat(
            "snv_indels/bcftools_id_snps/{sample}_{type}.id_snps.vcf.benchmark.tsv",
            config.get("bcftools_id_snps", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("bcftools_id_snps", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("bcftools_id_snps", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("bcftools_id_snps", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("bcftools_id_snps", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("bcftools_id_snps", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("bcftools_id_snps", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("bcftools_id_snps", {}).get("container", config["default_container"])
    message:
        "{rule}: call id SNPs in the RNA data {output.vcf}"
    shell:
        "(bcftools mpileup "
        "-R {input.bed} "
        "-O u "
        "-f {input.ref} "
        "-d 1000000 "
        "{input.bam} "
        "| bcftools call "
        "--skip-variants indels "
        "-c -O v "
        "-o {output.vcf}) "
        "&> {log}"

↔ input / output files

Rule parameters Key Value Description
input bam "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam" Bam file from an split-read aware aligner like STAR
bai "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai" Bam index file
bed config.get("bcftools_id_snps", {}).get("snps_bed", "missing bcftools_id_snps snps_bed file") Bed file with regions or positions to pileup
ref config.get("reference", {}).get("fasta_rna", "missing reference fasta_rna") Reference genome
output vcf "snv_indels/bcftools_id_snps/{sample}_{type}.id_snps.vcf" Vcf with pileup information for all positions in input bed file

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated for msisensor_pro
container string name or path to docker/singularity container
snps_bed string path to bed file the id SNPs

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

call_small_cnv_amplifications

The CNVkit and GATK CNV caller often miss small amplifications of four exons or smaller. This rule analyses genes of interest and reports small amplifications in these genes when predefined criteria are met. The caller is based on two rolling window averaged over the region and finds the largest difference in log2ratio between these two windows. If the difference is large enough it is reported. See further call small cnv amplifications info.

🐍 Rule

rule call_small_cnv_amplifications:
    input:
        cnv_data="cnv_sv/gatk_denoise_read_counts/{sample}_{type}.clean.denoisedCR.tsv",
        regions_file=config.get("call_small_cnv_amplifications", {}).get("regions_file", ""),
    output:
        amplifications=temp("cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv"),
    params:
        window_size=config.get("call_small_cnv_amplifications", {}).get("window_size", 4),
        region_max_size=config.get("call_small_cnv_amplifications", {}).get("region_max_size", 30),
        min_nr_stdev_diff=config.get("call_small_cnv_amplifications", {}).get("min_nr_stdev_diff", 8),
        min_log_odds_diff=config.get("call_small_cnv_amplifications", {}).get("min_log_odds_diff", 0.4),
    log:
        "cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv.log",
    benchmark:
        repeat(
            "cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv.benchmark.tsv",
            config.get("call_small_cnv_amplifications", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("call_small_cnv_amplifications", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("call_small_cnv_amplifications", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("call_small_cnv_amplifications", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("call_small_cnv_amplifications", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("call_small_cnv_amplifications", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("call_small_cnv_amplifications", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("call_small_cnv_amplifications", {}).get("container", config["default_container"])
    message:
        "{rule}: call small amplifications in cnv data into {output.amplifications}"
    script:
        "../scripts/call_small_cnv_amplifications.py"

↔ input / output files

Rule parameters Key Value Description
input cnv_data "cnv_sv/gatk_denoise_read_counts/{sample}_{type}.clean.denoisedCR.tsv" Copy number region data from GATK CNV
regions_file config.get("call_small_cnv_amplifications", {}).get("regions_file", "") File with genes, gene regions with surrounding cnv-probes and the actual gene region
output amplifications "cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv" Text file with the called amplifications

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string name or path to docker/singularity container
regions_file string File with genes, gene regions with surrounding cnv-probes and the actual gene region.
Defines both the genomic position of the genes and their surrounding regions that should be analysed for this gene.
The size of the analysis region is dependent on the density of the backbone but should at least 20 backbone SNPs.
20 SNPs is used in Pipeline.
window_size integer Number of data points included in the sliding window
region_max_size integer Max number of data points of deletion to be reported
min_nr_stdev_diff number Min number of standard deviations difference between the deletion and surrounding data points in region
min_log_odds_diff number Min log2ratio difference between the deletion and the average of surrounding data points in region

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

call_small_cnv_deletions

The CNVkit and GATK CNV caller often miss small deletions of four exons or smaller. This rule analyses genes of interest and reports small deletions in these genes when predefined criteria are met. The caller is based on two rolling window averaged over the region and finds the largest difference in log2ratio between these two windows. If the difference is large enough it is reported. See further call small cnv deletions info.

🐍 Rule

rule call_small_cnv_deletions:
    input:
        cnv_data="cnv_sv/gatk_denoise_read_counts/{sample}_{type}.clean.denoisedCR.tsv",
        regions_file=config.get("call_small_cnv_deletions", {}).get("regions_file", ""),
    output:
        deletions=temp("cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv"),
    params:
        blacklist=config.get("call_small_cnv_deletions", {}).get("blacklist", ""),
        min_nr_stdev_diff=config.get("call_small_cnv_deletions", {}).get("min_nr_stdev_diff", 2.5),
        min_log_odds_diff=config.get("call_small_cnv_deletions", {}).get("min_log_odds_diff", 0.3),
        region_max_size=config.get("call_small_cnv_deletions", {}).get("region_max_size", 30),
        window_size=config.get("call_small_cnv_deletions", {}).get("window_size", 4),
    log:
        "cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv.log",
    benchmark:
        repeat(
            "cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv.benchmark.tsv",
            config.get("call_small_cnv_deletions", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("call_small_cnv_deletions", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("call_small_cnv_deletions", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("call_small_cnv_deletions", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("call_small_cnv_deletions", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("call_small_cnv_deletions", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("call_small_cnv_deletions", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("call_small_cnv_deletions", {}).get("container", config["default_container"])
    message:
        "{rule}: call small deletions in cnv data into {output.deletions}"
    script:
        "../scripts/call_small_cnv_deletions.py"

↔ input / output files

Rule parameters Key Value Description
input cnv_data "cnv_sv/gatk_denoise_read_counts/{sample}_{type}.clean.denoisedCR.tsv" Copy number region data from GATK CNV
regions_file config.get("call_small_cnv_deletions", {}).get("regions_file", "") File with genes, gene regions with surrounding cnv-probes and the actual gene region
output deletions "cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv" Text file with the called deletions

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string name or path to docker/singularity container
regions_file string File with genes, gene regions with surrounding cnv-probes and the actual gene region.
Defines both the genomic position of the genes and their surrounding regions that should be analysed for this gene.
The size of the analysis region is dependent on the density of the backbone but should at least 20 backbone SNPs.
40 SNPs is used in Pipeline.
blacklist string File with gene regions that should be filtered (Format is header + gene\tchr\tpos1\tpos2)
min_nr_stdev_diff number Min number of standard deviations difference between the deletion and surrounding data points in region
min_log_odds_diff number Min log2ratio difference between the deletion and the average of surrounding data points in region
region_max_size integer Max number of data points of deletion to be reported
window_size integer Number of data points included in the sliding window

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

cnv_tsv_report

Collect all CNV calls into an excel friendly text file. Adds potential 1p19q calls. Also add a FP flag for variants called by CNVkit that in GATK CNV does not have any signal in that region. See further CNV tsv report info.

🐍 Rule

rule cnv_tsv_report:
    input:
        amplifications="cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv",
        chrom_arm_size=config.get("cnv_tsv_report", {}).get("chrom_arm_size", ""),
        deletions="cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv",
        gatk_cnr="cnv_sv/gatk_denoise_read_counts/{sample}_{type}.clean.denoisedCR.tsv",
        org_vcfs=[
            "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.fp_tag.vcf",
            "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.{tag}.fp_tag.vcf",
        ],
        tc_file=get_tc_file,
        vcfs=[
            "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.filter.cnv_hard_filter_amp.fp_tag.vcf",
            "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.{tag}.filter.cnv_hard_filter_loh.fp_tag.vcf",
        ],
    output:
        tsv=temp("cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_report.tsv"),
        tsv_additional_only=temp("cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_additional_variants_only.tsv"),
        tsv_chrom_arms=temp("cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_chromosome_arms.tsv"),
        vcf_del="cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.{tag}.filter.cnv_hard_filter_loh.fp_tag.annotate_fp.vcf",
    params:
        amp_chr_arm_cn_limit=config.get("cnv_tsv_report", {}).get("amp_chr_arm_cn_limit", ""),
        baseline_fraction_limit=config.get("cnv_tsv_report", {}).get("baseline_fraction_limit", ""),
        call_small_amplifications_cn_limit=config.get("cnv_tsv_report", {}).get("amp_cn_limit", ""),
        chr_arm_fraction=config.get("cnv_tsv_report", {}).get("chr_arm_fraction", ""),
        del_chr_arm_cn_limit=config.get("cnv_tsv_report", {}).get("del_chr_arm_cn_limit", ""),
        del_1p19q_cn_limit=config.get("cnv_tsv_report", {}).get("del_1p19q_cn_limit", ""),
        del_1p19q_chr_arm_fraction=config.get("cnv_tsv_report", {}).get("del_1p19q_chr_arm_fraction", ""),
        max_cnv_fp_size=config.get("cnv_tsv_report", {}).get("max_cnv_fp_size", ""),
        normal_cn_lower_limit=config.get("cnv_tsv_report", {}).get("normal_cn_lower_limit", ""),
        normal_cn_upper_limit=config.get("cnv_tsv_report", {}).get("normal_cn_upper_limit", ""),
        normal_baf_lower_limit=config.get("cnv_tsv_report", {}).get("normal_baf_lower_limit", ""),
        normal_baf_upper_limit=config.get("cnv_tsv_report", {}).get("normal_baf_upper_limit", ""),
        polyploidy_fraction_limit=config.get("cnv_tsv_report", {}).get("polyploidy_fraction_limit", ""),
        tc=get_tc,
    log:
        "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_report.tsv.log",
    benchmark:
        repeat(
            "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_report.tsv.benchmark.tsv",
            config.get("cnv_tsv_report", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("cnv_tsv_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("cnv_tsv_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("cnv_tsv_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("cnv_tsv_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("cnv_tsv_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("cnv_tsv_report", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("cnv_tsv_report", {}).get("container", config["default_container"])
    message:
        "{rule}: Convert cnv vcf to a tsv file: {output.tsv}"
    script:
        "../scripts/cnv_report.py"

↔ input / output files

Rule parameters Key Value Description
input amplifications "cnv_sv/call_small_cnv_amplifications/{sample}_{type}.amplifications.tsv" File with small CNV amplifications generated by call_small_cnv_amplifications
deletions "cnv_sv/call_small_cnv_deletions/{sample}_{type}.deletions.tsv" File with small CNV amplifications generated by call_small_cnv_deletions
org_vcfs [ "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.fp_tag.vcf", "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.{tag}.fp_tag.vcf", ] Unfiltered vcf-files with merged calls from CNVkit and GATK CNV
tc_file get_tc_file Tumor content of sample
vcfs [ "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_amp_genes.filter.cnv_hard_filter_amp.fp_tag.vcf", "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.{tag}.filter.cnv_hard_filter_loh.fp_tag.vcf", ] Filtered vcf-files with merged calls from CNVkit and GATK CNV
output tsv "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_report.tsv" A CNV report in excel friendly format
tsv_additional_only "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_additional_variants_only.tsv" File where only additional CNVs are reported. Used for extra table in CNV-html report.
tsv_chrom_arms "cnv_sv/svdb_query/{sample}_{type}.{tc_method}.{tag}.cnv_chromosome_arms.tsv" File where large CNVs are reported. Used for extra table in CNV-html report.

🔧 Configuration

Software settings (config.yaml)

Key Type Description
amp_chr_arm_cn_limit number Lower limit of copy number to be included in calculation of chr_arm_fraction for duplications
baseline_fraction_limit number Lower limit of fraction with neutral copy segments and neutral BAF values. A warning of potentially incorrect baseline is reported otherwise.
container string name or path to docker/singularity container
call_small_amplifications_cn_limit number Upper limit of copy number to be reported as a small amplification.
chr_arm_fraction number Fraction of the chromosome arms that needs to be deleted
del_1p19q_cn_limit number Upper limit of copy number to be included in calculation of del_1p19q_chr_arm_fraction
del_1p19q_chr_arm_fraction number Fraction of the chromosome arms that needs to be deleted
del_chr_arm_cn_limit number Upper limit of copy number to be included in calculation of chr_arm_fraction for deletions
max_cnv_fp_size number Max size of cnvkit cnv to be tested for false positive
normal_cn_lower_limit number Lower limit of copy number to be counted as a normal segment
normal_cn_upper_limit number Upper limit of copy number to be counted as a normal segment
normal_baf_lower_limit number Lower limit of BAF in segment to be counted as a normal segment
normal_baf_upper_limit number Upper limit of BAF in segment to be counted as a normal segment
polyploidy_fraction_limit number Upper limit of fraction with aberrant BAF / copy number segment combination. A warning of potential polyploidy is reported otherwise.

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

estimate_ctdna_fraction

Estimate ctDNA fraction based on CNV and SNP information using an in-house script. For CNV based estimation the VAF-values of germline SNPs (the BAF-plot) for each segment is used to make an density calculation. If two peaks are found on opposite sides of 50% allele frequency the separation of the peak is used to calculate the ctDNA fraction. The highest ctDNA fraction is reported. If no CNV segments have peak separation 0% is reported. For SNV based estimation the ctDNA is reported as the highest variant allele frequency SNV times two.

🐍 Rule

rule estimate_ctdna_fraction:
    input:
        germline_vcf="snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.exclude.blacklist.vcf.gz",
        germline_vcf_tabix="snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.exclude.blacklist.vcf.gz.tbi",
        segments="cnv_sv/jumble_vcf/{sample}_{type}.pathology_purecn.vcf",
        vcf="snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.artifact_annotated.hotspot_annotated.background_annotated.include.exon.filter.snv_hard_filter_umi.codon_snvs.sorted.vep_annotated.qci.vcf",
    output:
        ctDNA_fraction=temp("cnv_sv/estimate_ctdna_fraction/{sample}_{type}.ctDNA_fraction.tsv"),
        ctDNA_fraction_info=temp("cnv_sv/estimate_ctdna_fraction/{sample}_{type}.ctDNA_fraction_info.tsv"),
    params:
        gnomAD_AF_limit=config.get("estimate_ctdna_fraction", {}).get("gnomAD_AF_limit", 0.00001),
        max_somatic_af=config.get("estimate_ctdna_fraction", {}).get("max_somatic_af", 0.4),
        min_germline_af=config.get("estimate_ctdna_fraction", {}).get("min_germline_af", 0.1),
        min_nr_SNPs_per_segment=config.get("estimate_ctdna_fraction", {}).get("min_nr_SNPs_per_segment", 35),
        min_segment_length=config.get("estimate_ctdna_fraction", {}).get("min_segment_length", 10000000),
        problematic_regions_beds=config.get("estimate_ctdna_fraction", {}).get("problematic_regions_beds", []),
        vaf_baseline=config.get("estimate_ctdna_fraction", {}).get("vaf_baseline", 0.48),
    log:
        "twist_solid/estimate_ctdna_fraction/{sample}_{type}.ctDNA_tc.tsv.log",
    benchmark:
        repeat(
            "twist_solid/estimate_ctdna_fraction/{sample}_{type}.ctDNA_tc.tsv.benchmark.tsv",
            config.get("estimate_ctdna_fraction", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("estimate_ctdna_fraction", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("estimate_ctdna_fraction", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("estimate_ctdna_fraction", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("estimate_ctdna_fraction", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("estimate_ctdna_fraction", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("estimate_ctdna_fraction", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("estimate_ctdna_fraction", {}).get("container", config["default_container"])
    message:
        "{rule}: estimate ctdna fraction based on CNV and SNV data into {output.ctDNA_fraction}"
    script:
        "../scripts/estimate_ctDNA_fraction.py"

↔ input / output files

Rule parameters Key Value Description
input germline_vcf "snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.exclude.blacklist.vcf.gz" Germline vcf only. Same file as used in BAF plots of the CNV html report
segments "cnv_sv/jumble_vcf/{sample}_{type}.pathology_purecn.vcf" CNV segments (from Jumble)
vcf "snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.artifact_annotated.hotspot_annotated.background_annotated.include.exon.filter.snv_hard_filter_umi.codon_snvs.sorted.vep_annotated.qci.vcf" Small variant vcf with vep annotations
output ctDNA_fraction "cnv_sv/estimate_ctdna_fraction/{sample}_{type}.ctDNA_fraction.tsv" Estimated ctDNA fraction values based on CNV and SNV data
ctDNA_fraction_info "cnv_sv/estimate_ctdna_fraction/{sample}_{type}.ctDNA_fraction_info.tsv" CNV segment info for regions with deletion signals as well as variants passing all filters

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
gnomAD_AF_limit number additional filtering based on existence in GnomAD (removes germline)
max_somatic_af number additional filtering based on maximal allowed allele frequency (removes germline)
min_germline_af number minimial germline allele frequency to be included in density calculations (removes somatic)
min_nr_SNPs_per_segment integer minimal number of germline SNPs in segments to be used for ctDNA estimation
min_segment_length integer minimal length of segments to be used for ctDNA estimation
problematic_regions_beds array list of bedfiles
vaf_baseline number Sets the VAF-baseline. In theory BAF-baseline is 0.5 (50% for each SNP), however in practice this is slightly shifted downwards due to reference bias in alignment

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

exon_skipping

Calls exon skipping events in RNA data. Only reports MET exon 14 skipping and the EGFRvIII variant. See further exon skipping info.

🐍 Rule

rule exon_skipping:
    input:
        bed=config.get("exon_skipping", {}).get("design_bed", ""),
        junction="alignment/star/{sample}_{type}.SJ.out.tab",
    output:
        result=temp("fusions/exon_skipping/{sample}_{type}.results.tsv"),
    log:
        "fusions/exon_skipping/{sample}_{type}.results.tsv.log",
    benchmark:
        repeat(
            "fusions/exon_skipping/{sample}_{type}.results.tsv.benchmark.tsv",
            config.get("exon_skipping", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("exon_skipping", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("exon_skipping", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("exon_skipping", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("exon_skipping", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("exon_skipping", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("exon_skipping", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("exon_skipping", {}).get("container", config["default_container"])
    message:
        "{rule}: Find intergenic fusions and report them in {output.result}"
    script:
        "../scripts/exon_skipping.py"

↔ input / output files

Rule parameters Key Value Description
input bed config.get("exon_skipping", {}).get("design_bed", "") Bed file used to extract exonic genomic regions
junction "alignment/star/{sample}_{type}.SJ.out.tab" Junction file from STAR based on split reads
output result "fusions/exon_skipping/{sample}_{type}.results.tsv" Report with info regarding MET exon 14 and EGFR vIII skipping

🔧 Configuration

Software settings (config.yaml)

Key Type Description
design_bed string path to design bed file

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

fix_vcf_ad_for_qci

Produces a QCI compatible vcf file with SNV and INDEL calls were the AD field has been corrected to so that the numbers corresponds with the AF field. QCI uses the AD field to calculate allele frequency. See further QCI fix info.

🐍 Rule

rule fix_vcf_ad_for_qci:
    input:
        vcf="{file}.vcf",
    output:
        vcf="{file}.qci.vcf",
    log:
        "{file}.qci.vcf.log",
    benchmark:
        repeat(
            "{file}.qci.vcf.benchmark.tsv",
            config.get("fix_vcf_ad_for_qci", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("fix_vcf_ad_for_qci", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("fix_vcf_ad_for_qci", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("fix_vcf_ad_for_qci", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("fix_vcf_ad_for_qci", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("fix_vcf_ad_for_qci", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("fix_vcf_ad_for_qci", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("fix_vcf_ad_for_qci", {}).get("container", config["default_container"])
    message:
        "{rule}: Correct AD so the correct AF is shown in QCI for vcf {input.vcf}"
    script:
        "../scripts/fix_vcf_ad_for_qci.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "{file}.vcf" Vcf file with SNV and INDEL calls
output vcf "{file}.qci.vcf" QCI compatible vcf file with SNV and INDEL calls were the AD field has been corrected

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

hotspot_report

A excel friendly text file containing information of all hotspot postions as well as all called variants. For all hotspot positions the coverage and alternative allele information is reported as well as additional annotations. The columns reported are configured by the config file hotspot_report.yaml. See further Hotspot report info.

🐍 Rule

rule hotspot_report:
    input:
        hotspots=lambda wildcards: config["hotspot_report"]["hotspot_mutations"][wildcards.tag],
        vcf=lambda wildcards: "%s%s" % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated.vcf.gz"),
        vcf_index=lambda wildcards: "%s%s" % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated.vcf.gz.tbi"),
        vcf_file_wo_pick=lambda wildcards: "%s%s" % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated_wo_pick.vcf.gz"),
        vcf_file_wo_pick_index=lambda wildcards: "%s%s"
        % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated_wo_pick.vcf.gz"),
        gvcf="qc/add_mosdepth_coverage_to_gvcf/{sample}_{type}.mosdepth.g.vcf.gz",
        gvcf_index="qc/add_mosdepth_coverage_to_gvcf/{sample}_{type}.mosdepth.g.vcf.gz.tbi",
    output:
        report=temp("qc/hotspot_report/{sample}_{type}.{tag}.output.tsv"),
    params:
        levels=config.get("hotspot_report", {}).get("levels", []),
        sample_name=lambda wildcards: "%s_%s" % (wildcards.sample, wildcards.type),
        report_config=config.get("hotspot_report", {})["report_config"],
        chr_translation_file=config.get("hotspot_report", {})["chr_translation_file"],
        extra=config.get("hotspot_report", {}).get("extra", ""),
    log:
        "qc/hotspot_report/{sample}_{type}.{tag}.output.tsv.log",
    benchmark:
        repeat(
            "qc/hotspot_report/{sample}_{type}.{tag}.output.benchmark.tsv",
            config.get("hotspot_report", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("hotspot_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("hotspot_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("hotspot_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("hotspot_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("hotspot_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("hotspot_report", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("hotspot_report", {}).get("container", config["default_container"])
    message:
        "{rule}: Make a coverage and mutations report: {output.report}"
    script:
        "../scripts/hotspot_report.py"

↔ input / output files

Rule parameters Key Value Description
input vcf lambda wildcards: "%s%s" % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated.vcf.gz") Vcf file with filtered SNV and INDEL calls
vcf_index lambda wildcards: "%s%s" % (get_hotspot_report_vcf_input(wildcards), ".vep_annotated.vcf.gz.tbi") Vcf tabix index file
gvcf "qc/add_mosdepth_coverage_to_gvcf/{sample}_{type}.mosdepth.g.vcf.gz" Genome vcf file coverage and allele information on all positions in the panel
gvcf_index "qc/add_mosdepth_coverage_to_gvcf/{sample}_{type}.mosdepth.g.vcf.gz.tbi" Genome vcf tabix index file
output report "qc/hotspot_report/{sample}_{type}.{tag}.output.tsv" Excel friendly text report with hotspot position qc information

🔧 Configuration

Software settings (config.yaml)

Key Type Description
hotspot_mutations object List of files with hotspot positions and annotations. The names of the list items are found in the file names as {tag}.
report_config string file defining which fields should be extracted from the annotation
chr_translation_file string file used to translate chr to NC id. The format is a tab separate file with column chr and NC
levels array List of configurations.
1. depth above this value
2. depth level status, ex ok or low
3. depth status; ok, not analyzable
Example: - [200, "ok", "yes"]
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

hotspot_report_aa_translate

Take the output from hotspot report (coverage_and mutations) and translate the amino acid three letter code into one letter code using an in-house script.

🐍 Rule

rule hotspot_report_aa_translate:
    input:
        report="qc/hotspot_report/{sample}_{type}.{tag}.output.tsv",
    output:
        report=temp("qc/hotspot_report/{sample}_{type}.{tag}.aa_translated.tsv"),
    log:
        "twist_solid/hotspot_report_aa_translate/{sample}_{type}.{tag}.aa_translated.tsv.log",
    benchmark:
        repeat(
            "twist_solid/hotspot_report_aa_translate/{sample}_{type}.{tag}.aa_translated.tsv.benchmark.tsv",
            config.get("hotspot_report_aa_translate", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("hotspot_report_aa_translate", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("hotspot_report_aa_translate", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("hotspot_report_aa_translate", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("hotspot_report_aa_translate", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("hotspot_report_aa_translate", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("hotspot_report_aa_translate", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("hotspot_report_aa_translate", {}).get("container", config["default_container"])
    message:
        "{rule}: Translate aa three letter codes into one letter code in {input.report}"
    script:
        "../scripts/aa_translate.py"

↔ input / output files

Rule parameters Key Value Description
input report "qc/hotspot_report/{sample}_{type}.{tag}.output.tsv" coverage and mutations file from hotspot report
output report "qc/hotspot_report/{sample}_{type}.{tag}.aa_translated.tsv" coverage and mutations file from hotspot report with aa change in one letter code

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

jumble_gis_score

Extract the predicted Genomic Instability Score (GIS) from Jumble's output for the current sample's tumor content (TC).

🐍 Rule

rule jumble_gis_score:
    input:
        gis_csv="cnv_sv/jumble_run/{sample}_{type}/{sample}_{type}.jumble_gis.csv",
    output:
        gis_score=temp("cnv_sv/jumble_gis_score/{sample}_{type}.{tc_method}.predicted_gis.txt"),
    params:
        tc=get_tc,
    log:
        "cnv_sv/jumble_gis_score/{sample}_{type}.{tc_method}.predicted_gis.log",
    benchmark:
        repeat(
            "cnv_sv/jumble_gis_score/{sample}_{type}.{tc_method}.predicted_gis.benchmark.tsv",
            config.get("jumble_gis_score", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("jumble_gis_score", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("jumble_gis_score", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("jumble_gis_score", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("jumble_gis_score", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("jumble_gis_score", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("jumble_gis_score", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("jumble_gis_score", {}).get("container", config["default_container"])
    message:
        "{rule}: Extract predicted GIS score for {wildcards.sample}_{wildcards.type} at {params.tc} TC"
    script:
        "../scripts/extract_gis_score.py"

↔ input / output files

Rule parameters Key Value Description
input gis_csv "cnv_sv/jumble_run/{sample}_{type}/{sample}_{type}.jumble_gis.csv" GIS csv file from jumble_run
output gis_score "cnv_sv/jumble_gis_score/{sample}_{type}.{tc_method}.predicted_gis.txt" Extracted predicted GIS score for current TC

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
tc number tumor content

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

house_keeping_gene_coverage

List the average coverage of the housekeeping genes GAPDH, GUSB, OAZ1, and POLR2A. See further house keeping gene coverage info.

🐍 Rule

rule house_keeping_gene_coverage:
    input:
        bam="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam",
        bai="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai",
        bed=config.get("reference", {}).get("design_bed_rna", ""),
    output:
        result=temp("qc/house_keeping_gene_coverage/{sample}_{type}.house_keeping_gene_coverage.tsv"),
    log:
        "fusions/house_keeping_gene_coverage/{sample}_{type}.house_keeping_gene_coverage.tsv.log",
    benchmark:
        repeat(
            "fusions/house_keeping_gene_coverage/{sample}_{type}.house_keeping_gene_coverage.tsv.benchmark.tsv",
            config.get("house_keeping_gene_coverage", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("house_keeping_gene_coverage", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("house_keeping_gene_coverage", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("house_keeping_gene_coverage", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("house_keeping_gene_coverage", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("house_keeping_gene_coverage", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("house_keeping_gene_coverage", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("house_keeping_gene_coverage", {}).get("container", config["default_container"])
    message:
        "{rule}: Find intergenic fusions and report them in {output.result}"
    script:
        "../scripts/house_keeping_gene_coverage.py"

↔ input / output files

Rule parameters Key Value Description
input bam "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam" Bam file
bai "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai" Bam index file
bed config.get("reference", {}).get("design_bed_rna", "") Bed file with gene regions used to obtain genomic coordinates
output result "qc/house_keeping_gene_coverage/{sample}_{type}.house_keeping_gene_coverage.tsv" Excel frienly text report with gene coverage information

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

report_fusions

Make a combined report of filtered fusion calls from all RNA fusion callers. Applies caller specific thresholds based on read support and fusion type. Annotates breakpoints to genes and exons when available. See further RNA fusions report info.

🐍 Rule

rule report_fusions:
    input:
        arriba="fusions/arriba/{sample}_{type}.fusions.tsv",
        bam="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam",
        bai="fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai",
        bed=config.get("reference", {}).get("design_bed_rna", ""),
        bed_extra_annotation=config.get("report_fusions", {}).get("annotation_bed", ""),
        dedup_coverage="qc/mosdepth/{sample}_{type}.regions.bed.gz",
        fusioncatcher="fusions/fusioncatcher/{sample}_{type}/final-list_candidate-fusion-genes.hg19.txt",
        star_fusion="fusions/star_fusion/{sample}_{type}/star-fusion.fusion_predictions.abridged.coding_effect.tsv",
    output:
        fusions=temp("fusions/report_fusions/{sample}_{type}.fusion_report.tsv"),
    params:
        fp_fusions=config.get("report_fusions", {}).get("fp_fusions", ""),
        fusioncatcher_flag_low_support=config.get("report_fusions", {}).get("fusioncatcher_flag_low_support", 15),
        fusioncatcher_low_support=config.get("report_fusions", {}).get("fusioncatcher_low_support", 3),
        fusioncatcher_low_support_inframe=config.get("report_fusions", {}).get("fusioncatcher_low_support_inframe", 6),
        star_fusion_flag_low_support=config.get("report_fusions", {}).get("star_fusion_flag_low_support", 15),
        star_fusion_low_support=config.get("report_fusions", {}).get("star_fusion_low_support", 2),
        star_fusion_low_support_inframe=config.get("report_fusions", {}).get("star_fusion_low_support_inframe", 6),
    log:
        "qc/report_fusions/{sample}_{type}.fusion_report.tsv.log",
    threads: config.get("report_fusions", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("report_fusions", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("report_fusions", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("report_fusions", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("report_fusions", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("report_fusions", {}).get("time", config["default_resources"]["time"]),
    benchmark:
        repeat(
            "qc/report_fusions/{sample}_{type}.fusion_report.tsv.benchmark.tsv",
            config.get("report_fusions", {}).get("benchmark_repeats", 1),
        )
    container:
        config.get("report_fusions", {}).get("container", config["default_container"])
    message:
        "{rule}: Collect and filter fusions and create report: {output.fusions}"
    script:
        "../scripts/report_fusions.py"

↔ input / output files

Rule parameters Key Value Description
input arriba "fusions/arriba/{sample}_{type}.fusions.tsv" Called fusions by Arriba
bam "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam" Bam file
bai "fusions/star_fusion/{sample}_{type}/Aligned.out.sorted.bam.bai" Bam index file
bed config.get("reference", {}).get("design_bed_rna", "") Bed file with regions of the design
bed_extra_annotation config.get("report_fusions", {}).get("annotation_bed", "") Bed file with gene annotations of regions outide of the design
fusioncatcher "fusions/fusioncatcher/{sample}_{type}/final-list_candidate-fusion-genes.hg19.txt" Called fusions by FusionCatcher
star_fusion "fusions/star_fusion/{sample}_{type}/star-fusion.fusion_predictions.abridged.coding_effect.tsv" Called fusions by StarFusion
output fusions "fusions/report_fusions/{sample}_{type}.fusion_report.tsv" Excel friendly text report with filtered fusion calls from respective caller

🔧 Configuration

Software settings (config.yaml)

Key Type Description
annotation_bed string extra bed file for annotation of fusion partner gene exons not in design
fusioncather_flag_low_support integer lower limit of supporting reads to flag for low support for FusionCatcher
fusioncather_low_support integer lower limit of supporting reads to filter fusion for FusionCatcher
fusioncather_low_support_fp_genes integer lower limit of supporting reads to filter fusion in common fp genes for FusionCatcher
fusioncather_low_support_inframe integer lower limit of supporting reads to filter fusion that are inframe for FusionCatcher
star_fusion_flag_low_support integer lower limit of supporting reads to filter fusion for StarFusion
star_fusion_low_support integer lower limit of supporting reads to filter fusion for StarFusion
star_fusion_low_support_fp_genes integer lower limit of supporting reads to flag filter fusion in common fp genes for StarFusion
star_fusion_low_support_inframe integer lower limit of supporting reads to flag filter fusion that are inframe for StarFusion

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

report_gene_fuse

Make a report of filtered fusion calls from the DNA fusion caller Gene Fuse. Applies configurable thresholds for supporting reads and flags noisy genes as well as filters artifact genes. See further DNA fusions report info.

🐍 Rule

rule report_gene_fuse:
    input:
        fusions="fusions/gene_fuse/{sample}_{type}_gene_fuse_fusions.txt",
    output:
        report=temp("fusions/report_gene_fuse/{sample}_{type}.gene_fuse_report.tsv"),
    params:
        filter_fusions=config.get("report_gene_fuse", {}).get("filter_fusions", ""),
        min_unique_reads=config.get("report_gene_fuse", {}).get("min_unique_reads", 6),
    log:
        "fusions/report_gene_fuse/{sample}_{type}.gene_fuse_report.tsv.log",
    threads: config.get("report_gene_fuse", {}).get("threads", config["default_resources"]["threads"])
    resources:
        threads=config.get("report_gene_fuse", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("report_gene_fuse", {}).get("time", config["default_resources"]["time"]),
        mem_mb=config.get("report_gene_fuse", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("report_gene_fuse", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("report_gene_fuse", {}).get("partition", config["default_resources"]["partition"]),
    benchmark:
        repeat(
            "fusions/report_gene_fuse/{sample}_{type}.gene_fuse_report.tsv.benchmark.tsv",
            config.get("report_gene_fuse", {}).get("benchmark_repeats", 1),
        )
    container:
        config.get("report_gene_fuse", {}).get("container", config["default_container"])
    message:
        "{rule}: Collect and filter gene fuse dna fusions and create report: {output.report}"
    script:
        "../scripts/report_gene_fuse.py"

↔ input / output files

Rule parameters Key Value Description
input fusions "fusions/gene_fuse/{sample}_{type}_gene_fuse_fusions.txt" Called fusions by GeneFuse
output report "fusions/report_gene_fuse/{sample}_{type}.gene_fuse_report.tsv" Excel friendly text report with filtered fusion calls from GeneFuse

🔧 Configuration

Software settings (config.yaml)

Key Type Description
filter_fusions string file specifying fusions that should be filtered completely (value 0 in column 2) or have higher limit (value >0 in column 2)
min_unique_reads integer lower limit of uniquely supporting reads to report fusion

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

purecn_modify_vcf

Increases the MQB (mean base quality) value by 5 as the qualities are so bad for our samples.

🐍 Rule

rule purecn_modify_vcf:
    input:
        vcf="snv_indels/gatk_mutect2/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.vcf",
    output:
        vcf="cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf",
    params:
        extra=config.get("purecn_modify_vcf", {}).get("extra", ""),
    log:
        "cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf.benchmark.tsv",
            config.get("purecn_modify_vcf", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("purecn_modify_vcf", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("purecn_modify_vcf", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("purecn_modify_vcf", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("purecn_modify_vcf", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("purecn_modify_vcf", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("purecn_modify_vcf", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("purecn_modify_vcf", {}).get("container", config["default_container"])
    message:
        "{rule}: modify mbq in {input.vcf}"
    shell:
        # new lines are used to tell the linter that it is not an absolut path
        '(sed -r \'s/(.*)(\\MBQ=[0-9]+,)([0-9]+)(.*)/echo "\\1\\2$((\\3+5))\\4"/'
        "ge' {input.vcf} | "
        'sed -r \'s/(.*)(\\MBQ=)([0-9]+)(.*)/echo "\\1\\2$((\\3+5))\\4"/'
        "ge' > "
        "{output.vcf}) &> {log}"

↔ input / output files

Rule parameters Key Value Description
input vcf "snv_indels/gatk_mutect2/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.vcf" hard filtered vcf with variants from mutect2 for purecn
output vcf "cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf" hard filtered vcf with corrected MBQ values for purecn

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

sample_mixup_check

Compare ID-SNPs in the RNA samples to the DNA samples in the same analysis and report sample similarities to be able to discern sample mixups

🐍 Rule

rule sample_mixup_check:
    input:
        id_snp_vcf_dna=[
            "snv_indels/bcftools_id_snps/%s_%s.id_snps.vcf" % (sample, unit_type)
            for sample in samples.index
            for unit_type in get_unit_types(units, sample)
            if unit_type == "T"
        ],
        id_snp_vcf_rna=[
            "snv_indels/bcftools_id_snps/%s_%s.id_snps.vcf" % (sample, unit_type)
            for sample in samples.index
            for unit_type in get_unit_types(units, sample)
            if unit_type == "R"
        ],
    output:
        mixup_report="qc/sample_mixup_check/sample_mixup_check.tsv",
    params:
        extra=config.get("sample_mixup_check", {}).get("extra", ""),
        match_cutoff=config.get("sample_mixup_check", {}).get("match_cutoff", "0.7"),
    log:
        "twist_solid/sample_mixup_check/sample_mixup_check.tsv.log",
    benchmark:
        repeat(
            "twist_solid/sample_mixup_check/sample_mixup_check.tsv.benchmark.tsv",
            config.get("sample_mixup_check", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("sample_mixup_check", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("sample_mixup_check", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("sample_mixup_check", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("sample_mixup_check", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("sample_mixup_check", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("sample_mixup_check", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("sample_mixup_check", {}).get("container", config["default_container"])
    message:
        "{rule}: find most similar sample in rna ID-snps compared to dna ID-snps"
    script:
        "../scripts/sample_mixup_check.py"

↔ input / output files

Rule parameters Key Value Description
input id_snp_vcf_dna [ "snv_indels/bcftools_id_snps/%s_%s.id_snps.vcf" % (sample, unit_type) for sample in samples.index for unit_type in get_unit_types(units, sample) if unit_type == "T" ] list of vcf files with ID-SNP positions called by bcftools for the dna samples
id_snp_vcf_rna [ "snv_indels/bcftools_id_snps/%s_%s.id_snps.vcf" % (sample, unit_type) for sample in samples.index for unit_type in get_unit_types(units, sample) if unit_type == "R" ] list of vcf files with ID-SNP positions called by bcftools for the rna samples
output mixup_report "qc/sample_mixup_check/sample_mixup_check.tsv" text report where the most similar RNA and DNA sample pairs in one analysis are reported together with the percent similarity

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded
match_cutoff number minimum fraction of matching SNP calls for an RNA and DNA matched sample

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

somalier_best_match_report

Takes the somalier relate pairs output and finds the best match for each sample (DNA and RNA). The script handles the bidirectional nature of the pairs file to ensure every sample is reported with its top relatedness match.

🐍 Rule

rule somalier_best_match_report:
    input:
        pairs="qc/somalier_ungrouped/somalier_relate.pairs.tsv",
    output:
        report=temp("qc/somalier_ungrouped/somalier_best_match.tsv"),
    params:
        extra=config.get("somalier_best_match_report", {}).get("extra", ""),
        match_cutoff=config.get("somalier_best_match_report", {}).get("match_cutoff", 0.7),
    log:
        "qc/somalier_ungrouped/somalier_best_match.tsv.log",
    benchmark:
        repeat(
            "qc/somalier_ungrouped/somalier_best_match.tsv.benchmark.tsv",
            config.get("somalier_best_match_report", {}).get("benchmark_repeats", 1),
        )
    container:
        config.get("somalier_best_match_report", {}).get("container", config["default_container"])
    threads: config.get("somalier_best_match_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("somalier_best_match_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("somalier_best_match_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("somalier_best_match_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("somalier_best_match_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("somalier_best_match_report", {}).get("time", config["default_resources"]["time"]),
    message:
        "{rule}: generating best match report from somalier pairs"
    script:
        "../scripts/somalier_best_match.py"

↔ input / output files

Rule parameters Key Value Description
input pairs "qc/somalier_ungrouped/somalier_relate.pairs.tsv" Somalier relate pairs file
output report "qc/somalier_ungrouped/somalier_best_match.tsv" Somalier best match report

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded
match_cutoff number relatedness cutoff for considering two samples a match (default 0.7)

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available

somalier_dna_dna_report

Takes the somalier relate pairs output and finds the best DNA-DNA match for each DNA sample. This is used to detect potential sample mixups between DNA samples.

🐍 Rule

rule somalier_dna_dna_report:
    input:
        pairs="qc/somalier_ungrouped/somalier_relate.pairs.tsv",
    output:
        report=temp("qc/somalier_ungrouped/somalier_dna_dna_match.tsv"),
    params:
        extra=config.get("somalier_dna_dna_report", {}).get("extra", ""),
        match_cutoff=config.get("somalier_dna_dna_report", {}).get("match_cutoff", 0.85),
    log:
        "qc/somalier_ungrouped/somalier_dna_dna_match.tsv.log",
    benchmark:
        repeat(
            "qc/somalier_ungrouped/somalier_dna_dna_match.tsv.benchmark.tsv",
            config.get("somalier_dna_dna_report", {}).get("benchmark_repeats", 1),
        )
    container:
        config.get("somalier_dna_dna_report", {}).get("container", config["default_container"])
    threads: config.get("somalier_dna_dna_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("somalier_dna_dna_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("somalier_dna_dna_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("somalier_dna_dna_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("somalier_dna_dna_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("somalier_dna_dna_report", {}).get("time", config["default_resources"]["time"]),
    message:
        "{rule}: generating DNA-DNA best match report from somalier pairs"
    script:
        "../scripts/somalier_dna_dna_match.py"

↔ input / output files

Rule parameters Key Value Description
input pairs "qc/somalier_ungrouped/somalier_relate.pairs.tsv" Somalier relate pairs file
output report "qc/somalier_ungrouped/somalier_dna_dna_match.tsv" Somalier DNA-DNA best match report

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
match_cutoff number relatedness cutoff for match (default 0.85)

Resources settings (resources.yaml)

Key Type Description
mem_mb integer memory in MB used per cpu
mem_per_cpu integer memory used per cpu
partition string partition to use on cluster
time string max execution time
threads integer number of threads to be available