QC¶

See the qc hydra-genetics module documentation for more details on the software used for quality control. Default hydra-genetics settings and resources are used if no configuration is specified.

Pipeline output files:¶

results/rna/qc/multiqc_RNA.html
results/rna/{sample}_{type}/qc/{sample}_{type}.house_keeping_gene_coverage.tsv

MultiQC¶

A MultiQC HTML report is generated using MultiQC v1.11. The report starts with a general statistics table showing the most important QC values, followed by additional QC data and diagrams. The QC data is generated using FastQC, mosdepth, samtools, and Picard.

Configuration¶

Software settings

multiqc: reports: RNA: qc_files - list of input files passed to MultiQC, set in the config file

# config.yaml
multiqc:
  container: "docker://hydragenetics/multiqc:1.11"
  reports:
    RNA:
      config: "config/multiqc_config_rna.yaml"
      included_unit_types: ["R"]
      qc_files:
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
        - "qc/mosdepth/{sample}_{type}.mosdepth.global.dist.txt"
        - "qc/mosdepth/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/mosdepth/{sample}_{type}.regions.bed.gz"
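
The wildcards in the qc_files patterns above ({sample}, {type}, and so on) are filled in per sample and unit. A minimal sketch of that expansion follows. It assumes plain Python string formatting rather than the pipeline's own Snakemake machinery, and the wildcard values and the helper name expand_patterns are made up for illustration.

```python
# Hypothetical wildcard values; real ones come from the pipeline's sample and unit tables.
unit = {
    "sample": "S1",
    "type": "R",
    "flowcell": "FC1",
    "lane": "L001",
    "barcode": "ACGT",
}

# Two of the patterns from the qc_files list above.
qc_files = [
    "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip",
    "qc/mosdepth/{sample}_{type}.mosdepth.global.dist.txt",
]


def expand_patterns(patterns, values):
    """Fill every wildcard in each pattern; unused values are ignored."""
    return [pattern.format(**values) for pattern in patterns]


expanded = expand_patterns(qc_files, unit)
for path in expanded:
    print(path)
```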

FastQC¶

FastQC v0.11.9 is run on the raw FASTQ files.

Configuration¶

Cluster resources

Options	Value
mem_mb	12288
mem_per_cpu	6144
threads	2

Samtools¶

Samtools stats v1.15 is run on the STAR-aligned BAM files generated by STAR-Fusion.

Picard¶

Picard v2.25.0 is run on the STAR-aligned BAM files (from STAR-Fusion) to collect the following metrics:

picard CollectAlignmentSummaryMetrics (using a fasta reference genome file)
picard CollectHsMetrics (using a fasta reference genome file, a design bed file, and with the option COVERAGE_CAP=5000)

Mosdepth¶

Mosdepth v0.3.2 is run on the STAR-aligned BAM files (from STAR-Fusion) together with a design BED file to gather coverage statistics both globally and per region.

Configuration¶

References

RNA design bed

Software settings

Options	Value	Description
extra	--no-per-base	Do not output per-base coverage
extra	--fast-mode	Fast coverage calculations
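
As a rough illustration of how the settings above might translate into a mosdepth command line, the sketch below assembles the argument list in Python. The helper name mosdepth_command, the file names, and passing the design bed through --by are assumptions; the actual rule in the hydra-genetics qc module may be written differently.

```python
import shlex


def mosdepth_command(bam, design_bed, prefix, extra=("--no-per-base", "--fast-mode")):
    """Build a mosdepth argument list: options first, then output prefix and BAM."""
    return ["mosdepth", *extra, "--by", design_bed, prefix, bam]


# Hypothetical file names, for illustration only.
cmd = mosdepth_command("S1_R.bam", "design.bed", "qc/mosdepth/S1_R")
print(shlex.join(cmd))
```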

House keeping gene coverage¶

House keeping gene coverage is reported by the in-house script house_keeping_gene_coverage.py (rule), which uses samtools depth to calculate the coverage. The RNA design BED file defines the regions in which coverage is calculated. The house keeping genes are listed below:

House keeping genes:
- GAPDH, GUSB, OAZ1, POLR2A

Configuration¶

Software settings for samtools (hard coded)

Options	Value	Description
-d	5000000	Max read depth
-a		Report all positions, even those with 0 coverage (for correct average calculations)
-r	{region}	Gene exon region to analyse

References

RNA design bed
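
The sketch below illustrates why the -a option matters for the average: positions with zero coverage must be present in the samtools depth output for the mean to be correct. It is not the code of house_keeping_gene_coverage.py. The function names, the BAM file name, the region string, and the example depth values are all assumptions, and only the samtools options come from the table above.

```python
def samtools_depth_command(bam, region):
    """Build the samtools depth call with the hard-coded options listed above."""
    return ["samtools", "depth", "-d", "5000000", "-a", "-r", region, bam]


def mean_depth(depth_output):
    """Average the third column (depth) of tab-separated samtools depth output."""
    depths = [
        int(line.split("\t")[2])
        for line in depth_output.splitlines()
        if line.strip()
    ]
    return sum(depths) / len(depths) if depths else 0.0


# Made-up output for three positions, one of them with zero coverage.
example_output = "chr12\t100\t10\nchr12\t101\t0\nchr12\t102\t20\n"

cmd = samtools_depth_command("S1_R.bam", "chr12:100-102")
average = mean_depth(example_output)
print(" ".join(cmd))
print(average)
```

Without -a the zero-coverage position would be missing from the output, and the same data would average to 15 instead of 10.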

Result file¶

results/rna/qc/{sample}_{type}.house_keeping_gene_coverage.tsv

ID-SNP calling¶

The RNA design includes a number of probes covering SNPs that can be used to check that the RNA sample is the same as the DNA sample and thereby detect sample swaps. The ID SNPs are called using bcftools mpileup and bcftools call v1.15, resulting in a VCF file.

Configuration¶

References

ID SNP bed file: ID_SNPs.bed
RNA fasta reference genome from Star-Fusion: GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa - (see references)

Software settings for bcftools mpileup (hard coded)

Option	Description
-O u	Output uncompressed BCF file (piped to bcftools call)
-d 1000000	max read depth to consider

Software settings for bcftools call (hard coded)

Option	Description
--skip-variants indels	only look at SNPs
-m	multiallelic caller
-O v	Output uncompressed VCF file
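
The two sets of options above can be read as one piped command. The sketch below assembles such a pipeline as a shell string. The options from the tables are used as listed, while the function name, the file names, and the use of -R (regions file), -f (reference FASTA) and -o (output file) are assumptions about how the references are passed, not taken from the pipeline's rule.

```python
import shlex


def id_snp_pipeline(bam, reference_fasta, snp_bed, out_vcf):
    """Return a 'bcftools mpileup | bcftools call' pipeline as a shell string."""
    mpileup = [
        "bcftools", "mpileup",
        "-O", "u",          # uncompressed BCF, piped to bcftools call
        "-d", "1000000",    # max read depth to consider
        "-R", snp_bed,      # assumption: restrict to the ID SNP regions
        "-f", reference_fasta,
        bam,
    ]
    call = [
        "bcftools", "call",
        "--skip-variants", "indels",  # only look at SNPs
        "-m",                         # multiallelic caller
        "-O", "v",                    # uncompressed VCF
        "-o", out_vcf,                # assumption: write to a named file
    ]
    return f"{shlex.join(mpileup)} | {shlex.join(call)}"


# Hypothetical input and output names, for illustration only.
pipeline = id_snp_pipeline("S1_R.bam", "ref_genome.fa", "ID_SNPs.bed", "S1_R.id_snps.vcf")
print(pipeline)
```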

Result file¶

results/rna/{sample}_{type}/id_snps/{sample}_{type}.id_snps.vcf

Sample mixup check¶

The sample mixup check compares the ID-SNPs in the RNA sample to the DNA sample in the same analysis and reports sample similarities to be able to discern sample mixups. The check is performed by the in-house script sample_mixup_check.py (rule and config).
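
One simple way to express such a similarity is the fraction of shared ID-SNP sites with identical genotypes. The sketch below shows that idea only. It is an assumption about the general approach, not the logic of sample_mixup_check.py, and the function name, site coordinates and genotypes are invented for illustration.

```python
def genotype_concordance(rna_calls, dna_calls):
    """Fraction of sites called in both samples that have the same genotype.

    Each argument maps (chromosome, position) to a genotype string.
    Returns None when the samples share no sites.
    """
    shared = [site for site in rna_calls if site in dna_calls]
    if not shared:
        return None
    matches = sum(1 for site in shared if rna_calls[site] == dna_calls[site])
    return matches / len(shared)


# Made-up genotypes for a handful of ID-SNP sites.
rna = {("chr1", 100): "0/1", ("chr2", 200): "1/1", ("chr3", 300): "0/0"}
dna = {("chr1", 100): "0/1", ("chr2", 200): "0/1", ("chr3", 300): "0/0", ("chr4", 400): "0/1"}

concordance = genotype_concordance(rna, dna)
print(concordance)
```

A value near 1 suggests the two samples come from the same individual, while a clearly lower value would point towards a possible mixup. Any cut-off would need to be chosen from real data.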

Somalier Best Match Report¶

The pipeline includes a Somalier-based relatedness check that identifies the best genetic match for each sample. Specifically, it reports the best matching RNA sample for each DNA sample, and vice versa. This is used to verify sample identity and detect potential mixups. The report is delivered to results/qc/sample_mixup_check_somalier.tsv.
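
The best-match idea can be sketched as picking, for every sample, the partner with the highest pairwise relatedness. The example below works on in-memory tuples with invented sample names and relatedness values. The function name best_matches and the "_T" suffix for DNA samples are assumptions, and the code does not reproduce the pipeline's actual report script or file format.

```python
def best_matches(pairs):
    """Map each sample to its (best partner, relatedness) over all given pairs."""
    best = {}
    for sample_a, sample_b, relatedness in pairs:
        # Each pair is considered from both samples' point of view.
        for sample, other in ((sample_a, sample_b), (sample_b, sample_a)):
            if sample not in best or relatedness > best[sample][1]:
                best[sample] = (other, relatedness)
    return best


# Invented DNA ("_T") versus RNA ("_R") pairs with made-up relatedness values.
pairs = [
    ("S1_T", "S1_R", 0.98),
    ("S1_T", "S2_R", 0.02),
    ("S2_T", "S2_R", 0.95),
    ("S2_T", "S1_R", 0.01),
]

matches = best_matches(pairs)
for sample, (partner, relatedness) in sorted(matches.items()):
    print(sample, partner, relatedness)
```

A mixup would show up as a sample whose best partner carries a different sample name.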

Result file¶

results/sample_mixup_check.tsv