QC¶
See the qc hydra-genetics module documentation for more details on the software used for quality control. Default hydra-genetics settings/resources are used if no configuration is specified.
Pipeline output files:¶
- results/rna/qc/multiqc_RNA.html
- results/rna/{sample}_{type}/qc/{sample}_{type}.house_keeping_gene_coverage.tsv
MultiQC¶
A MultiQC HTML report is generated using MultiQC v1.11. The report starts with a general statistics table showing the most important QC values, followed by additional QC data and diagrams. The QC data is generated using FastQC, mosdepth, samtools, and Picard.
Configuration¶
Software settings
multiqc: reports: RNA: qc_files - configuration of the MultiQC input files in the config file
# config.yaml
multiqc:
  container: "docker://hydragenetics/multiqc:1.11"
  reports:
    RNA:
      config: "config/multiqc_config_rna.yaml"
      included_unit_types: ["R"]
      qc_files:
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
        - "qc/mosdepth/{sample}_{type}.mosdepth.global.dist.txt"
        - "qc/mosdepth/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/mosdepth/{sample}_{type}.regions.bed.gz"
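As an illustration of how the wildcard patterns under qc_files resolve to concrete file paths, the following is a minimal Python sketch. It is not the pipeline's own code (the pipeline expands the patterns itself); the function name `expand_qc_files` and the unit values are hypothetical and only serve the example.

```python
# Illustrative sketch only: mimics how the qc_files wildcard patterns
# could be filled in per sequencing unit. Not taken from the pipeline.

QC_FILE_PATTERNS = [
    "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip",
    "qc/mosdepth/{sample}_{type}.mosdepth.global.dist.txt",
]


def expand_qc_files(patterns, units):
    """Fill each pattern with every unit's wildcard values, skipping duplicates."""
    paths = []
    for pattern in patterns:
        for unit in units:
            # Keys that a pattern does not use are simply ignored by str.format.
            path = pattern.format(**unit)
            if path not in paths:
                paths.append(path)
    return paths


# Hypothetical units: one RNA sample ("R") sequenced on two lanes.
units = [
    {"sample": "S1", "type": "R", "flowcell": "FC1", "lane": "L001", "barcode": "ACGT"},
    {"sample": "S1", "type": "R", "flowcell": "FC1", "lane": "L002", "barcode": "ACGT"},
]

paths = expand_qc_files(QC_FILE_PATTERNS, units)
for path in paths:
    print(path)
```

Per-lane patterns (FastQC) yield one file per unit, while per-sample patterns (mosdepth) collapse to a single file.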
FastQC¶
FastQC v0.11.9 is run on the raw FASTQ files.
Configuration¶
Cluster resources
| Options | Value |
|---|---|
| mem_mb | 12288 |
| mem_per_cpu | 6144 |
| threads | 2 |
Samtools¶
Samtools stats v1.15 is run on the STAR-aligned BAM files generated by STAR-Fusion.
Picard¶
Picard v2.25.0 is run on the STAR-aligned BAM files (from STAR-Fusion) to collect a number of metrics. The metrics calculated are listed below:
- picard CollectAlignmentSummaryMetrics (using a fasta reference genome file)
- picard CollectHsMetrics (using a fasta reference genome file, a design bed file, and with the option COVERAGE_CAP=5000)
Mosdepth¶
Mosdepth v0.3.2 is run on the STAR-aligned BAM files (from STAR-Fusion) together with a design bed file to gather coverage statistics both globally and per region.
Configuration¶
References
Software settings
| Options | Value | Description |
|---|---|---|
| extra | --no-per-base | Do not output per-base coverage |
| extra | --fast-mode | Fast coverage calculations |
House keeping gene coverage¶
House keeping gene coverage is reported by the in-house script house_keeping_gene_coverage.py (rule), which in turn uses Samtools depth to calculate the coverage. The RNA design bed file defines the regions in which the coverage is calculated. The house keeping genes are listed below:
- House keeping genes:
  - GAPDH, GUSB, OAZ1, POLR2A
Configuration¶
Software settings for samtools (hard coded)
| Options | Value | Description |
|---|---|---|
| -d | 5000000 | Max read depth |
| -a | | Report all positions, even those with 0 coverage (for correct average calculations) |
| -r | {region} | Gene exon region to analyse |
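The effect of the -a option on the average can be shown with a small Python sketch. It assumes the standard three-column, tab-separated output of samtools depth for a single BAM file (chromosome, position, depth). The function name `mean_depth` and the example values are hypothetical, and this is not the code of house_keeping_gene_coverage.py.

```python
# Sketch under stated assumptions: averages the depth column of
# `samtools depth` output (chrom <TAB> pos <TAB> depth).


def mean_depth(depth_output: str) -> float:
    """Return the mean of the depth column, or 0.0 if there are no rows."""
    depths = []
    for line in depth_output.splitlines():
        if not line.strip():
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    return sum(depths) / len(depths) if depths else 0.0


# Made-up example: four positions, two of them without coverage.
with_a = "chr12\t100\t10\nchr12\t101\t0\nchr12\t102\t20\nchr12\t103\t0\n"

# Without -a, zero-coverage positions would be missing from the output.
without_a = "\n".join(
    line for line in with_a.splitlines() if int(line.split("\t")[2]) > 0
)

mean_with_zeros = mean_depth(with_a)        # 30 / 4 positions
mean_without_zeros = mean_depth(without_a)  # 30 / 2 positions
print(mean_with_zeros, mean_without_zeros)
```

Dropping the uncovered positions doubles the apparent average in this example, which is why all positions need to be reported for a correct per-gene mean.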
References
Result file¶
results/rna/qc/{sample}_{type}.house_keeping_gene_coverage.tsv
ID-SNP calling¶
The RNA design includes a number of probes covering SNPs that can be used to check that the RNA sample is the same as the DNA sample and thereby avoid sample swaps. The ID SNPs are called using bcftools mpileup and bcftools call v1.15, resulting in a VCF file.
Configuration¶
References
- ID SNP bed file: ID_SNPs.bed
- RNA fasta reference genome from Star-Fusion: GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa (see references)
Software settings for bcftools mpileup (hard coded)
| Option | Description |
|---|---|
| -O u | Output uncompressed BCF file (piped to bcftools call) |
| -d 1000000 | Max read depth to consider |
Software settings for bcftools call (hard coded)
| Option | Description |
|---|---|
| --skip-variants indels | Only look at SNPs |
| -m | Multiallelic caller |
| -O v | Output uncompressed VCF file |
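The two hard-coded option sets above combine into a single piped command. The Python sketch below only assembles and prints that command line; it does not run bcftools. The options from the tables are used as listed, while passing the reference with -f and the ID SNP bed file with -R, as well as the function name `id_snp_commands` and the BAM file name, are assumptions made for illustration and may differ from the actual rule.

```python
import shlex


def id_snp_commands(bam, ref_fasta, snp_bed):
    """Build argument lists for the mpileup and call steps (not executed here)."""
    mpileup = [
        "bcftools", "mpileup",
        "-O", "u",          # uncompressed BCF, piped to bcftools call
        "-d", "1000000",    # max read depth to consider
        "-f", ref_fasta,    # assumption: reference passed with -f
        "-R", snp_bed,      # assumption: ID SNP regions passed with -R
        bam,
    ]
    call = [
        "bcftools", "call",
        "--skip-variants", "indels",  # only look at SNPs
        "-m",                         # multiallelic caller
        "-O", "v",                    # uncompressed VCF
    ]
    return mpileup, call


# Hypothetical input file names.
mpileup, call = id_snp_commands("sample_R.bam", "ref_genome.fa", "ID_SNPs.bed")
pipeline = shlex.join(mpileup) + " | " + shlex.join(call)
print(pipeline)
```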
Result file¶
results/rna/{sample}_{type}/id_snps/{sample}_{type}.id_snps.vcf
Sample mixup check¶
The sample mixup check compares the ID-SNPs in the RNA sample to those in the DNA sample from the same analysis and reports sample similarities so that sample mixups can be detected. The check is performed by the in-house script sample_mixup_check.py (rule and config).
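The idea of comparing ID-SNP genotypes between an RNA and a DNA sample can be sketched in Python as follows. This is a simplified illustration and not the logic of sample_mixup_check.py: the function names `read_genotypes` and `concordance`, the similarity measure (fraction of identical genotypes at shared positions), and the VCF records are all hypothetical.

```python
# Hypothetical sketch: genotype concordance between two single-sample VCFs.


def read_genotypes(vcf_text):
    """Map (chrom, pos) to a sorted allele tuple taken from the GT field."""
    genotypes = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, fmt, sample = fields[0], fields[1], fields[8], fields[9]
        gt = sample.split(":")[fmt.split(":").index("GT")]
        # Treat phased and unphased genotypes alike and ignore allele order.
        genotypes[(chrom, pos)] = tuple(sorted(gt.replace("|", "/").split("/")))
    return genotypes


def concordance(rna, dna):
    """Fraction of shared positions with identical genotypes (None if no overlap)."""
    shared = rna.keys() & dna.keys()
    if not shared:
        return None
    matches = sum(rna[site] == dna[site] for site in shared)
    return matches / len(shared)


def record(chrom, pos, gt):
    """Build a minimal made-up VCF data line with a single sample column."""
    return "\t".join([chrom, str(pos), ".", "A", "G", "50", ".", ".", "GT:DP", gt + ":30"])


rna_vcf = "\n".join([record("chr1", 100, "0/1"), record("chr1", 200, "1/1"),
                     record("chr2", 300, "0/0")])
dna_vcf = "\n".join([record("chr1", 100, "1/0"), record("chr1", 200, "0/1"),
                     record("chr2", 300, "0/0"), record("chr3", 400, "1/1")])

score = concordance(read_genotypes(rna_vcf), read_genotypes(dna_vcf))
print(score)
```

A value close to 1 would suggest that the two samples come from the same individual, whereas a clearly lower value would point to a possible mixup. Any threshold would need to be chosen empirically.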
Somalier Best Match Report¶
The pipeline includes a Somalier-based relatedness check that identifies the best genetic match for each sample. Specifically, it reports the best matching RNA sample for each DNA sample, and vice versa. This is used to verify sample identity and detect potential mixups. The report is delivered to results/qc/sample_mixup_check_somalier.tsv.
Result file¶
results/sample_mixup_check.tsv