References, panel of normals and design files¶

Easy setup¶

Download data¶

Use hydra-genetics to set up the reference files. Remember to update config/config.data.hg19.yaml and to include it when running an analysis.

# make sure hydra-genetics is available
# make sure that TMPDIR points to a location with plenty of storage;
# it is needed when fetching the reference data
export TMPDIR=/PATH_TO_STORAGE
# NextSeq
hydra-genetics --debug references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml

# NovaSeq (not all files are prepared for NovaSeq)
hydra-genetics references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/novaseq.hg19.pon.yaml -v config/references/references.hg19.yaml
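The comment above asks for a TMPDIR with plenty of storage. A minimal pre-flight sketch of such a check follows; the function name `check_tmpdir` and the default threshold of 100 GB are illustrative assumptions, not part of hydra-genetics or a documented requirement.

```shell
# Hypothetical pre-flight check (not part of hydra-genetics).
# Usage: check_tmpdir [DIR] [MIN_FREE_GB]
check_tmpdir() {
    dir="${1:-${TMPDIR:-}}"
    min_gb="${2:-100}"   # assumed threshold, adjust to your own setup
    if [ -z "$dir" ] || [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "TMPDIR is unset, missing or not writable: '$dir'" >&2
        return 1
    fi
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    avail_gb=$((avail_kb / 1024 / 1024))
    if [ "$avail_gb" -lt "$min_gb" ]; then
        echo "Only ${avail_gb} GB free in $dir (wanted at least ${min_gb} GB)" >&2
        return 1
    fi
    echo "OK: ${avail_gb} GB free in $dir"
}
```

Calling `check_tmpdir` without arguments after `export TMPDIR=...` tests the exported location against the assumed default threshold.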

Validate if data requires update¶

To check whether all design and reference files are up to date, run the following command. It assumes that the files are stored in the same parent folder.

# This will make sure that all design and reference files exist and haven't changed
# Warnings for possible file PATH/hydra-genetics and missing tbi files in config can be ignored
hydra-genetics --debug references validate -c config/config.yaml -c config/config.data.hg19.yaml -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml -p ${PATH_TO_design_and_ref_files}

References overview¶

The following reference files, panels of normals and design files are needed to run the Twist Solid pipeline:

Rule	Config name	File
reference	background	`background_panel_nextseq_noUmea_27_dp500_af015.tsv`
	artifacts	`artifact_panel_nextseq_36.tsv`
	fasta	`hg19.with.mt.fasta`
	fasta_rna	`GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa`
	dict	`hg19.with.mt.dict`
	fai	`hg19.with.mt.fai`
	design_bed	`pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.reannotated.230222.bed`
	design_intervals	`pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.MUC6_31_rm.exon_only.reannotated.210608.interval_list`
	design_intervals_gatk_cnv	`pool1_pool2_nochr_3c.sort.merged.padded20.cnv400.hg19.210311.met.annotated.bed.preprocessed.interval_list`
	design_bed_rna	`Twist_RNA_Design5.annotated.bed`
	design_intervals_rna	`Twist_RNA_Design5.annotated.20230630.interval_list`
arriba	assembly	`hg19.with.mt.fasta`
	blacklist	`arriba/arriba_v2.3.0/database/blacklist_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz`
	gtf	`hg19.refGene.gtf`
	extra	`arriba/arriba_v2.3.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3`
	extra	`arriba/arriba_v2.3.0/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz`
arriba_draw_fusion	cytobands	`arriba/arriba_v2.3.0/database/cytobands_hg19_hs37d5_GRCh37_v2.3.0.tsv`
	gtf	`hg19.refGene.gtf`
	protein_domains	`arriba/arriba_v2.3.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3`
annotate_cnv	cnv_amp_genes	`cnv_amp_genes_240307.bed`
annotate_cnv	cnv_loh_genes	`cnv_loh_genes.bed`
bcftools_annotate	annotation_db	`small_exac_common_3.hg19.vcf.gz`
bcftools_filter_include_region	exon	`pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.MUC6_31_rm.exon_only.reannotated.210608.bed`
bcftools_filter_exclude_region	blacklist	`cnvkit_germline_blacklist_20221221.bed`
bcftools_id_snps	snps_bed	`ID_SNPs.bed`
bwa_mem	amb	`hg19.with.mt.amb`
	ann	`hg19.with.mt.ann`
	bwt	`hg19.with.mt.bwt`
	pac	`hg19.with.mt.pac`
	sa	`hg19.with.mt.sa`
call_small_cnv_deletions	regions_file	`cnv_deletion_genes_240618.tsv`
call_small_cnv_amplifications	regions_file	`cnv_amplification_genes_240307.tsv`
cnvkit_batch	normal_reference	`cnvkit_nextseq_36.cnn`
cnvkit_batch_hrd	normal_reference_hrd	`cnvkit_nextseq_27_HRD.cnn`
exon_skipping	design_bed	`Twist_RNA_Design5.annotated.bed`
fusioncatcher	genome_path	`human_v102/`
gatk_collect_allelic_counts	SNP_interval	`gnomad_SNP_0.001_target.annotated.interval_list`
gatk_denoise_read_counts	normal_reference	`gatk_cnv_nextseq_36.hdf5`
gatk_get_pileup_summaries	sites	`small_exac_common_3.hg19.vcf.gz`
gatk_get_pileup_summaries	variants	`small_exac_common_3.hg19.vcf.gz`
gene_fuse	genes	`GMS560_fusion_w_pool2.hg19.221117.csv`
gene_fuse	fasta	`hg19.with.mt.fasta`
FuSeq_WES	transcript annotation	`UCSC_hg19_wes_contigSize3000_bigLen130000_r100.json`
	transcript database	`UCSC_hg19_wes_contigSize3000_bigLen130000_r100.sqlite`
	fusion database	`Mitelman_fusiondb.RData`
	paralog database	`ensmbl_paralogs_grch37.RData`
filter_report_fuseq_wes	transcript annotation	`hg19.refGene.gtf`
	gene white list	`fuseq_wes_gene_white_list.txt`
	fusion gene pair black list	`false_positive_fusion_pairs.txt`
	transcript black list	`fuseq_wes_transcript_black_list.txt`
hotspot_annotation	hotspots	`Hotspots_combined_regions_nodups.csv`
hotspot_report	hotspot_mutations	`Hotspots_combined_regions_nodups.csv`
jumble_run	normal_reference	`jumble.combined.filtered.50.PoN.hg19.RDS`
manta_config_t	extra	`pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.210608.bed.gz`
msisensor_pro	PoN	`Msisensor_pro_reference_nextseq_36.list_baseline`
purecn	extra	`mapping_bias_nextseq_27_hg19.rds`
	normaldb	`normalDB_nextseq_27_hg19.rds`
	intervals	`targets_twist-gms-st_hg19_25000_intervals.txt`
purecn_coverage	intervals	`targets_twist-gms-st_hg19_25000_intervals.txt`
report_fusions	annotation_bed	`Twist_RNA_fusionpartners.bed`
report_gene_fuse	filter_fusions	`filter_fusions_20230214.csv`
star	genome_index	`v2.7.10a_hg19/`
star	extra	`hg19.refGene.gtf`
star_fusion	genome_path	`GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/`
svdb_query	db_string	`all_TN_292_svdb_0.8_20220505.vcf`
vep	vep_cache	`VEP/`

Downloadable reference files¶

Fasta reference genome¶

hg19 with mitochondria but without HLA and without decoys

wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*'
gunzip *.fa.gz
cat *.fa > hg19.with.mt.fasta
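The shell expands `*.fa` in sorted name order and includes every downloaded `.fa` file, so it can be worth inspecting which contigs ended up in the concatenated file and in which order. A small sketch, where the helper name `list_contigs` is a hypothetical choice for illustration:

```shell
# Hypothetical helper: print the contig names of a FASTA file in file order.
list_contigs() {
    # Header lines start with ">"; keep only the first word after it
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Example (assumes the concatenated file is in the current directory):
# list_contigs hg19.with.mt.fasta
```

Comparing this list with the contig names used in the design BED files is a quick way to spot naming mismatches before building indexes.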

BWA indexes¶

bwa index hg19.with.mt.fasta
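`bwa index` writes five index files (`amb`, `ann`, `bwt`, `pac`, `sa`) using the FASTA file name as prefix unless another prefix is given with `-p`. The overview table lists them as `hg19.with.mt.amb` and so on, so check your config for the prefix it expects. The sketch below only verifies that all five files exist for a given prefix; the function name `check_bwa_index` is a hypothetical helper.

```shell
# Hypothetical helper: verify that all five BWA index files exist and are non-empty.
# Usage: check_bwa_index PREFIX   (e.g. hg19.with.mt.fasta or hg19.with.mt)
check_bwa_index() {
    prefix="$1"
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing or empty: ${prefix}.${ext}" >&2
            missing=1
        fi
    done
    # Return 0 when everything is present, 1 otherwise
    return "$missing"
}
```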

VEP v105¶

wget http://ftp.ensembl.org/pub/grch37/release-105/variation/vep/homo_sapiens_refseq_vep_105_GRCh37.tar.gz

GNOMAD common SNPs¶

Used by GATK pileup (CNV calling) and GATK contamination

gsutil cp gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf* .

Adapt the file to hg19 by adding the chr prefix to the contig name on every record line and in the header.
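One possible way to do this conversion is sketched below with awk. This is an assumption about how the adaptation could be done, not the procedure used to create the distributed file. The function name `add_chr_prefix` is hypothetical, and the sketch only prefixes names; it does not rename `MT` to `chrM`.

```shell
# Hypothetical helper: add a "chr" prefix to contig names in an uncompressed VCF stream.
# Reads VCF text on stdin and writes the adapted VCF to stdout.
add_chr_prefix() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
         # contig definitions in the header: ##contig=<ID=1,...> becomes ##contig=<ID=chr1,...>
         /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=chr"); print; next }
         # all other header lines are passed through unchanged
         /^#/ { print; next }
         # record lines: prefix the CHROM column
         { $1 = "chr" $1; print }'
}

# Example (assumes bgzip and tabix are installed; output name taken from the overview table):
# zcat small_exac_common_3.vcf.gz | add_chr_prefix | bgzip > small_exac_common_3.hg19.vcf.gz
# tabix -p vcf small_exac_common_3.hg19.vcf.gz
```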

Mappability file¶

Used in CNVkit PoN creation

wget https://raw.githubusercontent.com/etal/cnvkit/master/data/access-5k-mappable.hg19.bed

Arriba v2.3.0¶

singularity exec -B /path/to/references:/references docker://hydragenetics/arriba:2.3.0 download_references.sh hs37d5+RefSeq
wget https://github.com/suhrig/arriba/releases/download/v2.3.0/arriba_v2.3.0.tar.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz

STAR genome index¶

singularity exec docker://hydragenetics/star:2.7.10a STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles Human_genome.fasta

FusionCatcher v102¶

wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.aa
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ab
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ac
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ad
cat human_v102.tar.gz.* | tar xz
ln -s human_v102 current

STAR-Fusion¶

wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz

Panel of normals¶

Files needed by Twist Solid that are generated by the reference pipeline are listed below. Many of the panels of normals are available for download from the Uppsala ownCloud solution, but they should preferably be generated from in-house data.

Result files

result/cnvkit.PoN.cnn
result/gatk_cnv_panel_of_normal.hdf5
result/design.preprocessed.interval_list
result/jumble.PoN.RDS
result/Msisensor_pro_reference.list_baseline
result/background_panel.tsv
result/artifact_panel.tsv
result/svdb_cnv.vcf
result/mapping_bias.rds
result/purecn_normal_db.rds
result/purecn_targets_intervals.txt
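After a run of the reference pipeline, the list above can be checked mechanically. A minimal sketch, assuming the files are written to a `result` directory as listed; the function name `check_results` is hypothetical. Files that were deliberately commented out in workflow/rules/common_references.smk will be reported as missing, which is expected.

```shell
# Hypothetical helper: print every expected result file that does not exist.
# Usage: check_results [RESULT_DIR]   (defaults to "result")
check_results() {
    result_dir="${1:-result}"
    n_missing=0
    for f in cnvkit.PoN.cnn gatk_cnv_panel_of_normal.hdf5 \
             design.preprocessed.interval_list jumble.PoN.RDS \
             Msisensor_pro_reference.list_baseline background_panel.tsv \
             artifact_panel.tsv svdb_cnv.vcf mapping_bias.rds \
             purecn_normal_db.rds purecn_targets_intervals.txt; do
        if [ ! -e "${result_dir}/${f}" ]; then
            echo "${result_dir}/${f}"
            n_missing=$((n_missing + 1))
        fi
    done
    # Exit status is 0 only when nothing is missing
    [ "$n_missing" -eq 0 ]
}
```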

Create samples and units¶

Files and samples used to generate the panels of normals are specified in samples.tsv and units.tsv. Required files are listed below. Adapt the output file specification (workflow/rules/common_references.smk) and comment out files that should not be generated.

Run command¶

Run the pipeline in the same way as the standard pipeline, using a reference-specific profile and Snakefile:

snakemake --profile profiles/uppsala_ref/ -s workflow/Snakefile_references.smk

Note
The units.tsv file needs to be adapted depending on which panels of normals are created, and it should contain all the samples needed to create them.
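A quick consistency check between the two files can catch samples that are listed in samples.tsv but have no entry in units.tsv. The sketch below assumes that both files are tab-separated, start with a header line, and keep the sample name in the first column; verify this against your own files. The function name `samples_missing_units` is hypothetical.

```shell
# Hypothetical check: print sample names found in samples.tsv but not in units.tsv.
# Assumes tab-separated files with a header line and the sample name in column 1.
# Usage: samples_missing_units [SAMPLES_TSV] [UNITS_TSV]
samples_missing_units() {
    samples="${1:-samples.tsv}"
    units="${2:-units.tsv}"
    awk -F '\t' '
        FNR == 1 { next }                 # skip the header line of each file
        NR == FNR { seen[$1] = 1; next }  # first file (units): remember sample names
        !($1 in seen) { print $1 }        # second file (samples): report unknown names
    ' "$units" "$samples"
}
```

An empty output means that every sample in samples.tsv has at least one unit.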

CNVkit¶

Reference files

design bedfile
fasta genome reference
mappability file

GATK CNV¶

Reference files

design bedfile
fasta reference genome
fasta reference genome dictionary

MSISensor-pro¶

Reference files

design bedfile
fasta reference genome

Software settings

Options	Value	Description
extra	-c 50	Minimum coverage (recommended: 20 for WES, 15 for WGS)

SVDB¶

Should be made up of both normal and tumor FFPE samples!

Software settings

Options	Value	Description
extra	--overlap 0.8	Overlap used to cluster variants (default 0.8)

Artifacts¶

Based on unfiltered and merged vcf files from normal FFPE samples

Background¶

Based on genome vcf files from Mutect2 from normal FFPE samples

Software settings

Options	Value	Description
min_dp	500	Min read depth to be included (default: 500)
max_af	0.015	Max allele frequency to be included (default: 0.015)

PureCN¶

Made up of bam and vcf files from normal samples

Pipeline specific files¶

Premade panels of normals, design files and references can be downloaded using the hydra-genetics tools. Design files can also be downloaded from our GitHub repo. Check the config yaml files in the Twist_Solid repo for the latest files.