References, panel of normals and design files

Easy setup

Download data

Use hydra-genetics to set up the reference files. Remember to update config/config.data.hg19.yaml and to include it when running an analysis.

# Make sure hydra-genetics is available.
# Make sure that TMPDIR points to a location with plenty of storage;
# it is needed while fetching the reference data.
export TMPDIR=/PATH_TO_STORAGE

# NextSeq
hydra-genetics --debug references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml

# NovaSeq (not all files are prepared for NovaSeq)
hydra-genetics references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/novaseq.hg19.pon.yaml -v config/references/references.hg19.yaml
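
Because the download needs a lot of temporary storage, it can help to check TMPDIR before starting. The following is a minimal sketch, not part of hydra-genetics: the helper name has_free_space and the 100 GB figure in the usage comment are assumptions made for illustration.

```shell
# Hypothetical helper: succeed only if DIR exists and has at least MIN_GB gigabytes free.
has_free_space() {
    dir="$1"
    min_gb="$2"
    [ -d "$dir" ] || return 1
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Assumed usage (100 GB is a placeholder, adjust it to your setup):
# has_free_space "$TMPDIR" 100 || echo "TMPDIR is missing or too small" >&2
```

Running the check before the download command avoids a failure halfway through a long transfer.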

Validate if data requires update

To check that all design and reference files are up to date, run the following command. It assumes that the files are stored under the same parent folder.

# This makes sure that all design and reference files exist and have not changed.
# Warnings about a possible file PATH/hydra-genetics and about missing tbi files in the config can be ignored.
hydra-genetics --debug references validate -c config/config.yaml -c config/config.data.hg19.yaml -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml -p ${PATH_TO_design_and_ref_files}

References overview

The following reference files, panels of normals and design files are needed to run the Twist Solid pipeline:

| Rule | Config name | File |
|---|---|---|
| reference | background | background_panel_nextseq_noUmea_27_dp500_af015.tsv |
| | artifacts | artifact_panel_nextseq_36.tsv |
| | fasta | hg19.with.mt.fasta |
| | fasta_rna | GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa |
| | dict | hg19.with.mt.dict |
| | fai | hg19.with.mt.fai |
| | design_bed | pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.reannotated.230222.bed |
| | design_intervals | pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.MUC6_31_rm.exon_only.reannotated.210608.interval_list |
| | design_intervals_gatk_cnv | pool1_pool2_nochr_3c.sort.merged.padded20.cnv400.hg19.210311.met.annotated.bed.preprocessed.interval_list |
| | design_bed_rna | Twist_RNA_Design5.annotated.bed |
| | design_intervals_rna | Twist_RNA_Design5.annotated.20230630.interval_list |
| arriba | assembly | hg19.with.mt.fasta |
| | blacklist | arriba/arriba_v2.3.0/database/blacklist_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz |
| | gtf | hg19.refGene.gtf |
| | extra | arriba/arriba_v2.3.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3 |
| | extra | arriba/arriba_v2.3.0/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz |
| arriba_draw_fusion | cytobands | arriba/arriba_v2.3.0/database/cytobands_hg19_hs37d5_GRCh37_v2.3.0.tsv |
| | gtf | hg19.refGene.gtf |
| | protein_domains | arriba/arriba_v2.3.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3 |
| annotate_cnv | cnv_amp_genes | cnv_amp_genes_240307.bed |
| | cnv_loh_genes | cnv_loh_genes.bed |
| bcftools_annotate | annotation_db | small_exac_common_3.hg19.vcf.gz |
| bcftools_filter_include_region | exon | pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.MUC6_31_rm.exon_only.reannotated.210608.bed |
| bcftools_filter_exclude_region | blacklist | cnvkit_germline_blacklist_20221221.bed |
| bcftools_id_snps | snps_bed | ID_SNPs.bed |
| bwa_mem | amb | hg19.with.mt.amb |
| | ann | hg19.with.mt.ann |
| | bwt | hg19.with.mt.bwt |
| | pac | hg19.with.mt.pac |
| | sa | hg19.with.mt.sa |
| call_small_cnv_deletions | regions_file | cnv_deletion_genes_240618.tsv |
| call_small_cnv_amplifications | regions_file | cnv_amplification_genes_240307.tsv |
| cnvkit_batch | normal_reference | cnvkit_nextseq_36.cnn |
| cnvkit_batch_hrd | normal_reference_hrd | cnvkit_nextseq_27_HRD.cnn |
| exon_skipping | design_bed | Twist_RNA_Design5.annotated.bed |
| fusioncatcher | genome_path | human_v102/ |
| gatk_collect_allelic_counts | SNP_interval | gnomad_SNP_0.001_target.annotated.interval_list |
| gatk_denoise_read_counts | normal_reference | gatk_cnv_nextseq_36.hdf5 |
| gatk_get_pileup_summaries | sites | small_exac_common_3.hg19.vcf.gz |
| | variants | small_exac_common_3.hg19.vcf.gz |
| gene_fuse | genes | GMS560_fusion_w_pool2.hg19.221117.csv |
| | fasta | hg19.with.mt.fasta |
| FuSeq_WES | transcript annotation | UCSC_hg19_wes_contigSize3000_bigLen130000_r100.json |
| | transcript database | UCSC_hg19_wes_contigSize3000_bigLen130000_r100.sqlite |
| | fusion database | Mitelman_fusiondb.RData |
| | paralog database | ensmbl_paralogs_grch37.RData |
| filter_report_fuseq_wes | transcript annotation | hg19.refGene.gtf |
| | gene white list | fuseq_wes_gene_white_list.txt |
| | fusion gene pair black list | false_positive_fusion_pairs.txt |
| | transcript black list | fuseq_wes_transcript_black_list.txt |
| hotspot_annotation | hotspots | Hotspots_combined_regions_nodups.csv |
| hotspot_report | hotspot_mutations | Hotspots_combined_regions_nodups.csv |
| jumble_run | normal_reference | jumble.combined.filtered.50.PoN.hg19.RDS |
| manta_config_t | extra | pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.210608.bed.gz |
| msisensor_pro | PoN | Msisensor_pro_reference_nextseq_36.list_baseline |
| purecn | extra | mapping_bias_nextseq_27_hg19.rds |
| | normaldb | normalDB_nextseq_27_hg19.rds |
| | intervals | targets_twist-gms-st_hg19_25000_intervals.txt |
| purecn_coverage | intervals | targets_twist-gms-st_hg19_25000_intervals.txt |
| report_fusions | annotation_bed | Twist_RNA_fusionpartners.bed |
| report_gene_fuse | filter_fusions | filter_fusions_20230214.csv |
| star | genome_index | v2.7.10a_hg19/ |
| | extra | hg19.refGene.gtf |
| star_fusion | genome_path | GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ |
| svdb_query | db_string | all_TN_292_svdb_0.8_20220505.vcf |
| vep | vep_cache | VEP/ |

Downloadable reference files

Fasta reference genome

hg19 with mitochondria but without HLA and without decoys

wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*'
gunzip *.fa.gz
cat *.fa > hg19.with.mt.fasta

BWA indexes

bwa index hg19.with.mt.fasta
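
The bwa_mem rule expects five index files (amb, ann, bwt, pac and sa). A small sketch for confirming that indexing finished is shown here; the helper name missing_bwa_index_files is hypothetical, and the prefix argument is whatever prefix your index files carry (bwa index uses the FASTA file name as prefix unless told otherwise).

```shell
# Hypothetical helper: print every BWA index file that is missing or empty for PREFIX.
# Prints nothing when all five files are present.
missing_bwa_index_files() {
    prefix="$1"
    for ext in amb ann bwt pac sa; do
        [ -s "$prefix.$ext" ] || echo "$prefix.$ext"
    done
}

# Assumed usage:
# missing_bwa_index_files hg19.with.mt.fasta
```

An empty output means that all five files exist and are non-empty; it does not verify their contents.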

VEP v105

wget http://ftp.ensembl.org/pub/grch37/release-105/variation/vep/homo_sapiens_refseq_vep_105_GRCh37.tar.gz
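
The command above only fetches the archive. A minimal sketch for unpacking it follows; the helper name extract_into is hypothetical, and using VEP as the target directory is an assumption based on the vep_cache entry in the overview table.

```shell
# Hypothetical helper: unpack a gzipped tar ARCHIVE into directory DEST,
# creating DEST first if it does not exist.
extract_into() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Assumed usage:
# extract_into homo_sapiens_refseq_vep_105_GRCh37.tar.gz VEP
```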

GNOMAD common SNPs

Used by GATK pileup (CNV calling) and GATK contamination

gsutil cp gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf* .

Adapt the file to hg19 by adding the chr prefix to the contig name on every record line and in the header.
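
One possible way to do the renaming is sketched here with awk. The function name add_chr_prefix is hypothetical. The sketch assumes that the header declares contigs as ##contig=<ID=...> lines and that a plain chr prefix is enough; it does not rename MT to chrM, so check whether your file contains mitochondrial records.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and write it to stdout
# with "chr" prepended to contig names in the header and in the CHROM column.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=chr"); print; next }
         /^#/             { print; next }       # other header lines stay unchanged
                          { $1 = "chr" $1; print }'
}

# Assumed usage (requires bgzip and tabix from htslib):
# gunzip -c small_exac_common_3.vcf.gz | add_chr_prefix | bgzip -c > small_exac_common_3.hg19.vcf.gz
# tabix -p vcf small_exac_common_3.hg19.vcf.gz
```

The output name in the usage comment matches the file name listed in the overview table.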

Mappability file

Used in CNVkit PoN creation

wget https://raw.githubusercontent.com/etal/cnvkit/master/data/access-5k-mappable.hg19.bed

Arriba v2.3.0

singularity exec -B /path/to/references:/references docker://hydragenetics/arriba:2.3.0 download_references.sh hs37d5+RefSeq
wget https://github.com/suhrig/arriba/releases/download/v2.3.0/arriba_v2.3.0.tar.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz

Star genome index

singularity exec docker://hydragenetics/star:2.7.10a STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles Human_genome.fasta

FusionCatcher v102

wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.aa
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ab
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ac
wget http://sourceforge.net/projects/fusioncatcher/files/data/human_v102.tar.gz.ad
cat human_v102.tar.gz.* | tar xz
ln -s human_v102 current

Star-fusion

wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz

Panel of normals

Files needed by Twist Solid that are generated by the reference pipeline are listed below. Many of the panels of normals are available for download from the Uppsala Owncloud solution, but they should preferably be generated from in-house data.

Result files

  • result/cnvkit.PoN.cnn
  • result/gatk_cnv_panel_of_normal.hdf5
  • result/design.preprocessed.interval_list
  • result/jumble.PoN.RDS
  • result/Msisensor_pro_reference.list_baseline
  • result/background_panel.tsv
  • result/artifact_panel.tsv
  • result/svdb_cnv.vcf
  • result/mapping_bias.rds
  • result/purecn_normal_db.rds
  • result/purecn_targets_intervals.txt

Create samples and units

Files and samples used in the generation of the panel of normals are specified in samples.tsv and units.tsv. Required files are listed below. Adapt the output file specification (workflow/rules/common_references.smk) and comment out files that should not be generated.

Run command

Run the pipeline in the same way as the standard pipeline, but with a reference-specific profile and Snakefile:

snakemake --profile profiles/uppsala_ref/ -s workflow/Snakefile_references.smk


Note
The units.tsv file needs to be adapted depending on which panels of normals are created, and it should contain all the samples needed to create them.
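
A quick consistency check between the two files can catch samples that were left out of units.tsv. The sketch below is not part of the pipeline: the helper name samples_without_units is hypothetical, and it assumes that both files are tab-separated, start with a header line and keep the sample name in the first column.

```shell
# Hypothetical helper: print every sample in SAMPLES_TSV that has no row in UNITS_TSV.
# Usage: samples_without_units SAMPLES_TSV UNITS_TSV
samples_without_units() {
    awk -F '\t' '
        FNR == 1              { next }               # skip the header of each file
        FILENAME == ARGV[1]   { unit[$1] = 1; next } # first file read: units.tsv
        !($1 in unit)         { print $1 }           # second file read: samples.tsv
    ' "$2" "$1"
}

# Assumed usage:
# samples_without_units samples.tsv units.tsv
```

No output means that every sample has at least one unit row.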

CNVkit

Reference files

  • design bedfile
  • fasta genome reference
  • mappability file

GATK CNV

Reference files

  • design bedfile
  • fasta reference genome
  • fasta reference genome dictionary

MSISensor-pro

Reference files

  • design bedfile
  • fasta reference genome

Software settings

| Options | Value | Description |
|---|---|---|
| extra | -c 50 | Minimal coverage; recommended for WES: 20, WGS: 15 |

SVDB

Should be made up of both normal and tumor FFPE samples!

Software settings

| Options | Value | Description |
|---|---|---|
| extra | --overlap 0.8 | Overlap used to cluster variants (default 0.8) |

Artifacts

Based on unfiltered and merged vcf files from normal FFPE samples

Background

Based on genome vcf files from Mutect2 from normal FFPE samples

Software settings

| Options | Value | Description |
|---|---|---|
| min_dp | 500 | Min read depth to be included (default: 500) |
| max_af | 0.015 | Max allele frequency to be included (default: 0.015) |
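
To make the effect of the two thresholds concrete, here is a minimal sketch of the filtering idea only. It is not the pipeline's own script: the function name, the four-column input layout (chromosome, position, depth, allele frequency) and the inclusive handling of the boundary values are all assumptions made for illustration.

```shell
# Hypothetical helper: keep only observations with depth >= MIN_DP and
# allele frequency <= MAX_AF. Reads tab-separated lines on stdin:
#   chrom <TAB> pos <TAB> depth <TAB> allele_frequency
filter_background_observations() {
    awk -F '\t' -v min_dp="${1:-500}" -v max_af="${2:-0.015}" \
        '$3 >= min_dp && $4 <= max_af'
}

# Assumed usage with the values from the settings table:
# filter_background_observations 500 0.015 < observations.tsv
```

Positions with low depth or a high allele frequency in a normal sample are left out, so they do not contribute to the background estimate.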

PureCN

Made up of bam and vcf files from normal samples

Pipeline specific files

Premade panels of normals, design files and references can be downloaded using the hydra-genetics tools. Design files can also be downloaded from our GitHub repo. Check the config yaml files in the Twist_Solid repo for the latest files.