Some systems don't allow internet access, which makes it impossible to run a pipeline that depends on resources hosted on the web, such as Docker Hub and GitHub. This is solved by packing the pipeline and all of its dependencies:
- Pipeline code and environment
- Singularity containers
- Reference/design files
Preparations¶
Fetch the pipeline and install requirements
# Set Twist Solid version
TAG_OR_BRANCH="vX.Y.X"
# Clone selected version
git clone --branch ${TAG_OR_BRANCH} https://github.com/genomic-medicine-sweden/Twist_Solid.git
cd Twist_Solid
Environment¶
On a computer or server with internet access, create an environment that can be moved to bianca.
Requires:
- conda
- conda-pack
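Checking that the required tools are on the `PATH` before starting can save a failed build. The helper below is a minimal sketch; `missing_tools` is a hypothetical name and is not part of the pipeline or of hydra-genetics.

```shell
# Hypothetical helper: print every command from the argument list that is
# not found on PATH. Returns non-zero if at least one is missing.
missing_tools() {
    missing_status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing_status=1
        fi
    done
    return "$missing_status"
}
```

For example, `missing_tools git conda conda-pack` prints nothing and returns 0 when all three commands are available.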
# Build a compressed file named Twist_Solid_{TAG_OR_BRANCH}.tar.gz, containing:
# - Twist Solid Pipeline
# - snakemake-wrappers
# - hydra-genetics modules
# - conda env
# - config files
TAG_OR_BRANCH="vX.Y.X" bash build/build_conda.sh
The script build/build_conda.sh performs the following steps:
1. Clones the pipeline repository.
2. Creates a conda environment and installs requirements.
3. Packs the conda environment.
4. Clones snakemake-wrappers and hydra-genetics modules.
5. Downloads configuration files.
6. Updates configuration paths using envsubst.
7. Downloads containers (optional).
8. Packs everything into a tarball Twist_Solid_{TAG_OR_BRANCH}.tar.gz.
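Before transferring the tarball, it can be worth confirming that it holds the parts the server-side steps rely on. The sketch below lists the archive and reports expected entries that are absent. `tarball_missing` is a hypothetical helper, and the entry names you pass to it are assumptions based on the paths used later in this guide, not a documented layout.

```shell
# Hypothetical helper: print each expected entry that does not appear in the
# tarball listing. Returns 1 if something is missing, 2 if tar cannot read
# the archive.
# $1 = tarball, remaining arguments = names expected somewhere in the listing.
tarball_missing() {
    tarball=$1
    shift
    listing=$(tar -tf "$tarball") || return 2
    tarball_status=0
    for entry in "$@"; do
        case "$listing" in
            *"$entry"*) ;;   # found somewhere in a member path
            *) echo "$entry"; tarball_status=1 ;;
        esac
    done
    return "$tarball_status"
}
```

A possible call is `tarball_missing "Twist_Solid_${TAG_OR_BRANCH}.tar.gz" env.tar.gz hydra-genetics snakemake-wrappers Twist_Solid/workflow/Snakefile`. The match is a plain substring test, so it only catches entries that are absent altogether.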
Download containers¶
# NOTE: the singularity command needs to be available for this step
hydra-genetics prepare-environment create-singularity-files -c config/config.yaml -o singularity_cache
Download reference files¶
# NextSeq
hydra-genetics --debug references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml
# NovaSeq (not all files are prepared for NovaSeq)
hydra-genetics references download -o design_and_ref_files -v config/references/design_files.hg19.yaml -v config/references/novaseq.hg19.pon.yaml -v config/references/references.hg19.yaml
# Compress data
tar -czvf design_and_ref_files.tar.gz design_and_ref_files
Files/Folders¶
The following files and folders have been created and need to be moved to your server:
- file: design_and_ref_files.tar.gz
- file: Twist_Solid_{TAG_OR_BRANCH}.tar.gz
- folder: singularity_cache
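Large transfers to an isolated system can be truncated or corrupted without notice, so recording checksums before the move and verifying them afterwards is a cheap safeguard. The following is a minimal sketch, assuming GNU coreutils and findutils are available on both machines; `write_manifest`, `verify_manifest` and the manifest file name are hypothetical.

```shell
# Hypothetical helpers, assuming GNU sha256sum, sort -z and xargs -0 -r.

# Before the transfer: record a SHA-256 checksum for every file in the bundle.
# $1 = manifest to write, remaining arguments = files or folders to include.
write_manifest() {
    manifest=$1
    shift
    find "$@" -type f -print0 | sort -z | xargs -0 -r sha256sum > "$manifest"
}

# On the server: run from the directory that holds the transferred items.
# Prints only the files whose checksum differs; returns non-zero on mismatch.
verify_manifest() {
    sha256sum --check --quiet "$1"
}
```

On the build machine this could be run as `write_manifest bundle.sha256 design_and_ref_files.tar.gz "Twist_Solid_${TAG_OR_BRANCH}.tar.gz" singularity_cache`, with `bundle.sha256` moved alongside the data and checked on the server with `verify_manifest bundle.sha256`.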
On Server¶
Setup environment¶
Unpack environment and activate¶
# Extract tar.
TAG_OR_BRANCH="vX.Y.X"
tar -xvf Twist_Solid_${TAG_OR_BRANCH}.tar.gz
cd Twist_Solid_${TAG_OR_BRANCH}
mkdir venv && tar xvf env.tar.gz -C venv/
source venv/bin/activate
# Variables that will be used later
PATH_TO_ENV=${PWD}
PATH_TO_HYDRA_MODULES=${PWD}/hydra-genetics
PATH_TO_FOLDER_WITH_PIPELINE=${PWD}/Twist_Solid
Decompress reference files¶
tar -xvf design_and_ref_files.tar.gz
Singularities¶
Move the singularity cache to an appropriate location.
Modify config and profile¶
Resource¶
Make sure that config/resource.yaml matches your system setup, e.g.:
- partition
- number of cores
- memory
config.data.hg19.yaml files¶
Point to uploaded reference files
# config/config.data.hg19.yaml
# Update the following lines:
# Adjust config so that all reference files have the {{REFERENCE_DATA}} variable
REFERENCE_DATA: "{EXTRACT_PATH}/design_and_ref_files"
PROJECT_DESIGN_DATA: "{{REFERENCE_DATA}}"
PROJECT_PON_DATA: "{{REFERENCE_DATA}}"
PROJECT_REF_DATA: "{{REFERENCE_DATA}}"
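The `REFERENCE_DATA` line can be edited by hand, but a scripted edit is easier to repeat when a new version is unpacked. The sketch below rewrites only that line with `sed`; `set_reference_data` is a hypothetical helper, and it assumes the key starts at the beginning of a line, as in the snippet above.

```shell
# Hypothetical helper: point REFERENCE_DATA at the extracted reference folder.
# $1 = config file, $2 = absolute path to design_and_ref_files
# Assumes the path contains no '|', '&' or backslash characters, which sed
# would treat specially here. The original file is kept as <file>.bak.
set_reference_data() {
    sed -i.bak "s|^REFERENCE_DATA:.*|REFERENCE_DATA: \"$2\"|" "$1"
}
```

For example, `set_reference_data config/config.data.hg19.yaml "${PWD}/design_and_ref_files"` leaves the `{{REFERENCE_DATA}}` references on the other lines untouched.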
Config.yaml files¶
Set path for hydra-genetics modules
# Update the following line
hydra_local_path: "{PATH_TO_EXTRACTED_ENV}/hydra-genetics"
Add path to local singularities
# config/config.yaml
# Make sure the environment is active
cp config/config.yaml config/config.yaml.copy
hydra-genetics prepare-environment container-path-update -c config/config.yaml.copy -n config/config.yaml -p ${PATH_TO_singularity_cache}
The path to the apptainer cache can also be given once at the top of the config, much like the REFERENCE_DATA variable.
Profile¶
Copy a profile and modify it to match your system, e.g. Twist_Solid_${TAG_OR_BRANCH}/Twist_Solid/profiles/bianca/config.yaml
# Found at Twist_Solid_{TAG_OR_BRANCH}/snakemake-wrappers, use absolute_path with 'git+file:/'
wrapper-prefix: "PATH_TO_WRAPPERS"
# ex: wrapper-prefix: "git+file://proj/sens2022566/nobackup/patriksm/Twist_Solid_add-{TAG_OR_BRANCH}/snakemake-wrappers/"
# Update account info, change ADD_YOUR_ACCOUNT to your bianca project id
drmaa: " -A ADD_YOUR_ACCOUNT -N 1-1 -t {resources.time} -n {resources.threads} --mem={resources.mem_mb} --mem-per-cpu={resources.mem_per_cpu} --partition={resources.partition} -J {rule} -e slurm_out/{rule}_%j.err -o slurm_out/{rule}_%j.out"
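The `drmaa:` entry only works if Snakemake can load a DRMAA library for Slurm. One common way for the Python `drmaa` package to locate it is the `DRMAA_LIBRARY_PATH` environment variable. The check below is a minimal sketch, assuming your site relies on that variable (the library may also be found through the system's normal library search, in which case this check is not needed); `check_drmaa` is a hypothetical helper.

```shell
# Hypothetical helper: confirm that DRMAA_LIBRARY_PATH is set and points to
# an existing file. Assumes the site exposes slurm-drmaa through this variable.
check_drmaa() {
    if [ -z "${DRMAA_LIBRARY_PATH:-}" ]; then
        echo "DRMAA_LIBRARY_PATH is not set" >&2
        return 1
    fi
    if [ ! -f "$DRMAA_LIBRARY_PATH" ]; then
        echo "DRMAA_LIBRARY_PATH does not point to a file: $DRMAA_LIBRARY_PATH" >&2
        return 1
    fi
}
```

Running `check_drmaa` in the shell that will launch Snakemake prints nothing and returns 0 when the variable points to a library file.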
Validate config files¶
# This will make sure that all design and reference files exist and haven't changed
# Warnings for possible file PATH/hydra-genetics and missing tbi files in config can be ignored
hydra-genetics --debug references validate -c config/config.yaml -c config/config.data.hg19.yaml -v config/references/design_files.hg19.yaml -v config/references/nextseq.hg19.pon.yaml -v config/references/references.hg19.yaml -p ${PATH_TO_design_and_ref_files}
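The validation above checks the reference files, but it may not notice placeholders from this guide that were never filled in. A simple text search can catch those. The sketch below is a hypothetical helper, `check_placeholders`, that looks for the placeholder names used on this page; `{{REFERENCE_DATA}}` is deliberately not in the list because it is a legitimate template variable.

```shell
# Hypothetical helper: fail if any placeholder from this guide is still
# present in the given files or folders. Matching lines are printed with
# file name and line number.
check_placeholders() {
    if grep -rnE '\{EXTRACT_PATH\}|\{PATH_TO_EXTRACTED_ENV\}|PATH_TO_WRAPPERS|ADD_YOUR_ACCOUNT' "$@"; then
        echo "unfilled placeholders found" >&2
        return 1
    fi
}
```

A possible call is `check_placeholders config Twist_Solid/profiles/bianca`. Comment lines that mention a placeholder by name will also be reported, so read the output before acting on it.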
Run Pipeline¶
# Create analysis
mkdir analysis
# Enter folder
cd analysis
# Copy config files
cp -r PATH_TO_UPDATED_CONFIGS/config .
# Create samples.tsv and units.tsv
# https://hydra-genetics.readthedocs.io/en/latest/create_sample_files/
# Remember to update the tumor content (TC) value in samples.tsv for DNA samples
hydra-genetics create-input-files -d PATH_TO_FASTQ_FILE -p NovaSeq6000 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
# Make sure slurm-drmaa is available
source ${PATH_TO_ENV}/venv/bin/activate
snakemake -s ${PATH_TO_FOLDER_WITH_PIPELINE}/workflow/Snakefile --profile ${PATH_TO_UPDATED_PROFILE}/Twist_Solid/profiles/bianca
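Since samples.tsv is edited by hand after generation (for the TC value), it can drift out of sync with units.tsv. Before submitting the run, a quick consistency check may help. The sketch below assumes both files are tab-separated, start with a header line, and carry the sample name in the first column; `units_without_sample` is a hypothetical helper and not a hydra-genetics command.

```shell
# Hypothetical helper: print sample names that occur in units.tsv but not in
# samples.tsv. Returns non-zero if any such name is found.
# Assumes tab-separated files, one header line, sample name in column 1.
# $1 = samples.tsv, $2 = units.tsv
units_without_sample() {
    awk -F '\t' '
        FNR == 1 { next }                    # skip the header of each file
        NR == FNR { known[$1] = 1; next }    # first file: remember sample names
        !($1 in known) && !seen[$1]++ { print $1; bad = 1 }
        END { exit bad }
    ' "$1" "$2"
}
```

Running `units_without_sample samples.tsv units.tsv` in the analysis folder prints nothing when every unit refers to a known sample.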