Running the pipeline

Requirements

Recommended hardware

  • CPU: >10 cores per sample
  • Memory: 6GB per core
  • Storage: >75GB per sample

Note: Running the pipeline with fewer resources may work, but this has not been tested.
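As a rough aid for planning a run, the per-sample figures above can be scaled to a batch of samples. The sketch below is only arithmetic on the recommended values; the function name estimate_resources is hypothetical, and because the recommendations are stated as "more than", the results should be read as lower bounds.

```shell
# Scale the recommended hardware figures to a number of samples.
# The function name is illustrative and not part of the pipeline.
estimate_resources() {
    samples="$1"
    cores=$(( samples * 10 ))        # >10 cores per sample
    memory_gb=$(( cores * 6 ))       # 6 GB per core
    storage_gb=$(( samples * 75 ))   # >75 GB per sample
    echo "samples=${samples} cores=${cores} memory_gb=${memory_gb} storage_gb=${storage_gb}"
}

estimate_resources 4
```

For four samples this gives at least 40 cores, 240 GB of memory and 300 GB of storage.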

Software

Nice to have

  • DRMAA compatible scheduler

Installation

A list of releases of the Twist Solid pipeline can be found at: Releases.

Clone the Twist Solid git repo

We recommend that the repository is cloned to your working directory.

# Set up a working directory path
WORKING_DIRECTORY="/path_to_working_directory"

Fetch pipeline

# Set version
VERSION="v0.4.0"

# Clone selected version
git clone --branch "${VERSION}" https://github.com/genomic-medicine-sweden/Twist_Solid.git "${WORKING_DIRECTORY}"

Create python environment

Running the Twist Solid pipeline requires a Python virtual environment.

# Enter working directory
cd ${WORKING_DIRECTORY}

# Create a new virtual environment
python3 -m venv ${WORKING_DIRECTORY}/virtual/environment

Install pipeline requirements

Activate the virtual environment and install pipeline requirements specified in requirements.txt.

# Enter working directory
cd ${WORKING_DIRECTORY}

# Activate python environment
source virtual/environment/bin/activate

# Install requirements
pip install -r requirements.txt
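Before moving on, it can be useful to confirm that the commands used later on this page are reachable from the activated environment. The sketch below assumes that hydra-genetics and snakemake are provided by requirements.txt, which the later steps imply but this page does not state outright. The helper name check_tools is hypothetical.

```shell
# Report which of the given commands can be found on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The two commands used in the following sections
check_tools hydra-genetics snakemake || echo "Activate the virtual environment and re-run the install step"
```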

Setup required data and config

Download data

# Make sure hydra-genetics is available (for example by activating the virtual environment).
# Fetching the reference data needs a lot of temporary storage, so point TMPDIR
# at a location with plenty of free space if necessary:
# export TMPDIR=/PATH_TO_STORAGE
hydra-genetics --debug --verbose references download -o design_and_ref_files \
    -v config/references/references.hg19.yaml \
    -v config/references/design_files.hg19.yaml \
    -v config/references/nextseq.hg19.pon.yaml
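Since the download depends on TMPDIR having a lot of free space, a quick check before starting may save a failed run. This page does not say how much space is needed, so the sketch below only reports the free space and leaves the judgment to the reader. The helper name free_gb is hypothetical, and the parsing assumes the standard POSIX df -P output layout with no spaces in the filesystem name.

```shell
# Print the free space, in whole GB, of the filesystem holding the given path.
free_gb() {
    # df -P prints one line per filesystem; column 4 is the available space in KB
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

echo "Free space in ${TMPDIR:-/tmp}: $(free_gb "${TMPDIR:-/tmp}") GB"
```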

Update config

# file config/config.data.hg19.yaml
# change rows:
PROJECT_DESIGN_DATA: "PATH_TO/design_and_ref_files" # parent folder for GMS560 design, ex GMS560/design
PROJECT_PON_DATA: "PATH_TO/design_and_ref_files" # artifact/background/PoN, ex GMS560/PoN
PROJECT_REF_DATA: "PATH_TO/design_and_ref_files" # parent folder for ref_data, ex ref_data/hg19
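The three rows can be edited by hand, or updated in one step with a small script. The sketch below assumes that each key starts at the beginning of its own line, as in the rows shown above, and that all three should point at the same download folder. The helper name set_data_paths is hypothetical. It rewrites each whole row, so trailing comments on those rows are dropped, and the directory must not contain the characters | or &.

```shell
# Point PROJECT_DESIGN_DATA, PROJECT_PON_DATA and PROJECT_REF_DATA at one directory.
# Writes to a temporary file first so the config is only replaced if sed succeeds.
set_data_paths() {
    config_file="$1"
    data_dir="$2"
    tmp_file="${config_file}.tmp"
    sed -E "s|^(PROJECT_(DESIGN|PON|REF)_DATA:).*|\1 \"${data_dir}\"|" "$config_file" > "$tmp_file" &&
        mv "$tmp_file" "$config_file"
}

# Example call, using the download folder from the previous step:
#   set_data_paths config/config.data.hg19.yaml "${WORKING_DIRECTORY}/design_and_ref_files"
```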

Input sample files

The pipeline uses two sample input files, samples.tsv and units.tsv, which describe the samples, the sequencing metadata, and the location of the fastq files. The specification for these files can be found at Twist Solid schemas. With the Python virtual environment created above activated, the files can be generated automatically using hydra-genetics create-input-files:

hydra-genetics create-input-files -d path/to/fastq-files/

Note: Sample names cannot include "_" (underscore)!
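The underscore rule is easy to check before starting a run. The sketch below assumes that samples.tsv is tab-separated, has a header row, and keeps the sample name in the first column. Check this against the schema before relying on it. The helper name check_sample_names is hypothetical.

```shell
# Print every sample name that contains "_" and return non-zero if any is found.
# Only the first column is inspected; the header row is skipped.
check_sample_names() {
    awk -F '\t' 'NR > 1 && $1 ~ /_/ { print "invalid sample name: " $1; bad = 1 }
                 END { exit bad + 0 }' "$1"
}

# Example call:
#   check_sample_names samples.tsv
```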

Run command

With the Python virtual environment created above activated, the pipeline can be run with this basic command:

snakemake --profile profiles/NAME_OF_PROFILE -s workflow/Snakefile


Snakemake has many additional run options, some of which are listed below. Options that are always used should be put in the profile.

  • --notemp - Saves all intermediate files. Good for development and testing different options.
  • --until - Runs the workflow only up to the specified rule, that is, the rule itself and the rules it depends on.
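The options above are appended to the basic run command. One possible way to avoid retyping it is a thin wrapper function, sketched below. The name run_pipeline and the SNAKEMAKE override variable are hypothetical conveniences and not part of the pipeline, and RULE_NAME is a placeholder for a rule in the workflow.

```shell
# Thin wrapper around the basic run command; extra arguments are passed through.
# Setting SNAKEMAKE=echo prints the arguments instead of running snakemake.
run_pipeline() {
    profile="$1"
    shift
    "${SNAKEMAKE:-snakemake}" --profile "profiles/${profile}" -s workflow/Snakefile "$@"
}

# Keep all intermediate files while testing:
#   run_pipeline NAME_OF_PROFILE --notemp
# Stop once a given rule has run:
#   run_pipeline NAME_OF_PROFILE --until RULE_NAME
```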


Note: Make sure that Singularity and DRMAA are available on the system where the pipeline will be run.