Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

Overview

Note

We're going to change the BASE variable for these exercises:

export BASE=/corral-repl/utexas/BioITeam/tacc_ngs

The allseqs_R1.fastq file has 2,434,300 sequences and required ~2-3 minutes to align to the human genome (depending on which aligner we chose) if we cranked up the amount of threading used. But, this is a relatively small file by NextGen standards, representing 0.06x coverage of the human genome. Consider a file 20x larger than this one, which will now take 40-60 minutes to align and represents just 1.2x coverage of the genome, and consider that we need 30-60 coverage to reliably call variants. How long are we willing to wait for alignments to process? 16h? 24h? We need to be able to distribute the work across multiple nodes, not just multiple processors.

...

Create a 'launcher' directory in your $WORK folder and cd into it. Next, copy over a utility script for splitting NextGen sequence files to this directory (courtesy of PerM https://code.google.com/p/perm/).

Code Block
cp /corral-repl/utexas/BioITeam/tacc_ngs$BASE/splitReads.sh .

and copy our example launcher script to this working directory. It's too complicated to type in!

Code Block
cp /corral-repl/utexas/BioITeam/tacc_ngs$BASE/bowtie-launcher.sh .

Now follow along as your instructor dissects the annotated bowtie-launcher script, and we'll submit at the very end. Then, you can say you've run a parallel NextGen job at TACC!

Code Block
#!/bin/bash

#$ -V  
#$ -cwd 
#$ -pe 1way 48
#$ -q normal 
#$ -l h_rt=01:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N bowtie-launcher

# Load the bowtie AND launcher modules
module load launcher
module load bowtie/0.12.8
module load samtools

# Simple variable to save typing
BASE="/corral-repl/utexas/BioITeam/tacc_ngs/human_variation/"
PREFIX="query"

# Create a working directory to hold a lot of intermediate files
TEMPDIR="tmp"
mkdir -p $TEMPDIR

# Now, run the handy splitReads utility
# Usage: spitReads.sh sourceFile outputPrefix
#
./splitReads.sh $BASE/human_variation/bigseqs_R1.fastq $TEMPDIR/$PREFIX
# The temp directory will now contain 46 files containing ~1M reads each

# Iterate over the 46 subfiles in the temp directory
# Craft a bowtie alignment command and write it out to the paramlist file
#
touch bowtie-launcher.paramlist
for C in ${TEMPDIR}/${PREFIX}_*
do
	echo "time bowtie --threads 12 -x -t -S $BASE/human_variation/ref/hs37d5.fa ${C} ${C}.sam && samtools view -S -b ${C}.sam > ${C}.bam" >> bowtie-launcher.paramlist
done

# Submit to the TACC Launcher
#
EXECUTABLE=$TACC_LAUNCHER_DIR/init_launcher
time $TACC_LAUNCHER_DIR/paramrun $EXECUTABLE bowtie-launcher.paramlist

# Optional: Consolidate the BAM files into a single BAM
# You can do this in a separate, dependent script so that
# you free up all the other nodes associated with this task
# but here we show it so you can see that the final result of this
# workflow can be a single file
#
BAMS=${TEMPDIR}/*.bam
samtools merge bigseqs_R1.bam ${BAMS}

Supplemental Material

  1. Using Rscript with the Launcher