Using the TACC Parametric Launcher to speed up mapping
Overview
We're going to change the BASE variable for these exercises:
export BASE=/corral-repl/utexas/BioITeam/tacc_ngs
The allseqs_R1.fastq file has 2,434,300 sequences and took ~2-3 minutes to align to the human genome (depending on which aligner we chose) once we cranked up the threading. But this is a relatively small file by NextGen standards, representing only about 0.06x coverage of the human genome. A file 20x larger would take 40-60 minutes to align and would still represent just 1.2x coverage of the genome, yet we need 30-60x coverage to reliably call variants. How long are we willing to wait for alignments to finish? 16 hours? 24? We need to be able to distribute the work across multiple nodes, not just multiple processors on one node.
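Where does that 0.06x figure come from? Coverage is just (number of reads x read length) / genome size. A quick back-of-the-envelope check (the ~75 bp read length is our assumption here, not stated above):

READS=2434300
READLEN=75           # assumed read length in bp
GENOME=3100000000    # human genome is ~3.1 Gbp
echo "scale=3; $READS * $READLEN / $GENOME" | bc
# prints ~.058, i.e. roughly 0.06x coverage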
Enter the TACC Parametric Launcher, which lets you submit a batch of commands all at once to be processed on a dedicated set of compute nodes. Why not just submit a bunch of small jobs? Large, shared systems like Lonestar limit the number of jobs you can have in the queue at once (50), and each job has to wait independently in the queue. Due to the FairShare policy implemented on many large shared systems, your jobs will wait slightly longer each time a new one enters the queue and starts running. Even without FairShare, each job has a finite, non-zero waiting time. The idea behind the Launcher is to consolidate all that computing into one big request. You may wait a bit longer to get access to a larger number of nodes, but all in all, your computing will complete MUCH more quickly and efficiently.
Steps for using the TACC Parametric Launcher
- Load the Launcher module
- Set up a working directory
- Divide the work into separate files in that directory
- Write out alignment and processing commands for each chunk of work to a 'paramlist' file (there's a sample just after this list)
- Send the paramlist to a program called 'launcher'
- Consolidate the results into a single result file (optional)
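A 'paramlist' is nothing more than a plain text file with one complete shell command per line; the Launcher hands one line to each available slot and keeps dispatching lines until the file is exhausted. As a hypothetical illustration, the first couple of lines might look like this (the chunk names and index path are placeholders and depend on how your reads were split):

time bowtie --threads 12 -t -S /path/to/bowtie/index tmp/query_00 tmp/query_00.sam && samtools view -S -b tmp/query_00.sam > tmp/query_00.bam
time bowtie --threads 12 -t -S /path/to/bowtie/index tmp/query_01 tmp/query_01.sam && samtools view -S -b tmp/query_01.sam > tmp/query_01.bam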
We're going to work through an example of this for Bowtie. There's some fancy shell scripting in here that is optional, but it has the benefit of being portable to many of the things we think you'll want to do in the future. Feel free to borrow this example and use it in your own work!
Launcher Tutorial
Create a 'launcher' directory in your $WORK folder and cd into it. Next, copy over a utility script for splitting NextGen sequence files to this directory (courtesy of PerM https://code.google.com/p/perm/).
cp $BASE/splitReads.sh .
and copy our example launcher script to this working directory. It's too complicated to type in!
cp $BASE/bowtie-launcher.sh .
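If you're curious what splitReads.sh is doing, or ever need to split a FASTQ file without it, the standard split command gets you most of the way there. Each FASTQ record is exactly 4 lines, so splitting on a multiple of 4 keeps records intact. A minimal sketch (not necessarily how splitReads.sh does it internally), producing chunks of ~1,000,000 reads:

# 1,000,000 reads x 4 lines per read = 4,000,000 lines per chunk
mkdir -p tmp
split -l 4000000 allseqs_R1.fastq tmp/query_
ls tmp/    # query_aa, query_ab, query_ac, ...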
Now follow along as your instructor dissects the annotated bowtie-launcher script, and we'll submit at the very end. Then, you can say you've run a parallel NextGen job at TACC!
#!/bin/bash
#$ -V
#$ -cwd
#$ -pe 1way 48
#$ -q normal
#$ -l h_rt=01:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N bowtie-launcher

# Load the bowtie AND launcher modules
module load launcher
module load bowtie/0.12.8
module load samtools

# Simple variables to save typing
BASE="/corral-repl/utexas/BioITeam/tacc_ngs"
PREFIX="query"

# Create a working directory to hold a lot of intermediate files
TEMPDIR="tmp"
mkdir -p $TEMPDIR

# Now, run the handy splitReads utility
# Usage: splitReads.sh sourceFile outputPrefix
#
./splitReads.sh $BASE/human_variation/bigseqs_R1.fastq $TEMPDIR/$PREFIX
# The temp directory will now contain 46 files of ~1M reads each

# Iterate over the 46 subfiles in the temp directory, craft a bowtie
# alignment command for each, and write it out to the paramlist file
# (one complete command per line)
#
touch bowtie-launcher.paramlist
for C in ${TEMPDIR}/${PREFIX}_*
do
    echo "time bowtie --threads 12 -t -S $BASE/human_variation/ref/hs37d5.fa ${C} ${C}.sam && samtools view -S -b ${C}.sam > ${C}.bam" >> bowtie-launcher.paramlist
done

# Submit to the TACC Launcher
#
EXECUTABLE=$TACC_LAUNCHER_DIR/init_launcher
time $TACC_LAUNCHER_DIR/paramrun $EXECUTABLE bowtie-launcher.paramlist

# Optional: Consolidate the BAM files into a single BAM.
# You can do this in a separate, dependent script so that
# you free up all the other nodes associated with this task,
# but here we show it so you can see that the final result of this
# workflow can be a single file
#
BAMS=${TEMPDIR}/*.bam
samtools merge bigseqs_R1.bam ${BAMS}
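The #$ directives at the top of the script are SGE-style (Lonestar used the SGE scheduler when this was written), so once you've swapped in your own allocation (-A) and email address, submitting and monitoring the job looks like this:

qsub bowtie-launcher.sh
qstat    # watch the job move from waiting (qw) to running (r)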