Exercise #3: PE alignment with BioITeam scripts
Now that you've done everything the hard way, let's see how to do run an alignment pipeline using a BWA alignment script maintained by the BioITeam, /work2/projects/BioITeam/common/script/align_bwa_illumina.sh. Type in the script name to see its usage.
Code Block |
---|
|
align_bwa_illumina.sh 2021_06_05
Align Illumina SE or PE data with bwa. Produces a sorted, indexed,
duplicate-marked BAM file and various statistics files. Usage:
align_bwa_illumina.sh <aln_mode> <in_file> <out_pfx> <assembly> [ paired trim_sz trim_sz2 seq_fmt qual_fmt ]
Required arguments:
aln_mode Alignment mode, either global (bwa aln) or local (bwa mem).
in_file For single-end alignments, path to input sequence file.
For paired-end alignments using fastq, path to the the R1
fastq file which must contain the string 'R1' in its name.
The corresponding 'R2' must have the same path except for 'R1'.
out_pfx Desired prefix of output files in the current directory.
assembly One of hg38, hg19, hg38, mm10, mm9, sacCer3, sacCer1, ce11, ce10,
danRer7, hs_mirbase, mm_mirbase, or reference index prefix.
Optional arguments:
paired 0 = single end alignment (default); 1 = paired end.
trim_sz Size to trim reads to. Default 0 (no trimming)
trim_sz2 Size to trim R2 reads to for paired end alignments.
Defaults to trim_sz
seq_fmt Format of sequence file (fastq, bam or scarf). Default is
fastq if the input file has a '.fastq' extension; scarf
if it has a '.sequence.txt' extension.
qual_type Type of read quality scores (sanger, illumina or solexa).
Default is sanger for fastq, illumina for scarf.
Environment variables:
show_only 1 = only show what would be done (default not set)
aln_args other bowtie2 options (e.g. '-T 20' for mem, '-l 20' for aln)
no_markdup 1 = don't mark duplicates (default 0, mark duplicates)
run_fastqc 1 = run fastqc (default 0, don't run). Note that output
will be in the directory containing the fastq files.
keep 1 = keep unsorted BAM (default 0, don't keep)
bwa_bin BWA binary to use. Default bwa 0.7.x. Note that bwa 0.6.2
or earlier should be used for scarf and other short reads.
also: NUM_THREADS, BAM_SORT_MEM, SORT_THREADS, JAVA_MEM_ARG
Examples:
align_bwa_illumina.sh local ABC_L001_R1.fastq.gz my_abc hg38 1
align_bwa_illumina.sh global ABC_L001_R1.fastq.gz my_abc hg38 1 50
align_bwa_illumina.sh global sequence.txt old sacCer3 0 '' '' scarf solexa |
There are lots of bells and whistles in the arguments, but the most important are the first few:
- aln_mode – whether to perform a global or local alignment (the 1st argument must be one of those words)
- global mode uses the bwa aln workflow as we did above
- local mode uses the bwa mem command
- in_file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
- out_pfx – prefix for all the output files produced by the script. Should relate back to what the data is.
- assembly – genome assembly to use.
- there are pre-built indexes for some common eukaryotes (hg38, hg19, mm10, mm9, danRer7, sacCer3) that you can use
- or provide a full path for a bwa reference index you have built somewhere
- paired flag – 0 means single end (the default); 1 means paired end
- trim_sz – if you want the FASTQ hard trimmed down to a specific length before alignment, supply that number here
We're going to run this script and a similar Bowtie2 alignment script, on the yeast data using the TACC batch system. In a new directory, copy over the commands and submit the batch job. We ask for 2 hours (-t 02:00:00) with 4 tasks/node (-w 4); since we have 4 commands, this will run on 1 compute node.
Expand |
---|
title | Catch up (if needed) |
---|
|
Code Block |
---|
| # Copy over the Yeast data if needed
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/Sample_Yeast*.gz $SCRATCH/core_ngs/alignment/fastq/
|
|
Code Block |
---|
language | bash |
---|
title | Run multiple alignments using the TACC batch system |
---|
|
# Make sure you're not in an idev session by looking at the hostname
hostname
# If the hostname looks like "c455-004.stampede2.tacc.utexas.edu", exit the idev session
# Make a new alignment directory for running these scripts
mkdir -p $SCRATCH/core_ngs/alignment/bwa_script
cd $SCRATCH/core_ngs/alignment/bwa_script
ln -s -f ../fastq
# Copy the alignment commands file and submit the batch job
cp $CORENGS/tacc/aln_script.cmds .
launcher_creator.py -j aln_script.cmds -n aln_script -t 02:00:00 -w 4 -a UT-2015-05-18 -q normal
sbatch --reservation=BIO_DATA_week_1 aln_script.slurm
showq -u |
While we're waiting for the job to complete, lets look at the aln_script.cmds file.
Code Block |
---|
language | bash |
---|
title | Commands to run multiple alignment scripts |
---|
|
/work2/projects/BioITeam/common/script/align_bwa_illumina.sh global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_global sacCer3 1 50
/work2/projects/BioITeam/common/script/align_bwa_illumina.sh local ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_local sacCer3 1
/work2/projects/BioITeam/common/script/align_bowtie2_illumina.sh global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_global sacCer3 1 50
/work2/projects/BioITeam/common/script/align_bowtie2_illumina.sh local ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_local sacCer3 1 |
Notes:
- The 1st command performs a paired-end BWA global alignment (similar to above), but asks that the 100 bp reads be trimmed to 50 first.
- we refer to the pre-built index for yeast by name: sacCer3
- this index is located in the /work/projects/BioITeam/ref_genome/bwa/bwtsw/sacCer3/ directory
- we provide the name of the R1 FASTQ file
- because we request a PE alignment (the 1 argument) the script will look for a similarly-named R2 file.
- all output files associated with this command will be named with the prefix bwa_global.
- The 2nd command performs a paired-end BWA local alignment.
- all output files associated with this command will be named with the prefix bwa_local.
- no trimming is requested because the local alignment should ignore 5' and 3' bases that don't match the reference genome
- The 3rd command performs a paired-end Bowtie2 global alignment.
- the Bowtie2 alignment script has the same first arguments as the BWA alignment script.
- all output files associated with this command will be named with the prefix bt2_global.
- again, we specify that reads should first be trimmed to 50 bp.
- The 4th command performs a paired-end Bowtie2 local alignment.
- all output files associated with this command will be named with the prefix bt2_local.
- again, no trimming is requested for the local alignment.
Output files
This alignment pipeline script performs the following steps:
- Hard trims FASTQ, if optionally specified (fastx_trimmer)
- Performs the global or local alignment (here, a PE alignment)
- BWA global: bwa aln the R1 and R2 separately, then bwa sampe to produce a SAM file
- BWA local: call bwa mem with both R1 and R2 to produce a SAM file
- Bowtie2 global: call bowtie2 --global with both R1 and R2 to produce a SAM file
- Bowtie2 local: call bowtie2 --local with both R1 and R2 to produce a SAM file
- Converts SAM to BAM (samtools view)
- Sorts the BAM (samtools sort)
- Marks duplicates (Picard MarkDuplicates)
- Indexes the sorted, duplicate-marked BAM (samtools index)
- Gathers statistics (samtools idxstats, samtools flagstat, plus a custom statistics script of Anna's)
- Removes intermediate files
There are a number of output files, with the most important being those desribed below.
- <prefix>.align.log – Log file of the entire alignment process.
- check the tail of this file to make sure the alignment was successful
- <prefix>.sort.dup.bam – Sorted, duplicate-marked alignment file.
- <prefix>.sort.dup.bam.bai – Index for the sorted, duplicate-marked alignment file
- <prefix>.flagstat.txt – samtools flagstat output
- <prefix>.idxstats.txt – samtools idxstats output
- <prefix>.samstats.txt – Summary alignment statistics from Anna's stats script
- <prefix>.iszinfo.txt – Insert size statistics (for paired-end alignments) from Anna's stats script
Verifying alignment success
The alignment log will have a "I ran successfully" message at the end if all went well, and if there was an error, the important information should also be at the end of the log file. So you can use tail to check the status of an alignment. For example:
Code Block |
---|
language | bash |
---|
title | Checking the alignment log file |
---|
|
tail bwa_global.align.log |
This will show something like:
Code Block |
---|
..Done alignmentUtils.pl bamstats - 2020-06-14 23:19:38
.. samstats file 'bwa_global.samstats.txt' exists and is not empty - 2020-06-14 23:19:38
===============================================================================
## Cleaning up files (keep 0) - 2020-06-14 23:19:38
===============================================================================
ckRes 0 cleanup
===============================================================================
## All bwa alignment tasks completed successfully! - 2020-06-14 23:19:38
=============================================================================== |
Notice that success message: "All bwa alignment tasks completed successfully!". It should only appear once in any successful alignment log.
When multiple alignment commands are run in parallel it is important to check them all, and you can use grep looking for part of the unique success message to do this. For example:
Code Block |
---|
language | bash |
---|
title | Count the number of successful alignments |
---|
|
grep 'completed successfully!' *align.log | wc -l |
If this command returns 4 (the number of alignment tasks we performed), all went well, and we're done.
But what if something went wrong? How can we tell which alignment task was not successful? You could tail the log files one by one to see which one(s) don't have the message, but you can also use a special grep option to do this work.
Code Block |
---|
language | bash |
---|
title | Check for failed alignment tasks |
---|
|
grep -L 'completed successfully' *.align.log |
The -L option tells grep to only print the filenames that don't contain the pattern. Perfect! To see happens in the case of failure, try it on a file that doesn't contain that message:
Code Block |
---|
|
grep -L 'completed successfully' aln_script.cmds |
Checking alignment statistics
The <prefix>.samstats.txt statistics files produced by the alignment pipeline has a lot of good information in one place. If you look at bwa_global.samstats.txt you'll see something like this:
Code Block |
---|
title | <prefix>.samstats.txt output |
---|
|
-----------------------------------------------
Aligner: bwa
Total sequences: 1184360
Total mapped: 539079 (45.5 %)
Total unmapped: 645281 (54.5 %)
Primary: 539079 (100.0 %)
Secondary:
Duplicates: 249655 (46.3 %)
Fwd strand: 267978 (49.7 %)
Rev strand: 271101 (50.3 %)
Unique hit: 503629 (93.4 %)
Multi hit: 35450 (6.6 %)
Soft clip:
All match: 531746 (98.6 %)
Indels: 7333 (1.4 %)
Spliced:
-----------------------------------------------
Total PE seqs: 1184360
PE seqs mapped: 539079 (45.5 %)
Num PE pairs: 592180
F5 1st end mapped: 372121 (62.8 %)
F3 2nd end mapped: 166958 (28.2 %)
PE pairs mapped: 80975 (13.7 %)
PE proper pairs: 16817 (2.8 %)
----------------------------------------------- |
Since this was a paired end alignment there is paired-end specific information reported.
You can also view statistics on insert sizes for properly paired reads in the bwa_global.iszinfo.txt file. This tells you the average (mean) insert size, standard deviation, mode (most common value), and fivenum values (minimum, 1st quartile, median, 3rd quartile, maximum).
Code Block |
---|
title | <prefix>.iszinfo.txt output |
---|
|
Insert size stats for: bwa_global
Number of pairs: 16807 (proper)
Number of insert sizes: 406
Mean [-/+ 1 SD]: 296 [176 416] (sd 120)
Mode [Fivenum]: 228 [51 224 232 241 500] |
A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example:
Code Block |
---|
language | bash |
---|
title | Review multiple alignment rates |
---|
|
grep 'Total mapped' *samstats.txt |
will produce output like this:
Code Block |
---|
bt2_global.samstats.txt: Total mapped: 602893 (50.9 %)
bt2_local.samstats.txt: Total mapped: 788069 (66.5 %)
bwa_global.samstats.txt: Total mapped: 539079 (45.5 %)
bwa_local.samstats.txt: Total mapped: 1008000 (76.5 % |
Exercise: How would you list the median insert size for all the alignments?
Expand |
---|
|
That information is in the *.iszinfo.txt files, on the line labeled Mode. The median value is th 3rd value in the 5 fivnum values; it is the 7th whitespace-separated field on the Mode line.
|
Expand |
---|
|
Use grep to isolate the Mode line, and awk to isolate the median value field: Code Block |
---|
| grep 'Mode' *.iszinfo.txt | awk '{print $1,"Median insert size:",$7}' |
|
TACC batch system considerations
The great thing about pipeline scripts like this is that you can perform alignments on many datasets in parallel at TACC, and they are written to take advantage of having multiple cores on TACC nodes where possible.
On the stampede2, with its 68 physical cores per node, they are designed to run best with no more than 4 tasks per node.
Tip |
---|
title | Always specify wayness 4 for alignment pipeline scripts |
---|
|
These alignment scripts should always be run with a wayness of 4 (-w 4) in the stampede2 batch system, meaning at most 4 commands per node. |
Exercise #4: Bowtie2 global alignment - Vibrio cholerae RNA-seq
...