Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleExpand here if you'd like to try this on your own...

Let's use Anna Battenhouse's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script.

Don't forget that for this to work, you need to have appended $BI/bin to your path.

Expand
Show me how to modify my path...
Show me how to modify my path...

BI="/corral-repl/utexas/BioITeam"
PATH=$PATH:$BI/bin
export PATH

Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test".

 

Expand
Solution
Solution
Code Block
titleMake job submission script for mapping & variant calling
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1 > aln.test.log 2>&1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
module load bwa
module load samtools
qsub map_call.sge

Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)


Expand
Output
Output
Code Block
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586249 ("map_call") has been submitted

Note that the input is paired-end data.

Expand
Output
Output

Your directory should have content like this when done (from ls -lt):

Code Block
titleMapping output
-rw-r--r-- 1 sphsmith G-801020      5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020       392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020   2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020     13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020  45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020  45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020         0 May 20 22:26 map_call.pe586338

and samtools flagstat test.sorted.bam should yield:

Code Block
titlesamtools flagstat results
Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)


Single-sample variant calling with samtools

We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it).
$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

...

Expand
Output
Output
Code Block
login1$ echo "samtools mpileup -uf /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa \
>   /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
>   | bcftools view -vcg - > test.raw.vcf" > commands
login1$ launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586369 ("call_vars") has been submitted

After your single-sample variant calling job completes

Expand
If you don't want to wait, click here...
If you don't want to wait, click here...

If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this:
ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf
and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data.

...

If you don't need a variant caller that uses a full Bayesian inference engine with robust filtering, protections, etc. but just want to look for weird base-artifacts within your mapped data, check out samtools mpileup output directly.

 

GATK: A much more sophisticated option

 

http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr