...
Expand |
---|
title | Expand here if you'd like to try this on your own... |
---|
|
Let's use Anna Battenhouse's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script. Don't forget that for this to work, you need to have appended $BI/bin to your path. Expand |
---|
| Show me how to modify my path... |
---|
| Show me how to modify my path... |
---|
| BI="/corral-repl/utexas/BioITeam" PATH=$PATH:$BI/bin export PATH |
Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test". Expand |
---|
| Code Block |
---|
title | Make job submission script for mapping & variant calling |
---|
| echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1 > aln.test.log 2>&1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
module load bwa
module load samtools
qsub map_call.sge
|
Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)
|
Expand |
---|
| Code Block |
---|
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...
Your job 586249 ("map_call") has been submitted
|
|
Note that the input is paired-end data. Expand |
---|
| Your directory should have content like this when done (from ls -lt ): Code Block |
---|
| -rw-r--r-- 1 sphsmith G-801020 5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020 392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020 2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020 13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020 45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020 45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020 0 May 20 22:26 map_call.pe586338
|
and samtools flagstat test.sorted.bam should yield: Code Block |
---|
title | samtools flagstat results |
---|
| Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)
|
|
|
We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it).
$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
...
Expand |
---|
|
Code Block |
---|
login1$ echo "samtools mpileup -uf /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa \
> /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
> | bcftools view -vcg - > test.raw.vcf" > commands
login1$ launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...
Your job 586369 ("call_vars") has been submitted
|
|
After your single-sample variant calling job completes
Expand |
---|
| If you don't want to wait, click here... |
---|
| If you don't want to wait, click here... |
---|
|
If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this: ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data. |
...
If you don't need a variant caller that uses a full Bayesian inference engine with robust filtering, protections, etc. but just want to look for weird base-artifacts within your mapped data, check out samtools mpileup
output directly.
GATK: A much more sophisticated option
http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr