Genome Analysis Toolkit (GATK) -- GVA2019
Overview
The Genome Analysis Toolkit (GATK) is a set of programs developed by the Broad Institute, with an extensive website. As mentioned in the final presentation, it can perform much of the analysis required for calling genomic variants, as well as many other things. Why, you may ask yourself, did this magical tool only appear on the final day of the class? GATK uses read mappers, read aligners, variant callers, and all the other things (or similar things) that you have been introduced to throughout the course, so we have actually been going over what you needed to know in smaller, more digestible chunks.
This tutorial is quite small and showcases only the smallest drop in the bucket of what GATK is capable of doing. This is because the Broad itself has developed many tutorials for all the different things GATK does, and extensive forums are available if the tutorials are not enough to get you through what you are trying to do. Finally, as the makers of the software, they publish and maintain what they regard as the best way to use their product in the form of 'best practices'. If you are going to use GATK, it's a really good idea to make sure you are following their best practices, because people will raise a big eyebrow if you say you are going against the flow.
While GATK is great, one-stop shops are often not the best at everything they do, so don't be afraid to use other programs, particularly by following what other researchers in your field are doing.
Objectives
- Load GATK on Lonestar
- Use the sample data provided by the Broad (through TACC) to verify that GATK is working
- Explore a little of what is under the hood.
Tutorial: Loading GATK
While you may think based on the overview that GATK is an obvious choice for a module on TACC, you may be surprised to learn that seemingly every other year TACC removes it as a module, and this is a bad year. On the plus side, it means that once we install it for you locally, the only issue will be if you need to update the version, and recent changes to GATK have made it much easier to work with.
cd $WORK/src  # if you get a 'no such file or directory' warning, I suggest you create this folder so you can think 'where do I put the programs I download using wget on TACC... $WORK/src', much in the same way you can think 'where do I put binary executable files after the files I download on TACC are extracted/installed... $HOME/local/bin'
wget https://github.com/broadinstitute/gatk/releases/download/4.1.2.0/gatk-4.1.2.0.zip
unzip gatk-4.1.2.0.zip
cp gatk-4.1.2.0/gatk $HOME/local/bin  # the same note applies if you do not have a $HOME/local/bin directory
cp gatk-4.1.2.0/*.jar $HOME/local/bin
cds
gatk -help  # if this does not output a large list of colored text, try the following command, and if that does not output colored text either, get my attention
gatk --list
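The comments above mention creating the two folders if they are missing. A minimal sketch of that step follows; the fallback to $HOME when $WORK is unset is only an assumption so the lines also run off TACC, and the PATH check is an extra convenience, not part of the original instructions.

```shell
# Create the download and install folders if they are missing.
# mkdir -p does nothing when the folder already exists, so this is safe to rerun.
# WORK is set by TACC; falling back to HOME is an assumption for other machines.
src_dir="${WORK:-$HOME}/src"
bin_dir="$HOME/local/bin"
mkdir -p "$src_dir" "$bin_dir"

# Report whether the install folder is on PATH, since gatk is copied there.
case ":$PATH:" in
  *":$bin_dir:"*) echo "$bin_dir is on PATH" ;;
  *) echo "$bin_dir is NOT on PATH; consider: export PATH=\"$bin_dir:\$PATH\"" ;;
esac
```

If the second message appears, the gatk command will not be found until the folder is added to PATH.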
If you see 316 lines of long scrolling output detailing some copyright information and a bunch of different commands, everything is correctly loaded. While individual tools require different options, and the program itself takes many different options, only 3 things are ALWAYS required:
flag | Description |
---|---|
(none) | Tool name: which tool you are trying to use |
-R | Reference sequence file |
-I | Input bam file |
Stealing a nice mnemonic device from a GATK tutorial (which is condensed below): the tool name comes right after gatk, and the -R and -I arguments can follow in either order, but if you learn all three in this order (Tool, Reference, Input), you will be able to remember them if you TRI. Remember, specific tools will require additional arguments.
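The Tool, Reference, Input shape can be sketched as a tiny helper that only prints a command line rather than running anything. The function name tri and the placeholder file names are hypothetical, made up for illustration.

```shell
# Hypothetical helper: print a GATK command line in Tool, Reference, Input order.
# It only builds the text; nothing is executed.
tri() {
  printf 'gatk %s -R %s -I %s\n' "$1" "$2" "$3"
}

# Placeholder names, not real files from this tutorial
tri ToolName reference.fasta input.bam
```

Swapping in a real tool name and your own files gives the skeleton that every GATK call in this tutorial starts from.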
Getting sample data
Rather than using sample data specifically for this tutorial, we will instead do a small tutorial based on our read mapping tutorial from day 2 of the course. Assuming you completed that tutorial, the following should work.
Next you need to convert the .sam file to a .bam file.
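One possible way to do the conversion is with samtools, sketched below. The helper name sam_to_bam and the input file name SRR030257.sam are assumptions (use whatever your mapping tutorial produced), and the sorting and indexing steps are a common convention rather than something this page states.

```shell
# Hypothetical helper: sort a SAM file by coordinate, write it as BAM, then index it.
sam_to_bam() {
  sam="$1"
  bam="${sam%.sam}.bam"   # same name with .sam swapped for .bam
  if [ ! -f "$sam" ]; then
    echo "missing input: $sam" >&2
    return 1
  fi
  samtools sort -O bam -o "$bam" "$sam" && samtools index "$bam"
}

# The file name is an assumption based on the bam file used later in this tutorial.
sam_to_bam SRR030257.sam || echo "conversion did not run"
```

Run it from the directory that holds your .sam file; the .bam file is written next to it.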
Tutorial: Use GATK to count the number of reads in a bam file
Using the following information, we will use the gatk CountReads tool to count the number of reads in the SRR030257.bam file, which came from mapping to the NC_012967.1.fasta reference file. Pay attention to the words in bold and the table/discussion in the previous tutorial section and see if you can figure out how to do this on your own.
Did you get the result you expected?
In following that very helpful link, you'll see a discussion about a .dict file that is required for fasta references. It might normally be generated by GATK, but as we did not do the read mapping inside of GATK, we lack it. Based on the reading on the linked web page, I came up with the following solution to create that dictionary file.
gatk CreateSequenceDictionary -R NC_012967.1.fasta -O NC_012967.1.dict
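Before retrying, it can help to check which companion files sit next to the reference. The sketch below is a hypothetical helper (check_ref is not a GATK command). It also looks for a .fai index, which GATK generally expects alongside a fasta reference, although this page only discusses the .dict file.

```shell
# Hypothetical helper: report which companion files a FASTA reference is missing.
check_ref() {
  ref="$1"
  base="${ref%.*}"   # strip the final extension, e.g. NC_012967.1.fasta -> NC_012967.1
  [ -f "$base.dict" ] || echo "missing $base.dict (gatk CreateSequenceDictionary -R $ref -O $base.dict)"
  [ -f "$ref.fai" ] || echo "missing $ref.fai (samtools faidx $ref)"
}

check_ref NC_012967.1.fasta
```

No output means both files were found; otherwise each line names the missing file and a command that should create it.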
Now if we retry our CountReads command we get very different output (note that yours will have subtle differences in things like the names of directories):
Using GATK jar /opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar CountReads -R NC_012967.1.fasta -I SRR030257.bam
09:27:58.776 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
May 31, 2019 9:27:59 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:27:59.973 INFO CountReads - ------------------------------------------------------------
09:27:59.973 INFO CountReads - The Genome Analysis Toolkit (GATK) v4.1.2.0
09:27:59.973 INFO CountReads - For support and documentation go to https://software.broadinstitute.org/gatk/
09:27:59.973 INFO CountReads - Executing as ded@login1 on Linux v4.4.103-6.38-default amd64
09:27:59.973 INFO CountReads - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_151-b12
09:27:59.973 INFO CountReads - Start Date/Time: May 31, 2019 9:27:58 AM CDT
09:27:59.973 INFO CountReads - ------------------------------------------------------------
09:27:59.973 INFO CountReads - ------------------------------------------------------------
09:27:59.974 INFO CountReads - HTSJDK Version: 2.19.0
09:27:59.974 INFO CountReads - Picard Version: 2.19.0
09:27:59.974 INFO CountReads - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:27:59.974 INFO CountReads - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:27:59.974 INFO CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:27:59.974 INFO CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:27:59.974 INFO CountReads - Deflater: IntelDeflater
09:27:59.974 INFO CountReads - Inflater: IntelInflater
09:27:59.974 INFO CountReads - GCS max retries/reopens: 20
09:27:59.974 INFO CountReads - Requester pays: disabled
09:27:59.974 INFO CountReads - Initializing engine
09:28:00.322 INFO CountReads - Done initializing engine
09:28:00.322 INFO ProgressMeter - Starting traversal
09:28:00.322 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
09:28:11.059 INFO CountReads - 7600360 read(s) filtered by: WellformedReadFilter
09:28:11.061 INFO ProgressMeter - unmapped 0.2 0 0.0
09:28:11.062 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 0.2 minutes.
09:28:11.062 INFO CountReads - Shutting down engine
[May 31, 2019 9:28:11 AM CDT] org.broadinstitute.hellbender.tools.CountReads done. Elapsed time: 0.21 minutes.
Runtime.totalMemory()=2847932416
Tool returned:
0
What in all that are we actually looking for, you might ask?
09:28:11.059 INFO CountReads - 7600360 read(s) filtered by: WellformedReadFilter
09:28:11.062 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 0.2 minutes.
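To pull these two summary lines out of the full log without scrolling, a grep filter is one option. The helper name summarize_log is hypothetical, and the demo input is three lines copied from the output above rather than a fresh run.

```shell
# Hypothetical helper: keep only the filter and traversal summary lines
# from a GATK log read on standard input.
summarize_log() {
  grep -E 'filtered by:|Traversal complete'
}

# Demo on three lines copied from the log above; only the last two are kept.
summarize_log <<'EOF'
09:28:00.322 INFO CountReads - Done initializing engine
09:28:11.059 INFO CountReads - 7600360 read(s) filtered by: WellformedReadFilter
09:28:11.062 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 0.2 minutes.
EOF
```

With a real run, something like `gatk CountReads -R NC_012967.1.fasta -I SRR030257.bam 2>&1 | summarize_log` should work; the `2>&1` is there so the filter sees the messages whether they are written to standard output or standard error.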
This tells us that the bam file contains 7600360 reads in total. Read the two lines carefully, though: the log reports all 7600360 reads as filtered by the WellformedReadFilter, and the traversal then processed 0 reads. We did not ask for any filtering ourselves, so this comes from a read filter that GATK applies by default rather than from an option we set. Can you figure out why 7600360 might be different from the total of 9555594 reads that were present in the original fastq files SRR030257_1.fastq and SRR030257_2.fastq (check them in your GVA_bowtie2_mapping directory)?
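To check the fastq side of that comparison yourself, you can count reads by counting lines, since a standard fastq record is 4 lines long. This is a minimal sketch for uncompressed files; the helper name count_fastq_reads is hypothetical, and it assumes you run it from the directory holding the fastq files.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file (4 lines per read).
count_fastq_reads() {
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}

# Report each file that is present in the current directory.
for fq in SRR030257_1.fastq SRR030257_2.fastq; do
  if [ -f "$fq" ]; then
    echo "$fq: $(count_fastq_reads "$fq") reads"
  fi
done
```

Comparing the per-file counts and their sum with the number GATK reports is a good starting point for answering the question above.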
As mentioned, this is a very small introduction to GATK adapted from one of the Broad's tutorials, which can be found here: http://gatkforums.broadinstitute.org/gatk/discussion/1209/howto-run-the-gatk-for-the-first-time. Feel free to explore that link and the other tutorial links for taking GATK further.