IGV Tutorial -- GVA2022
- 1 Overview
- 2 Learning Objectives
- 3 Theory
- 4 Installing IGV
- 5 Viewing E. coli data in IGV
- 5.1 Data files
- 5.2 Prepare a GFF feature file for the reference sequence
- 5.3 Copy files to your local computer
- 5.3.1 This is probably the largest code box in the entire course, note that some of these lines likely extend beyond the right side of the window and you may need to scroll to the right to see the entire command
- 5.3.2 You can use the tar --help option to try to figure out how to extract the contents, but I suggest the following command:
- 5.4 Load genome into IGV
- 5.5 Load mapped reads into IGV
- 5.6 Load variant calls into IGV
- 5.7 Navigating in IGV
- 5.8 Exercises
- 6 Viewing Human Genome Data in IGV
- 7 Optional Tutorial Exercises ...
- 8 Alternative genome browsers:
Overview
The Integrative Genomics Viewer (IGV) from the Broad Institute allows you to view several types of data files involved in any NGS analysis that employs a reference genome, including how reads from a dataset are mapped, gene annotations, and predicted genetic variants.
Learning Objectives
Create a custom genome database (usually used for microbial genomes).
Load a pre-existing genome assembly (usually used for the genomes of model organisms and higher Eukaryotes).
Load output from mapping reads to a reference genome.
Load output from calling genetic variants.
Navigate the view of the genome and interpret the display of this data.
Theory
Because NGS datasets are very large, it is often impossible or inefficient to read them entirely into a computer's memory when searching for a specific piece of data. In order to more quickly retrieve the data we are interested in analyzing or viewing, most programs have a way of treating these data files as databases. Database indexes enable one to rapidly pull specific subsets of the data from them.
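To make the idea of an index concrete, here is a toy sketch in plain shell. It is not how BAM or VCF indexes are actually structured; the file names and records (data.tsv, data.idx, geneA and so on) are made up for illustration. The index stores the byte offset at which each record starts, so a lookup can jump straight to that position instead of scanning the whole file.

```shell
# Toy illustration only: hypothetical file names and records.
index_demo_dir=$(mktemp -d)

# A small "data file" with one tab-separated record per line.
printf 'geneA\t100\ngeneB\t200\ngeneC\t300\n' > "$index_demo_dir/data.tsv"

# Build the "index": record name -> byte offset where its line starts.
awk -F'\t' 'BEGIN { offset = 0 }
            { print $1 "\t" offset; offset += length($0) + 1 }' \
    "$index_demo_dir/data.tsv" > "$index_demo_dir/data.idx"

# Look up geneC in the small index, then jump directly to that byte
# in the data file (tail -c +N starts output at byte N, counting from 1).
offset=$(awk -F'\t' '$1 == "geneC" { print $2 }' "$index_demo_dir/data.idx")
record=$(tail -c +"$((offset + 1))" "$index_demo_dir/data.tsv" | head -n 1)

echo "offset of geneC: $offset"
echo "record: $record"

rm -r "$index_demo_dir"
```

Real index files (such as the .fai and .bai files used later in this tutorial) follow the same general principle: a small companion file tells the program where to look in the large one.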
The Integrative Genomics Viewer is a program for reading several types of indexed database information, including mapped reads and variant calls, and displaying them on a reference genome. It is invaluable as a tool for viewing and interpreting the "raw data" of many NGS data analysis pipelines.
With all that being said, the goal of visualizing the data with IGV is not to look at every read, or even every base in the reference genome (which is actually the smaller of the 2 possibilities!). IGV works best for digging deeper into something that you are already interested in, whether because of the gene itself, because something seems confusing (e.g. trying to figure out why mapping quality declines for a particular variant), or because you are trying to familiarize yourself with the concepts of what is going on.
Installing IGV
This is done on your local computer, not on TACC. IGV cannot be installed on TACC, which should make some sense to you, as IGV is a program designed to let you visualize information and we know TACC doesn't allow GUIs.
There are multiple ways to launch IGV on a local computer. For this course I recommend going, in a separate web browser window/tab, to: http://www.broadinstitute.org/software/igv/download and selecting the appropriate operating system with java included. Mac users will need to unzip the file and launch the application. Windows users will need to download, choose an installation location, agree to some licenses, navigate to the installation location and then launch the program. Believe it or not, this is a significantly improved process compared to the actions that used to be required.
Viewing E. coli data in IGV
Data files
You can start this tutorial two ways:
If you have completed the Mapping tutorial and the SNV calling tutorial, then you should use those files for part 1 of this tutorial. You can proceed with either one alone, or with both.
Prepare a GFF feature file for the reference sequence
IGV likes its reference genome feature files in GFF (General Feature Format) rather than the fasta or gbk formats we've been working with. While you may assume this is a job for our old friend bp_seqconvert.pl, that script actually doesn't deal with GFF files. So, we're going to use another tool for sequence format conversion called Readseq. Install the readseq package from bioconda to a new conda environment to get started.
You have done this several times now, so you should be able to do this without expanding this code block. Try to figure the command out yourself and check your work rather than relying on a copy paste.
conda create -n GVA-readseq -c bioconda readseq
conda activate GVA-readseq
readseq --version
Unlike several of the previous programs we have installed, the --version flag actually prints the entire help file rather than just versioning information. If you scroll up, you will see that the first line of the output is the version, in this case 2.1.30.
readseq is a java based program, which means it is invoked in a very different manner than anything we have worked with thus far. Luckily for us, the conda package includes a wrapper allowing us to invoke the command simply by typing the readseq name, like all the other programs we have worked with.
Review a previous year's tutorial to read about how to invoke the program using java without the readseq wrapper
This is one of only 2 java based programs that this course covers. As the readseq wrapper that conda provides makes this so much easier to invoke, we will use it. It is recommended to look back at a previous year's tutorial to see how this was handled without the wrapper, in case you encounter a java based program in your own work that doesn't have such a helpful wrapper and need to know where to start.
It's a bit hard to figure out how to build the command yourself: unlike most conventions, the program requires the unnamed arguments before the optional flag arguments, and there is no example command in the help. To do the conversion that we want, and get things where they need to be for the rest of the tutorial, use the following:
cds
mkdir GVA_IGV
cd GVA_IGV
readseq $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.gbk -f GFF -o NC_012967.1.gbk.gff
A final oddity of the readseq program is that rather than displaying any kind of status message, or being silent when executed, the program displays the version of the readseq program itself. This is something that initially made me assume the conversion had failed. Take a look at the contents of the original GenBank file and the new GFF file, and try to get a handle on what is going on in this conversion using commands like head and tail.
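If you find yourself typing head and tail on the same file repeatedly, a small helper function can show both ends at once. This is only a convenience sketch: the function name peek and its variable names are made up here and are not part of any tool used in this course.

```shell
# peek FILE [N]: print the first and last N lines of FILE (default 10).
# "peek" is a hypothetical helper name, not an existing command.
peek() {
    peek_file=$1
    peek_n=${2:-10}
    echo "== first $peek_n lines of $peek_file =="
    head -n "$peek_n" "$peek_file"
    echo "== last $peek_n lines of $peek_file =="
    tail -n "$peek_n" "$peek_file"
}
```

After defining it, you could run peek $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.gbk 20 and then peek NC_012967.1.gbk.gff 20 to compare how the same features are written in the two formats.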
File naming conventions
You may notice that the output file has a ".gff" ending appended to the ".gbk" ending rather than replacing it. This can be done to record the order of operations performed on the file (in this case, taking a gbk file and converting it to a gff file). A longer list of operations, such as sequentially filtering a vcf file for frequency above 90%, with mapq scores above 20, on chromosome 7, between 10,000,000 and 190,000,000 bp, might result in file names looking like the following:
sample.vcf
sample.vcf.filtered.freq90
sample.vcf.filtered.freq90.mapq20
sample.vcf.filtered.freq90.mapq20.chr7
sample.vcf.filtered.freq90.mapq20.chr7.10MB-190MB
Some programs do not like unknown or effectively nonsense file endings, in which case you may need to append ".vcf" to the names above.
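As a small sketch of this convention, the loop below builds the chain of names by appending one suffix per processing step, then adds a final ".vcf" for programs that insist on a recognized ending. It only manipulates strings; no filtering is actually performed, and the variable names are made up for illustration.

```shell
# Build the chain of file names from the naming example, one step at a time.
# No files are read or written; this only shows how the names accumulate.
name="sample.vcf"
for step in filtered.freq90 mapq20 chr7 10MB-190MB; do
    name="$name.$step"
    echo "$name"
done

# For programs that reject unfamiliar endings, append the expected one.
final_name="$name.vcf"
echo "$final_name"
```

In a real pipeline, each name in the chain would be the output file of one filtering command and the input file of the next.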
Copy files to your local computer
Again, since IGV is an interactive graphical viewer program that you'll be running on your local computer, we need to get the files we want to visualize onto your desktop machine.
The files we need include:
Indexed reference FASTA files
GFF reference sequence feature files
Sorted and indexed mapped read BAM files
VCF result files
In practice, depending on your analysis, you may need or be able to use additional files, but those won't be discussed here.
Rather than transferring each file individually, from multiple different directories, the easiest (and most common, and best practice) thing to do is:
create a new directory (suggest you include a keyword like export)
copy all the files you want to transfer into the new directory
compress the directory to speed up the transfer
In the case of this tutorial, since many of the tutorial output files have the same names (but reside in different directories), we need to be sure to give them unique destination names when we copy them into the new directory together. Additionally, to ensure you don't overwrite files that you want (here is another reminder about there being no undo command in linux), you can use the -n or -i option with the cp command, and it is good practice to do so. On stampede2 the -n option will not allow you to overwrite files, while the -i option will prompt you before overwriting anything. It is worth noting that these options behave slightly differently across different versions of linux and may not work on all systems, so double check with the help flag before using them.
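The difference between the two options can be tried out safely in a throwaway directory. The sketch below uses made-up file names and contents; the answer to the -i prompt is piped in only so the example can run unattended, whereas in normal use you would type it yourself.

```shell
# Toy demonstration in a temporary directory; file names are hypothetical.
cp_demo_dir=$(mktemp -d)
mkdir "$cp_demo_dir/export"
echo "new" > "$cp_demo_dir/result.vcf"
echo "original" > "$cp_demo_dir/export/result.vcf"

# -n: never overwrite an existing destination file.
# The exit status when a file is skipped differs between cp versions,
# so it is ignored here.
cp -n "$cp_demo_dir/result.vcf" "$cp_demo_dir/export/result.vcf" || true
after_n=$(cat "$cp_demo_dir/export/result.vcf")

# -i: prompt before overwriting; answering "n" keeps the existing file.
printf 'n\n' | cp -i "$cp_demo_dir/result.vcf" "$cp_demo_dir/export/result.vcf" 2>/dev/null || true
after_i=$(cat "$cp_demo_dir/export/result.vcf")

echo "after cp -n: $after_n"
echo "after cp -i (answered n): $after_i"

rm -r "$cp_demo_dir"
```

In both cases the destination file should still contain its original content, which is the protection you want when collecting files from several directories into one.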
This is probably the largest code box in the entire course, note that some of these lines likely extend beyond the right side of the window and you may need to scroll to the right to see the entire command
cds
mkdir GVA_IGV_export
cp -i $SCRATCH/GVA_IGV/NC_012967.1.gbk.gff GVA_IGV_export # this is the new file you just created above in the GVA_IGV directory
cp -i $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.fasta GVA_IGV_export
cp -i $SCRATCH/GVA_samtools_tutorial/NC_012967.1.fasta.fai GVA_IGV_export
cp -i $SCRATCH/GVA_samtools_tutorial/SRR030257.vcf GVA_IGV_export
cp -i $SCRATCH/GVA_samtools_tutorial/SRR030257.sorted.bam GVA_IGV_export/bowtie2.sorted.bam
cp -i $SCRATCH/GVA_samtools_tutorial/SRR030257.sorted.bam.bai GVA_IGV_export/bowtie2.sorted.bam.bai
tar -czvf GVA_IGV_export.tar.gz GVA_IGV_export
Now, copy the entire compressed IGV directory back to your local desktop machine. Remember that there is a separate tutorial that covers scp file transfers in more detail. In this case, you would replace README in that tutorial with GVA_IGV_export.tar.gz, and you need to determine the full path to that file using the pwd command.
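Before transferring, it can be worth confirming that the archive really contains everything you expect. The sketch below rebuilds a miniature version of the export directory in a temporary location, using empty placeholder files (an assumption for illustration, not the real data), and lists the archive contents with tar's -t option.

```shell
# Toy demonstration: empty placeholder files stand in for the real data.
tar_demo_dir=$(mktemp -d)
mkdir "$tar_demo_dir/GVA_IGV_export"
touch "$tar_demo_dir/GVA_IGV_export/NC_012967.1.fasta" \
      "$tar_demo_dir/GVA_IGV_export/bowtie2.sorted.bam"

# Create the compressed archive (-C changes into the directory first,
# so the archive stores relative paths).
tar -czf "$tar_demo_dir/GVA_IGV_export.tar.gz" -C "$tar_demo_dir" GVA_IGV_export

# -t lists the contents without extracting anything.
listing=$(tar -tzf "$tar_demo_dir/GVA_IGV_export.tar.gz")
echo "$listing"

rm -r "$tar_demo_dir"
```

On the real archive, the equivalent check is tar -tzf GVA_IGV_export.tar.gz, run from the directory where you created it. You should see all 6 files you copied into GVA_IGV_export.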