...
- Prepare the vibCho reference index for bowtie2 from GenBank records
- Align reads using bowtie2, producing a SAM file
- Convert the SAM file to a BAM file (samtools view)
- Sort the BAM file by genomic location (samtools sort)
- Index the BAM file (samtools index)
- Gather simple alignment statistics (samtools flagstat and samtools idxstat)
Obtaining the GenBank
...
records
First prepare a directory to work in, and change to it:
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/core_ngs/references/fasta/tmpvibCho cd $SCRATCH/core_ngs/references/fasta/tmpvibCho |
V. cholerae has two chromosomes. We download each separately.
- Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
- click on the Send to down arrow (top right of page)
- select Complete Record
- select File as Destination, and Format FASTA
- click Create File
- in the Opening File dialog, select Save File then OK
- Save the file on your local computer as NC_012582.fa
- click on the Send to down arrow (top right of page)
- Back on the main http://www.ncbi.nlm.nih.gov/nuccore/NC_012582 page
- click on the Send to down arrow (top right of page)
- select Complete Record
- select File as Destination, and Format GFF3
- click Create File
- in the Opening File dialog, select Save File then OK
- Save the file on your local computer as NC_012582.gff3
- click on the Send to down arrow (top right of page)
- Repeat steps 1 and 2 for the 2nd chromosome
- NCBI URL is http://www.ncbi.nlm.nih.gov/nuccore/NC_012583
- use NC_012583 as the filename prefix for the files you save
- you should now have 4 files:
- NC_012582.fa, NC_012582.gff3
- NC_012583.fa, NC_012583.gff3
- Transfer the files from your local computer to TACC
- to the ~/scratch/core_ngs/references/fasta/tmpvibCho directory created above
- On a Mac or Windows with a WSL shell, use scp from your laptop
- On Windows, use the pscp.exe PuTTy tool
- See Copying files between TACC and your laptop
- On a Mac or Windows with a WSL shell, use scp from your laptop
- to the ~/scratch/core_ngs/references/fasta/tmpvibCho directory created above
- Combine the 2 FASTQ files into one using cat
- cat NC_012582.fasta NC_012583.fa > vibCho.fa
Combine the 2 GFF3 files into one using cat
cat NC_012582 NC_012583 > vibCho.gbk
Expand | |||||
---|---|---|---|---|---|
| |||||
|
Once you have the 4 files locally in your $SCRATCH/core_ngs/references/vibCho directory, combine them using cat:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/references/vibCho
cat NC_01258[23].fa > vibCho.O395.fa
cat NC_01258[23].gff3 > vibCho.O395.gff3 |
Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against. Recall from when we viewed the GenBank file that there are genome annotations available as well that we would like to extract into GFF format. However, the bp_seqconvert.pl script is designed to be used to convert sequence formats, not annotation formats. Fortunately, there is another script called bp_genbank2gff3.pl that can take a GenBank file and produce a GFF3 (the most recent format convention for GFF files) file. To run it and see the output, run these commands
Introducing bowtie2
First make sure you're in an idev session:
Code Block | ||
---|---|---|
| ||
bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff |
After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information. Features are things like "mRNA," "CDS," and "EXON." As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information. You could parse these files further using commands like grep and awk to extract, say, all exons from the full file or to remove the header lines that begin with #.
...
| |||
idev -m 180 -p normal -A UT-2015-05-18 -N 1 -n 68
# Answer "y" to the question about using the BIO_DATA_week_1 reservation
# But if that hangs ("PD") waiting for a node:
# - Ctrl-C to cancel the idev request
# - execute the idev command above again
# - but answer "n" to the reservation question |
Go ahead and load the bowtie2 module so we can examine some help pages and options. To do that, you must first load the perl module, and then the a specific version of bowtie2. options.
Code Block | ||
---|---|---|
| ||
module loadbiocontainers perl module load bowtie/2.2.0 |
Now that it's loaded, check out the options. There are a lot of them! In fact for the full range of options and their meaning, Google "Bowtie2 manual" and bring up that page (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). The Table of Contents is several pages long! Ouch!
This is the key to using bowtie2 - it allows you to control almost everything about its behavior, but that which make it the go-to aligner for specialized alignment tasks (e.g. aligning miRNA or other small reads). But it also makes it is much more challenging to use than bwa – and it's easier to screw things up too!
...
Before the alignment, of course, we've got to build a mirbase bowtie2- specific index using bowtie2-build (go . Go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:
- reference_in file is just the FASTA file containing mirbase v20 sequences vibCho.O395.fa FASTA we built from GenBank records
- bt2_index_base is the prefix of where we want the files to go
...
- all the bowtie2-build output file
Here, to build the reference index for alignment, we actually only need the FASTA file, since annotations are often not necessary for alignment. (This is not always true - extensively spliced transcriptomes requires splice junction annotations to align RNA-seq data properly, but for now we will only use the FASTA file.).)
First create a directory specifically for the bowtie2 index, then build the index using bowtie-build.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $WORK$SCRATCH/core_ngs/references/bt2/vibCho mv $WORK/core_ngs/references/vibCho.O395.fa $WORKcd $SCRATCH/core_ngs/references/bt2/vibCho # Symlink to the fasta cd $WORK file you created ln -sf $SCRATCH/core_ngs/references/bt2/vibCho.O395.fa # or, to catch up: ln -s -f ../../fasta/sf $CORENGS/references/vibCho.O395.fa ls -la |
...
bowtie2-build |
...
Code Block | ||||
---|---|---|---|---|
| ||||
bowtie2-build vibCho.O395.fa vibCho.O395 |
This should also go pretty fast. You can see the resulting files using ls like before.
...