Page Comparison

...

Prepare the vibCho reference index for bowtie2 from GenBank records
Align reads using bowtie2, producing a SAM file
Convert the SAM file to a BAM file (samtools view)
Sort the BAM file by genomic location (samtools sort)
Index the BAM file (samtools index)
Gather simple alignment statistics (samtools flagstat and samtools idxstat)

Obtaining the GenBank

...

records

First prepare a directory to work in, and change to it:

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/references/fasta/tmpvibCho
cd $SCRATCH/core_ngs/references/fasta/tmpvibCho

V. cholerae has two chromosomes. We download each separately.

Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
- click on the Send to down arrow (top right of page)
  - select Complete Record
  - select File as Destination, and Format FASTA
  - click Create File
- in the Opening File dialog, select Save File then OK
  - Save the file on your local computer as NC_012582.fa
Back on the main http://www.ncbi.nlm.nih.gov/nuccore/NC_012582 page
- click on the Send to down arrow (top right of page)
  - select Complete Record
  - select File as Destination, and Format GFF3
  - click Create File
- in the Opening File dialog, select Save File then OK
  - Save the file on your local computer as NC_012582.gff3
Repeat steps 1 and 2 for the 2nd chromosome
- NCBI URL is http://www.ncbi.nlm.nih.gov/nuccore/NC_012583
- use NC_012583 as the filename prefix for the files you save
- you should now have 4 files:
  - NC_012582.fa, NC_012582.gff3
  - NC_012583.fa, NC_012583.gff3
Transfer the files from your local computer to TACC
- to the ~/scratch/core_ngs/references/fasta/tmpvibCho directory created above
  - On a Mac or Windows with a WSL shell, use scp from your laptop
  - On Windows, use the pscp.exe PuTTy tool
  - See Copying files between TACC and your laptop
Combine the 2 FASTQ files into one using cat
- cat NC_012582.fasta NC_012583.fa > vibCho.fa

Combine the 2

GFF3

files into one using

cat

cat NC_012582 NC_012583 > vibCho.gbk

Expand

title	Get the 4 downloaded files here

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/references/vibCho
cd $SCRATCH/core_ngs/references/vibCho
cp $CORENGS/ncbi/vibCho/NC_* .

Once you have the 4 files locally in your $SCRATCH/core_ngs/references/vibCho directory, combine them using cat:

Code Block

language	bash

cd $SCRATCH/core_ngs/references/vibCho
cat NC_01258[23].fa   > vibCho.O395.fa
cat NC_01258[23].gff3 > vibCho.O395.gff3

Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against. Recall from when we viewed the GenBank file that there are genome annotations available as well that we would like to extract into GFF format. However, the bp_seqconvert.pl script is designed to be used to convert sequence formats, not annotation formats. Fortunately, there is another script called bp_genbank2gff3.pl that can take a GenBank file and produce a GFF3 (the most recent format convention for GFF files) file. To run it and see the output, run these commands

Introducing bowtie2

First make sure you're in an idev session:

Code Block

language	bash

bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff

After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information. Features are things like "mRNA," "CDS," and "EXON." As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information. You could parse these files further using commands like grep and awk to extract, say, all exons from the full file or to remove the header lines that begin with #.

...


title	Start an idev session

idev -m 180 -p normal -A UT-2015-05-18 -N 1 -n 68 
# Answer "y" to the question about using the BIO_DATA_week_1 reservation
# But if that hangs ("PD") waiting for a node:
#   - Ctrl-C to cancel the idev request
#   - execute the idev command above again
#   - but answer "n" to the reservation question

Go ahead and load the bowtie2 module so we can examine some help pages and options. To do that, you must first load the perl module, and then the a specific version of bowtie2. options.

Code Block

language	bash

module loadbiocontainers
perl
module load bowtie/2.2.0

Now that it's loaded, check out the options. There are a lot of them! In fact for the full range of options and their meaning, Google "Bowtie2 manual" and bring up that page (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). The Table of Contents is several pages long! Ouch!

This is the key to using bowtie2 - it allows you to control almost everything about its behavior, but that which make it the go-to aligner for specialized alignment tasks (e.g. aligning miRNA or other small reads). But it also makes it is much more challenging to use than bwa – and it's easier to screw things up too!

...

Before the alignment, of course, we've got to build a mirbase bowtie2- specific index using bowtie2-build (go . Go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:

reference_in file is just the FASTA file containing mirbase v20 sequences vibCho.O395.fa FASTA we built from GenBank records
bt2_index_base is the prefix of where we want the files to go

...

all the bowtie2-build output file

Here, to build the reference index for alignment, we actually only need the FASTA file, since annotations are often not necessary for alignment. (This is not always true - extensively spliced transcriptomes requires splice junction annotations to align RNA-seq data properly, but for now we will only use the FASTA file.).)

First create a directory specifically for the bowtie2 index, then build the index using bowtie-build.

Code Block

language	bash
title	Prepare Bowtie2 index files for vibCho

mkdir -p $WORK$SCRATCH/core_ngs/references/bt2/vibCho
mv $WORK/core_ngs/references/vibCho.O395.fa $WORKcd $SCRATCH/core_ngs/references/bt2/vibCho

# Symlink to the fasta
cd $WORK file you created
ln -sf $SCRATCH/core_ngs/references/bt2/vibCho.O395.fa
# or, to catch up:
ln -s -f ../../fasta/sf $CORENGS/references/vibCho.O395.fa
ls
-la

...

bowtie2-build

...

Code Block

language	bash
title	Prepare Bowtie2 index files

bowtie2-build vibCho.O395.fa vibCho.O395

This should also go pretty fast. You can see the resulting files using ls like before.

...

Versions Compared

Old Version 5

New Version 6

Key

Obtaining the GenBank

records

Introducing bowtie2