Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Prepare the vibCho reference index for bowtie2 from GenBank records
  2. Align reads using bowtie2, producing a SAM file
  3. Convert the SAM file to a BAM file (samtools view) 
  4. Sort the BAM file by genomic location (samtools sort)
  5. Index the BAM file (samtools index)
  6. Gather simple alignment statistics (samtools flagstat and samtools idxstat)

Obtaining the GenBank

...

records

First prepare a directory to work in, and change to it:

Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/references/fasta/tmpvibCho
cd $SCRATCH/core_ngs/references/fasta/tmpvibCho

V. cholerae has two chromosomes. We download each separately.

  1. Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
    • click on the Send to down arrow (top right of page)
      • select Complete Record
      • select File as Destination, and Format FASTA
      • click Create File
    • in the Opening File dialog, select Save File then OK
      • Save the file on your local computer as NC_012582.fa
  2. Back on the main http://www.ncbi.nlm.nih.gov/nuccore/NC_012582 page
    • click on the Send to down arrow (top right of page)
      • select Complete Record
      • select File as Destination, and Format GFF3
      • click Create File
    • in the Opening File dialog, select Save File then OK
      • Save the file on your local computer as NC_012582.gff3
  3. Repeat steps 1 and 2 for the 2nd chromosome
  4. Transfer the files from your local computer to TACC
    • to the ~/scratch/core_ngs/references/fasta/tmpvibCho directory created above
  5. Combine the 2 FASTQ files into one using cat
    • cat NC_012582.fasta  NC_012583.fa > vibCho.fa
  6. Combine the 2 GFF3 files into one using cat
    cat NC_012582  NC_012583 > vibCho.gbk


Expand
titleGet the 4 downloaded files here
Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/references/vibCho
cd $SCRATCH/core_ngs/references/vibCho
cp $CORENGS/ncbi/vibCho/NC_* .

Once you have the 4 files locally in your $SCRATCH/core_ngs/references/vibCho directory, combine them using cat:

Code Block
languagebash
cd $SCRATCH/core_ngs/references/vibCho
cat NC_01258[23].fa   > vibCho.O395.fa
cat NC_01258[23].gff3 > vibCho.O395.gff3

Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against. Recall from when we viewed the GenBank file that there are genome annotations available as well that we would like to extract into GFF format.  However, the bp_seqconvert.pl script is designed to be used to convert sequence formats, not annotation formats. Fortunately, there is another script called bp_genbank2gff3.pl that can take a GenBank file and produce a GFF3 (the most recent format convention for GFF files) file. To run it and see the output, run these commands

Introducing bowtie2

First make sure you're in an idev session:

Code Block
languagebash
bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff

After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information.  Features are things like "mRNA," "CDS," and "EXON."  As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information. You could parse these files further using commands like grep  and awk  to extract, say, all exons from the full file or to remove the header lines that begin with #.

...

titleStart an idev session
idev -m 180 -p normal -A UT-2015-05-18 -N 1 -n 68 
# Answer "y" to the question about using the BIO_DATA_week_1 reservation
# But if that hangs ("PD") waiting for a node:
#   - Ctrl-C to cancel the idev request
#   - execute the idev command above again
#   - but answer "n" to the reservation question

Go ahead and load the bowtie2 module so we can examine some help pages and options. To do that, you must first load the perl module, and then the a specific version of bowtie2options.

Code Block
languagebash
module loadbiocontainers
perl
module load bowtie/2.2.0

 Now that it's loaded, check out the options. There are a lot of them! In fact for the full range of options and their meaning, Google "Bowtie2 manual" and bring up that page (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). The Table of Contents is several pages long! Ouch!

This is the key to using bowtie2 - it allows you to control almost everything about its behavior, but that which make it the go-to aligner for specialized alignment tasks (e.g. aligning miRNA or other small reads). But it also makes it is much more challenging to use than bwa – and it's easier to screw things up too!

...

Before the alignment, of course, we've got to build a mirbase bowtie2- specific index using bowtie2-build (go . Go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:

  • reference_in file is just the FASTA file containing mirbase v20 sequences vibCho.O395.fa FASTA we built from GenBank records
  • bt2_index_base is the prefix of where we want the files to go

...

  • all the bowtie2-build output file

Here, to build the reference index for alignment, we actually only need the FASTA file, since annotations are often not necessary for alignment. (This is not always true - extensively spliced transcriptomes requires splice junction annotations to align RNA-seq data properly, but for now we will only use the FASTA file.).)

First create a directory specifically for the bowtie2 index, then build the index using bowtie-build.

Code Block
languagebash
titlePrepare Bowtie2 index files for vibCho
mkdir -p $WORK$SCRATCH/core_ngs/references/bt2/vibCho
mv $WORK/core_ngs/references/vibCho.O395.fa $WORKcd $SCRATCH/core_ngs/references/bt2/vibCho

# Symlink to the fasta
cd $WORK file you created
ln -sf $SCRATCH/core_ngs/references/bt2/vibCho.O395.fa
# or, to catch up:
ln -s -f ../../fasta/sf $CORENGS/references/vibCho.O395.fa
ls
-la

...

bowtie2-build

...

Code Block
languagebash
titlePrepare Bowtie2 index files
bowtie2-build vibCho.O395.fa vibCho.O395

This should also go pretty fast. You can see the resulting files using ls like before.

...