Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/references/fasta/tmp
cd $SCRATCH/core_ngs/references/fasta/tmp

V. cholerae has two chromosomes. We download each separately.

  1. Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
    • on the top left part of the page, click on the down arrow following the word GenBank
      • select FASTA
      on the FASTA page, click on the Send to down arrow (top right of page)
      • select Complete Record
      • select File as Destination, and Format FASTA
      • click Create File
    • in the Opening File dialog, select Save File then OK
      • Save the file on your local computer as NC012582.fasta
  2. Repeat steps 1 and 2 fot the 2nd chromosomeNCBI URL is Back on the main http://www.ncbi.nlm.nih.gov/nuccore/NC_012583you should now have 2 files, NC_012582.fasta  and NC_012583.fasta
  3. Combine the 2 files into one using cat
    • cat NC_012582  NC_012583 > vibCho.gbk

Converting GenBank records into sequence (FASTA) and annotation (GFF) files

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.

Go ahead and look at some of the contents of a GenBank file with the following commands (execute these one at a time):

Code Block
languagebash
cd $WORK/core_ngs/references
less vibCho.O395.gbk # use q to quit less
grep -A 5 ORIGIN vibCho.O395.gbk

As the less command shows, the file begins with a description of the organism and some source information, and the contains annotations for each bacterial gene. The grep command shows that, indeed, there is sequence information here (flagged by the word ORIGIN) that could be exported into a FASTA file. There are a couple ways of extracting the information we want, namely the reference genome and the gene annotation information, but a convenient one (that is available through the module system at TACC) is BioPerl.

We load BioPerl like we have loaded other modules, with the caveat that we must load regular Perl before loading BioPerl:

Code Block
languagebash
module load perl
module load bioperl

These commands make several scripts directly available to you. The one we will use is called bp_seqconvert.pl, and it is a BioPerl script used to inter-convert file formats like FASTA, GBK, and others. This script produces two output files:

  • a FASTA format file for indexing and alignment
  • GFF file (standing for General Feature Format) contains information about all genes (or, more generally, features) in the genome
    • remember, annotations such as GFFs must always match the reference you are using

To see how to use the script, just execute:

Code Block
languagebash
bp_seqconvert.pl

Clearly, there are many file formats that we can use this script to convert.  In our case, we are moving from genbank to fasta, so the commands we would execute to produce and view the FASTA files would look like this:

...

languagebash

...

  1. page
    • click on the Send to down arrow (top right of page)
      • select Complete Record
      • select File as Destination, and Format GFF3
      • click Create File
    • in the Opening File dialog, select Save File then OK
      • Save the file on your local computer as NC012582.gff3
  2. Repeat steps 1 and 2 for the 2nd chromosome
  3. Transfer the files from your local computer to TACC
    • On a Mac or Windows with a WSL shell, use scp
    • On Windows, use the pscp.exe PuTTy tool
    • See
  4. Combine the 2 FASTQ files into one using cat
    • cat NC_012582.fasta  NC_012583.fasta > vibCho.fasta
  5. Combine the 2 GFF3 files into one using cat
    • cat NC_012582  NC_012583 > vibCho.gbk


Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against.

...

When the job is complete you should have a cholera_rnaseq.sam file that you can examine using whatever commands you like.

Exercise

...

#5: Bowtie2 local alignment - Human microRNA-seq

Now we're going to switch over to a different aligner that was originally designed for very short reads and is frequently used for RNA-seq data. Accordingly, we have prepared another test microRNA-seq dataset for you to experiment with (not the same one you used cutadapt on). This data is derived from a human H1 embryonic stem cell (H1-hESC) small RNA dataset generated by the ENCODE Consortium – its about a half million reads.

...

Expand
titleAnswer

We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align.

The proper solution will await the use of samtools to filter out unmapped reads.

Exercise

...

#6: BWA-MEM - Human mRNA-seq

After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:

...

Now, try aligning it with bwa aln like we did in Example #1, but first link to the hg19 bwa index directory.  In this case, due to the size of the hg19 index, we are linking to Anna's scratch area INSTEAD of our own work area containing indexes that we built ourselves.

...