...
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/core_ngs/references/fasta/tmp cd $SCRATCH/core_ngs/references/fasta/tmp |
V. cholerae has two chromosomes. We download each separately.
- Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
- on the top left part of the page, click on the down arrow following the word GenBank
- select FASTA
- select Complete Record
- select File as Destination, and Format FASTA
- click Create File
- in the Opening File dialog, select Save File then OK
- Save the file on your local computer as NC012582.fasta
- on the top left part of the page, click on the down arrow following the word GenBank
- Repeat steps 1 and 2 fot the 2nd chromosomeNCBI URL is Back on the main http://www.ncbi.nlm.nih.gov/nuccore/NC_012583you should now have 2 files, NC_012582.fasta and NC_012583.fasta
- Combine the 2 files into one using cat
- cat NC_012582 NC_012583 > vibCho.gbk
Converting GenBank records into sequence (FASTA) and annotation (GFF) files
As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.
Go ahead and look at some of the contents of a GenBank file with the following commands (execute these one at a time):
Code Block | ||
---|---|---|
| ||
cd $WORK/core_ngs/references
less vibCho.O395.gbk # use q to quit less
grep -A 5 ORIGIN vibCho.O395.gbk |
As the less command shows, the file begins with a description of the organism and some source information, and the contains annotations for each bacterial gene. The grep command shows that, indeed, there is sequence information here (flagged by the word ORIGIN) that could be exported into a FASTA file. There are a couple ways of extracting the information we want, namely the reference genome and the gene annotation information, but a convenient one (that is available through the module system at TACC) is BioPerl.
We load BioPerl like we have loaded other modules, with the caveat that we must load regular Perl before loading BioPerl:
Code Block | ||
---|---|---|
| ||
module load perl
module load bioperl |
These commands make several scripts directly available to you. The one we will use is called bp_seqconvert.pl, and it is a BioPerl script used to inter-convert file formats like FASTA, GBK, and others. This script produces two output files:
- a FASTA format file for indexing and alignment
- a GFF file (standing for General Feature Format) contains information about all genes (or, more generally, features) in the genome
- remember, annotations such as GFFs must always match the reference you are using
To see how to use the script, just execute:
Code Block | ||
---|---|---|
| ||
bp_seqconvert.pl |
Clearly, there are many file formats that we can use this script to convert. In our case, we are moving from genbank to fasta, so the commands we would execute to produce and view the FASTA files would look like this:
...
language | bash |
---|
...
- page
- click on the Send to down arrow (top right of page)
- select Complete Record
- select File as Destination, and Format GFF3
- click Create File
- in the Opening File dialog, select Save File then OK
- Save the file on your local computer as NC012582.gff3
- click on the Send to down arrow (top right of page)
- Repeat steps 1 and 2 for the 2nd chromosome
- NCBI URL is http://www.ncbi.nlm.nih.gov/nuccore/NC_012583
- use NC012583 as the filename prefix for the files you save
- you should now have 4 files:
- NC_012582.fasta, NC_012583.fasta,
- NC_012582.gff3, NC_012583.gff3
- Transfer the files from your local computer to TACC
- On a Mac or Windows with a WSL shell, use scp
- On Windows, use the pscp.exe PuTTy tool
- See
- Combine the 2 FASTQ files into one using cat
- cat NC_012582.fasta NC_012583.fasta > vibCho.fasta
- Combine the 2 GFF3 files into one using cat
- cat NC_012582 NC_012583 > vibCho.gbk
Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against.
...
When the job is complete you should have a cholera_rnaseq.sam file that you can examine using whatever commands you like.
Exercise
...
#5: Bowtie2 local alignment - Human microRNA-seq
Now we're going to switch over to a different aligner that was originally designed for very short reads and is frequently used for RNA-seq data. Accordingly, we have prepared another test microRNA-seq dataset for you to experiment with (not the same one you used cutadapt on). This data is derived from a human H1 embryonic stem cell (H1-hESC) small RNA dataset generated by the ENCODE Consortium – its about a half million reads.
...
Expand | ||
---|---|---|
| ||
We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align. The proper solution will await the use of samtools to filter out unmapped reads. |
Exercise
...
#6: BWA-MEM - Human mRNA-seq
After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:
...
Now, try aligning it with bwa aln like we did in Example #1, but first link to the hg19 bwa index directory. In this case, due to the size of the hg19 index, we are linking to Anna's scratch area INSTEAD of our own work area containing indexes that we built ourselves.
...