Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleVisualize the mapped data...

The sorted.bam and sorted.bam.bai files can be used (along with the reference of course) with a mapping visualization program. There are several, but I'll be showing you the Broad Institute's Integrative Genomics Viewer (IGV)

Steps (I'm showing you how to do this from scratch on your own LOCAL computer - there are other ways...):

  1. Download and install IGV according to their instructions.
  2. Get all the .sorted.bam and .sorted.bam.bai files, along with the REL606.fna and REL606.gff reference genome information from biolinux-03 over to your local computer. For our course, I've posted these here at TACC.
  3. Start IGV
  4. Select, "Genomes" -> "Create .genome file" and pass it the REL606.fna file and the REL606.gff file as prompted.
  5. Select, "File" -> "Load from file" and select the .sorted.bam files (IGV will automatically use the index .bai files but they MUST be there).
  6. Surf!
Expand
titleCounting reads per gene... (also called "count" data)

OK, now we have reads aligned to a genome. How can we tell which genes those reads belonged to?

We need to use the gene annotations for that genome of course! That's the REL606.gff file.

One way to do this is to use a program which simply intersects features in the gene annotation file (i.e. genes) with the piled-up reads. We'll use bedtools to do this.

Code Block
bedtools coverage -abam  MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam -b REL606.gff > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.bedtools.out

OK, but I'm lazy - I want the computer to do this on all 6 files please...

Code Block
for files in `ls *.sorted.bam`; do bedtools coverage -abam $files  -b REL606.gff > $files.bedtools.out; done

 

That's pretty good, but that output file isn't really what we want for concise gene expression measures. We'd like to skinny it down - remove columns we don't need and only look at CDS elements:

Code Block
for file in `ls *.bedtools.out`; do cat $file | grep CDS | awk 'BEGIN {FS="\t"} {print $10}' > $file.counts ; done

 

And now I'd like to just have one table (Excel!!) with all the gene expression values. Here are some tricks to do that pretty efficiently:

Code Block
cat MURI_17_SA13172_ATGTCA_L007_R1_001.fastq.gz.sorted.bam.bedtools.out | grep CDS | cut -f 9 | awk 'BEGIN {FS=";"} {print $3":"$4}' > genes.txt
paste genes.txt MURI_17*counts MURI_26*counts MURI_98*counts MURI_21*counts MURI_30*counts MURI_102*counts > all3x3.counts
Normalize (only for mapped reads in this case)
Expand
titleAnalyze the count data

Now, we'll switch from running bash commands to running commands within the R statistical package.

Move into the "finaldata" directory and start R like this:

Code Block
cd finaldata
R

You should now see a ">" prompt instead of your linux prompt, telling you that you are now in an R shell, not a bash shell. (You type "q()" to exit the R shell).

Load some libraries and the raw data and do some basic transforms of the raw data to get it ready for analysis in R

Code Block
library(ggbiplot)
wall<-read.table(file="all3x3.counts",sep="\t",header=FALSE); load the raw data into variable "wall"
wallt<-t(wall[,2:7]); # This just transposes the read count data into a new variable, wallt

 

Check Principle Components Analysis (PCA)

Code Block
wallt.pca<-prcomp(wallt)
print(ggbiplot(wallt.pca, groups=c("4hr","4hr","4hr","24hr","24hr","24hr"), ellipse = TRUE, circle=TRUE, obs.scale = 1, var.scale = 1, var.axes=FALSE))

To view the plot you just created, go to the "Data" tab in Appsoma, navigate to your scratch/finaldata area and box plots

Calculate fold-change

Log transform

Volcano plotdownload the Rplots.pdf file.

Check a box plot

Code Block
boxplot(wall)

That's not very interesting or useful - we're plotting gene expression data on a linear scale! Let's go to log scale, fixing some issues with the raw data that would throw off the log calculation:

Code Block
boxplot(log(wall[,2:7]))

 

 

 

For your homework, you will investigate the validity of combining data files from different sequencing runs.  Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.

...