Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

SampleConditionReplicateSequencing RunsData Files
MURI_174 hr1SA13172MURI_17_SA13172_ATGTCA_L007
MURI_264 hr2SA14027MURI_26_SA14027_TTAGGC_L006
MURI_984 hr3SA14008MURI_98_SA14008_TTAGGC_L005, MURI_98_SA14008_TTAGGC_L006
MURI_2124 hr1SA13172MURI_21_SA13172_GTGGCC_L007
MURI_3024 hr2SA14027MURI_30_SA14027_CAGATC_L006
MURI_10224 hr3SA14008, SA14032MURI_102_SA14008_CAGATC_L005, MURI_102_SA14008_CAGATC_L006, MURI_102_SA14032_CAGATC_L006

...

In class demo

In class, we will explore and characterize the raw data.  

Expand

Log in to your appsoma.com account

Select the "Code" tab if you are not already there.

Select "Biolinux-03" from the drop-down menu to the right of the RUN button

Select "Shell"

 

Now, within the shell window, use some of the linux commands you've learned to move your self into a working directory (called "scratch") and link to the data:

Code Block
cd scratch
mkdir rawdata
cd rawdata;
ln -s /home/scott/e0/* . ;
cd ..;
mkdir finaldata;
cd finaldata;
ln -s /home/scott/e3/* .;
cd ..;

 

 

Here are some elements (programs & techniques) we may use (you will need some of these for the homework):

...

Expand
titleMapping data to a reference genome...

Mappers/aligners work by first creating a compact index of the reference genome.

Expand
titleIndexing a genome...
Code Block
bwa index REL606.fna

 

Then, you tell the mapper to map your raw data (the fastq.gz files) to the indexed reference genome.

Expand
titleMapping raw sequence data to an indexed genome...
Code Block
bwa mem -t 16 REL606.fna MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz MURI_21_SA13172_GTGGCC_L007_R2_001.fastq.gz | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam

 

But in the true spirit of linux applications, this is only one modular step of the whole process. The output of the mapper is in text form, not binary, so it's big and slow to access. It's also in the order of the raw reads, not the genome, so accessing a genomic location is really slow. And we don't have any summary data about how the mapping process went yet (except for the log created during mapping).

So there are a series of common commands to post-process a run.

Code Block
samtools sort MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted
samtools index MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam
samtools flagstat MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.flagstat
samtools idxstats MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.idxstats

 

And, just for fun, let's ask these tools to look for SNPs:

Code Block
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf &

(Note that there are LOTS of programs for finding SNPs... this happens to be a pretty good one that uses Bayesian statistics and is fast.)

 

Expand
titleOf course, a good coder would wrap all this into an executable bash file for re-use...
Code Block
#!/bin/bash
# $1 = input file base name fwd, 
# $2 = input file base name rev,
# $3 = reference file, already bwa indexed
echo "$0: Starting on `date`"
echo "BWA mem"
bwa mem -t 16 $3 $1 $2 | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > $1.bam
# echo "Samtools view: convert to binary"
# samtools view -b -S $1.sam > $1.bam
# rm $1.sam
echo "Samtools sort: sorting bam file"
samtools sort $1.bam $1.sorted
# rm $1.bam
echo "Samtools index: creating index of sorted bam file"
samtools index $1.sorted.bam
samtools flagstat $1.sorted.bam > $1.sorted.bam.flagstat
samtools idxstats $1.sorted.bam > $1.sorted.bam.idxstats
# echo "Samtools mpileup and bcftools"
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf &
echo "$0: Done on `date`"

 

 

 

 

Expand
titleAssessing the read mapping...

Go take a look at some of the .flagstat files!

...

Expand
titleAnalyze the count data

Now, we'll switch from running bash commands to running commands within the R statistical package.

Move into the "finaldata" directory and start R like this:

Code Block
cd finaldata
R

You should now see a ">" prompt instead of your linux prompt, telling you that you are now in an R shell, not a bash shell. (You type "q()" to exit the R shell).

Load some libraries and the raw data and do some basic transforms of the raw data to get it ready for analysis in R

Code Block
library(ggbiplot)
wall<-read.table(file="all3x3.counts",sep="\t",header=FALSE); load the raw data into variable "wall"
wallt<-t(wall[,2:7]); # This just transposes the read count data into a new variable, wallt

 

Check Principle Components Analysis (PCA)

Code Block
wallt.pca<-prcomp(wallt)
print(ggbiplot(wallt.pca, groups=c("4hr","4hr","4hr","24hr","24hr","24hr"), ellipse = TRUE, circle=TRUE, obs.scale = 1, var.scale = 1, var.axes=FALSE))

To view the plot you just created, go to the "Data" tab in Appsoma, navigate to your scratch/finaldata area and download the Rplots.pdf file.

Check a box plot

Code Block
boxplot(wall)

That's not very interesting or useful - we're plotting gene expression data on a linear scale! Let's go to log scale, fixing some issues with the raw data that would throw off the log calculation:

Code Block
boxplot(log(wall[,2:7]))

 

 

...

Homework

For your homework, you will investigate the validity of combining data files from different sequencing runs.  Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.

...