This brief tutorial will walk you through data analysis of an RNA-seq experiment.
In this experiment, E. coli was inoculated into culture and the culture was then sampled at 4 hours and 24 hours post inoculation. The experiment was run in triplicate.
RNA was extracted from the 6 samples, fragmented, and sequenced. All sequencing runs were of the paired-end 2x100 type, so each RNA fragment is read from both ends, 100 bp from each end.
Here is a table showing the data we have:
Sample | Condition | Replicate | Sequencing Runs | Data Files |
---|
MURI_17 | 4 hr | 1 | SA13172 | MURI_17_SA13172_ATGTCA_L007 |
MURI_26 | 4 hr | 2 | SA14027 | MURI_26_SA14027_TTAGGC_L006 |
MURI_98 | 4 hr | 3 | SA14008 | MURI_98_SA14008_TTAGGC_L005, MURI_98_SA14008_TTAGGC_L006 |
MURI_21 | 24 hr | 1 | SA13172 | MURI_21_SA13172_GTGGCC_L007 |
MURI_30 | 24 hr | 2 | SA14027 | MURI_30_SA14027_CAGATC_L006 |
MURI_102 | 24 hr | 3 | SA14008, SA14032 | MURI_102_SA14008_CAGATC_L005, MURI_102_SA14008_CAGATC_L006, MURI_102_SA14032_CAGATC_L006 |
In class demo
In class, we will explore and characterize the raw data.
This is the first part of your homework due 11/25...
Log in to your appsoma.com account
Select the "Code" tab if you are not already there.
Select "Biolinux-03" from the drop-down menu to the right of the RUN button
Select "Shell"
Now, within the shell window, use some of the linux commands you've learned to move your self into a working directory (called "scratch") and link to the data:
cd scratch
mkdir rawdata
cd rawdata;
ln -s /home/scott/e0/* . ;
cd ..;
mkdir finaldata;
cd finaldata;
ln -s /home/scott/e3/* .;
cd ..;
Here are some elements (programs & techniques) we may use (you will need some of these for the homework):
Assessing the raw data...
Is this E. coli?
One way to do this...
Get some sequences - get rid of any that show the Illumina adaptor "GATCGGAAGAGCACAC", and look for sequences that are abundant (the sort | uniq | sort trick):
gunzip -c MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz | egrep [ACGT]\{101\} | grep -v GATCGGAAGAGCACAC | head -10000 | sort | uniq -c | sort -n -r | head
Now blast a few of those against NT at NCBI's blast site and convince yourself your culture was not contaminated!
Assessing the quality of the raw data:
One way to do this...
FastQC is a handy tool to look at aggregate measures of sequence quality, though it has some significant caveats, especially for NGS data.
Here is how you run it:
fastqc MURI_102_SA14008_CAGATC_L005_R1_001.fastq.gz
But analyzing the entire file has two problems: 1) it's really slow and 2) the answer you get will depend on the size of the file.
So a better way is to take a uniform and small-ish amount of data and assess with fastqc:
gunzip -c MURI_102_SA14008_CAGATC_L006_R1_001.fastq.gz | head -400000 > r1.fq
fastqc r1.fq
Then look in the "r1.fq_fastqc" directory to explore the results.
Mapping data to a reference genome...
Mappers/aligners work by first creating a compact index of the reference genome.
Then, you tell the mapper to map your raw data (the fastq.gz files) to the indexed reference genome.
Mapping raw sequence data to an indexed genome...
bwa mem -t 16 REL606.fna MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz MURI_21_SA13172_GTGGCC_L007_R2_001.fastq.gz | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam
But in the true spirit of linux applications, this is only one modular step of the whole process. The output of the mapper is in text form, not binary, so it's big and slow to access. It's also in the order of the raw reads, not the genome, so accessing a genomic location is really slow. And we don't have any summary data about how the mapping process went yet (except for the log created during mapping).
So there are a series of common commands to post-process a run.
samtools sort MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted
samtools index MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam
samtools flagstat MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.flagstat
samtools idxstats MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.idxstats
And, just for fun, let's ask these tools to look for SNPs:
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf &
(Note that there are LOTS of programs for finding SNPs... this happens to be a pretty good one that uses Bayesian statistics and is fast.)
Of course, a good coder would wrap all this into an executable bash file for re-use...
#!/bin/bash
# $1 = input file base name fwd,
# $2 = input file base name rev,
# $3 = reference file, already bwa indexed
echo "$0: Starting on `date`"
echo "BWA mem"
bwa mem -t 16 $3 $1 $2 | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > $1.bam
# echo "Samtools view: convert to binary"
# samtools view -b -S $1.sam > $1.bam
# rm $1.sam
echo "Samtools sort: sorting bam file"
samtools sort $1.bam $1.sorted
# rm $1.bam
echo "Samtools index: creating index of sorted bam file"
samtools index $1.sorted.bam
samtools flagstat $1.sorted.bam > $1.sorted.bam.flagstat
samtools idxstats $1.sorted.bam > $1.sorted.bam.idxstats
# echo "Samtools mpileup and bcftools"
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf &
echo "$0: Done on `date`"
Assessing the read mapping...
Go take a look at some of the .flagstat files!
Visualize the mapped data...
The sorted.bam and sorted.bam.bai files can be used (along with the reference of course) with a mapping visualization program. There are several, but I'll be showing you the Broad Institute's Integrative Genomics Viewer (IGV).
Steps (I'm showing you how to do this from scratch on your own LOCAL computer - there are other ways...):
- Download and install IGV according to their instructions.
- Get all the .sorted.bam and .sorted.bam.bai files, along with the REL606.fna and REL606.gff reference genome information from biolinux-03 over to your local computer. For our course, I've posted these here at TACC.
- Start IGV
- Select, "Genomes" -> "Create .genome file" and pass it the REL606.fna file and the REL606.gff file as prompted.
- Select, "File" -> "Load from file" and select the .sorted.bam files (IGV will automatically use the index .bai files but they MUST be there).
- Surf!
Counting reads per gene... (also called "count" data)
OK, now we have reads aligned to a genome. How can we tell which genes those reads belonged to?
We need to use the gene annotations for that genome of course! That's the REL606.gff file.
One way to do this is to use a program which simply intersects features in the gene annotation file (i.e. genes) with the piled-up reads. We'll use bedtools to do this.
bedtools coverage -abam MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam -b REL606.gff > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.bedtools.out
OK, but I'm lazy - I want the computer to do this on all 6 files please...
for files in `ls *.sorted.bam`; do bedtools coverage -abam $files -b REL606.gff > $files.bedtools.out; done
That's pretty good, but that output file isn't really what we want for concise gene expression measures. We'd like to skinny it down - remove columns we don't need and only look at CDS elements:
for file in `ls *.bedtools.out`; do cat $file | grep CDS | awk 'BEGIN {FS="\t"} {print $10}' > $file.counts ; done
And now I'd like to just have one table (Excel!!) with all the gene expression values. Here are some tricks to do that pretty efficiently:
cat MURI_17_SA13172_ATGTCA_L007_R1_001.fastq.gz.sorted.bam.bedtools.out | grep CDS | cut -f 9 | awk 'BEGIN {FS=";"} {print $3":"$4}' > genes.txt
paste genes.txt MURI_17*counts MURI_26*counts MURI_98*counts MURI_21*counts MURI_30*counts MURI_102*counts > all3x3.counts
Analyze the count data
Now, we'll switch from running bash commands to running commands within the R statistical package.
Move into the "finaldata" directory and start R like this:
You should now see a ">" prompt instead of your linux prompt, telling you that you are now in an R shell, not a bash shell. (You type "q()" to exit the R shell).
Load some libraries and the raw data and do some basic transforms of the raw data to get it ready for analysis in R
library(ggbiplot)
wall<-read.table(file="all3x3.counts",sep="\t",header=FALSE); load the raw data into variable "wall"
wallt<-t(wall[,2:7]); # This just transposes the read count data into a new variable, wallt
Check Principle Components Analysis (PCA)
wallt.pca<-prcomp(wallt)
print(ggbiplot(wallt.pca, groups=c("4hr","4hr","4hr","24hr","24hr","24hr"), ellipse = TRUE, circle=TRUE, obs.scale = 1, var.scale = 1, var.axes=FALSE))
To view the plot you just created, go to the "Data" tab in Appsoma, navigate to your scratch/finaldata area and download the Rplots.pdf file.
Check a box plot
That's not very interesting or useful - we're plotting gene expression data on a linear scale! Let's go to log scale, fixing some issues with the raw data that would throw off the log calculation:
Homework
For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.
Based on what you learned about the T-test (that is, using terms associated with a T-test), explain what criteria you might use to consider it "invalid" to combine the multiple raw sequence data files from samples. (5 points)Outline the steps needed to reduce the raw data to numbers suitable for evaluation of your criteria in question #1. (5 points)Perform the steps you outlined in #2 and tell whether or not it was valid to combine the data files. (20 points)Starting with the raw "count" data, explore the effect on PCA of NOT normalizing. Turn in a print out of the new PCA plot. (10 points)Although we did not explore this in class, DNA mutations were automatically tallied during our mapping process. These results are in the files ending in ".bcf". Using the tool "bedtools" to view these results, test the hypothesis that transitions are more common than transversions. Support your answer with data from this experiment. (20 points)Continuing with the mutation analysis, examine whether the mutation frequency in this sample set differs between protein coding and non-protein coding regions of the genome. Support your answer with data from this experiment. (30 points).
Email your answers/PDFs to shunickesmith <at> gmail.com, cc: Prof. Matouschek no later than TBD, 10:00 am (BETTER: before Thanksgiving break).
Homework - Revised 9:00 pm Monday 11/24/14 Revised 5:00 pm Thursday 11/20/14
For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.
Before you begin, pull up this web site in a new window - it's an all-class, live group chat to which you can post questions, get answers and even answer questions of your fellow students. Dr. Hunicke-Smith and Benni will be monitoring it periodically (largely during the daytime and early evenings).
By next Tuesday, 12/2 11/25, 10:00 am do the following:
- Follow the steps listed above on this web page under the expanding section "This is your homework due 11/25..." to log into appsoma and setup access to the data you will need from here on.
- Move into the "rawdata" directory, find the first four lines of the read 1 sequence file for the MURI 102 sample from lane 6 of sequencing run SA14008 - put them into a new file in that directory called "s1.fq" and copy it into an email.
- Move into the "finaldata" directory and make sure you can see the gene expression data file "all3x3.counts"
- Using the linux "sort" command, sort all3x3.counts 6 times, sorting on the expression values of each of the 6 samples separately from lowest to highest, redirecting the output of each sort into a separate file.
- Using the linux command "tail -1" on each of these 6 files, copy the name of the most abundant gene from each sample into the same email.
Remember - use Etherpad to ask questions and get answers! Email your answers to shunickesmith <at> gmail.com, cc: Prof. Matouschek no later than 11/25, 10:00 am.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.