...
In class, we will explore and characterize the raw data. Here
Expand |
---|
Log in to your appsoma.com account Select the "Code" tab if you are not already there. Select "Biolinux-03" from the drop-down menu to the right of the RUN button Select "Shell" Now, within the shell window, use some of the linux commands you've learned to move your self into a working directory (called "scratch") and link to the data: Code Block |
---|
cd scratch
mkdir rawdata
cd rawdata;
ln -s /home/scott/e0/* . ;
cd ..;
mkdir finaldata;
cd finaldata;
ln -s /home/scott/e3/* .;
cd ..; |
|
Here are some elements (programs & techniques) we may use (you will need some of these for the homework):
Expand |
---|
title | Assessing the raw data... |
---|
|
Is this E. coli? Is this E. coli RNA? Does this look like RNA-seq data? Expand |
---|
title | One way to do this... |
---|
| Get some sequences - get rid of any that show the Illumina adaptor "GATCGGAAGAGCACAC", and look for sequences that are abundant (the sort | uniq | sort trick): Code Block |
---|
gunzip -c MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz | egrep [ACGT]\{101\} | grep -v GATCGGAAGAGCACAC | head -10000 | sort | uniq -c | sort -n -r | head |
Now blast a few of those against NT at NCBI's blast site and convince yourself your culture was not contaminated! |
Assessing the quality of the raw data: Expand |
---|
title | One way to do this... |
---|
| FastQC is a handy tool to look at aggregate measures of sequence quality, though it has some significant caveats, especially for NGS data. Here is how you run it: Code Block |
---|
fastqc MURI_102_SA14008_CAGATC_L005_R1_001.fastq.gz |
But analyzing the entire file has two problems: 1) it's really slow and 2) the answer you get will depend on the size of the file. So a better way is to take a uniform and small-ish amount of data and assess with fastqc: Code Block |
---|
gunzip -c MURI_102_SA14008_CAGATC_L006_R1_001.fastq.gz | head -400000 > r1.fq
fastqc r1.fq |
Then look in the "r1.fq_fastqc" directory to explore the results. |
|
Expand |
---|
title | Mapping data to a reference genome... |
---|
|
To map a pair of data files to the reference E. coli genome: Mappers/aligners work by first creating a compact index of the reference genome. Expand |
---|
title | Indexing a genome... |
---|
| Code Block |
---|
bwa index REL606.fna |
|
Then, you tell the mapper to map your raw data (the fastq.gz files) to the indexed reference genome. Expand |
---|
title | Mapping raw sequence data to an indexed genome... |
---|
| Code Block |
---|
bwa mem -t 16 REL606.fna MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz MURI_21_SA13172_GTGGCC_L007_R2_001.fastq.gz | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam |
|
But in the true spirit of linux applications, this is only one modular step of the whole process. The output of the mapper is in text form, not binary, so it's big and slow to access. It's also in the order of the raw reads, not the genome, so accessing a genomic location is really slow. And we don't have any summary data about how the mapping process went yet (except for the log created during mapping). So there are a series of common commands to post-process a run. Code Block |
---|
samtools sort MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.bam MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted
samtools index MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam
samtools flagstat MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.flagstat
samtools idxstats MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam > MURI_21_SA13172_GTGGCC_L007_R1_001.fastq.gz.sorted.bam.idxstats |
And, just for fun, let's ask these tools to look for SNPs: Code Block |
---|
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf & |
(Note that there are LOTS of programs for finding SNPs... this happens to be a pretty good one that uses Bayesian statistics and is fast.) Expand |
---|
title | Of course, a good coder would wrap all this into an executable bash file for re-use... |
---|
| Code Block |
---|
#!/bin/bash
# $1 = input file base name fwd,
# $2 = input file base name rev,
# $3 = reference file, already bwa indexed
echo "$0: Starting on `date`"
echo "BWA mem"
bwa mem -t 16 $3 $1 $2 | awk '{if ($3!="*") {print $0}}' | samtools view -b -S - > $1.bam
# echo "Samtools view: convert to binary"
# samtools view -b -S $1.sam > $1.bam
# rm $1.sam
echo "Samtools sort: sorting bam file"
samtools sort $1.bam $1.sorted
# rm $1.bam
echo "Samtools index: creating index of sorted bam file"
samtools index $1.sorted.bam
samtools flagstat $1.sorted.bam > $1.sorted.bam.flagstat
samtools idxstats $1.sorted.bam > $1.sorted.bam.idxstats
# echo "Samtools mpileup and bcftools"
samtools mpileup -Euf $3 $1.sorted.bam | bcftools view -bvcg - > $1.sorted.bam.raw.bcf &
echo "$0: Done on `date`" |
|
|
Expand |
---|
title | Assessing the read mapping... |
---|
|
Assessing the read-mapping: Go take a look at some of the .flagstat files! |
Expand |
---|
title | Visualize the mapped data... |
---|
|
The sorted.bam and sorted.bam.bai files can be used (along with the reference of course) with a mapping visualization program. There are several, but I'll be showing you the Broad Institute's Integrative Genomics Viewer (IGV). Steps (I'm showing you how to do this from scratch on your own LOCAL computer - there are other ways...): - Download and install IGV according to their instructions.
- Get all the .sorted.bam and .sorted.bam.bai files, along with the REL606.fna and REL606.gff reference genome information from biolinux-03 over to your local computer. For our course, I've posted these here at TACC.
- Start IGV
- Select, "Genomes" -> "Create .genome file" and pass it the REL606.fna file and the REL606.gff file as prompted.
- Select, "File" -> "Load from file" and select the .sorted.bam files (IGV will automatically use the index .bai files but they MUST be there).
- Surf!
|
Expand |
---|
title | Counting reads per gene... (also called "count" data) |
---|
| Counting the number of sequence reads within a gene, for all genes in the genome: |
OK, now we have reads aligned to a genome. How can we tell which genes those reads belonged to? We need to use the gene annotations for that genome of course! That's the REL606.gff file. One way to do this is to use a program which simply intersects features in the gene annotation file (i.e. genes) with the piled-up reads |
Expand |
---|
title | Analyze the count data |
---|
|
Normalize (only for mapped reads in this case) Check Principle Components Analysis (PCA) and box plots Calculate fold-change Log transform Volcano plot |
...