...
Warning | ||
---|---|---|
When following along here, please start an idev session for running any example commands:
|
Illumina sequence data format (FASTQ)
...
Code Block | ||
---|---|---|
| ||
@HWI-ST1097:104:D13TNACXX:4:1101:1715:2142 1:N:0:CGATGT
GCGTTGGTGGCATAGTGGTGAGCATAGCTGCCTTCCAAGCAGTTATGGGAG
+
=<@BDDD=A;+2C9F<CB?;CGGA<<ACEE*1?C:D>DE=FC*0BAG?DB6
|
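As a rough illustration of how the fourth line of a record encodes base qualities, the sketch below converts a quality string into numeric scores. It assumes the Phred+33 (Sanger) offset, which is what the `-Q 33` option passed to fastx_trimmer later on this page implies for this data. `phred_scores` is a hypothetical helper name, not part of any toolkit.

```shell
# Hypothetical helper: decode a Phred+33 quality string into integer scores.
phred_scores() {
  printf '%s\n' "$1" | awk '
    # Build a character -> ASCII code lookup table for printable characters.
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      out = ""
      for (j = 1; j <= length($0); j++) {
        score = ord[substr($0, j, 1)] - 33   # assumed offset of 33
        sep = (j > 1) ? " " : ""
        out = out sep score
      }
      print out
    }'
}

# First five quality characters of the example record above
phred_scores '=<@BD'
```

Higher numbers mean more confident base calls, so a quick look at decoded scores shows where along a read the quality starts to drop.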
...
Expand | ||
---|---|---|
| ||
Executing the command above reports that the 2nd sequence has ID = @SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC |
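The command referred to above is not reproduced here, but one possible way to pull out the Nth record of a FASTQ file is sketched below. It assumes every record occupies exactly four lines. The `fastq_record` helper and the toy file are made up for illustration.

```shell
# Hypothetical helper: print record N (1-based) of a 4-line-per-record FASTQ file.
fastq_record() {
  n=$1
  file=$2
  # Keep the first N records, then only the last 4 lines of those.
  head -n $(( n * 4 )) "$file" | tail -n 4
}

# Build a three-record toy file in a scratch directory so the snippet runs anywhere.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHHH\n@r3\nGGCC\n+\nFFFF\n' > "$tmp/toy.fastq"

# Show the ID and sequence lines of the 2nd record
fastq_record 2 "$tmp/toy.fastq" | head -n 2
```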
...
Code Block | ||
---|---|---|
| ||
wc -l $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq
|
...
Code Block | ||
---|---|---|
| ||
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | wc -l
|
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The bash shell has a really strange syntax for arithmetic: it uses a double-parenthesis operator. Go figure.
|
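A minimal sketch of the double-parenthesis arithmetic in action: because each FASTQ record spans four lines, dividing the line count by 4 gives the number of sequences. The toy file below is invented so that the snippet can run without the course data.

```shell
# Toy FASTQ file with two records (8 lines), created in a scratch directory.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHHH\n' > "$tmp/toy.fastq"
gzip -c "$tmp/toy.fastq" > "$tmp/toy.fastq.gz"

# Uncompressed file: count the lines, then divide by 4 inside $(( ... )).
lines=$(wc -l < "$tmp/toy.fastq")
echo $(( lines / 4 ))

# Compressed file: stream it through gunzip first, then count the same way.
gz_lines=$(gunzip -c "$tmp/toy.fastq.gz" | wc -l)
echo $(( gz_lines / 4 ))
```

Reading the file on standard input (`wc -l < file`) makes `wc` print only the number, which keeps the arithmetic expression simple.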
FASTQ Quality Assurance tools
...
Code Block | ||
---|---|---|
| ||
# setup
cds
mkdir fastqc_test
cd fastqc_test
cp $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz .
# running the program
$BI/bin/FastQC/fastqc Sample_Yeast_L005_R1.cat.fastq.gz
|
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The Sample_Yeast_L005_R1.cat.fastq.gz file is what we analyzed, so FastQC created the other two items. Sample_Yeast_L005_R1.cat_fastqc is a directory (the "d" in "drwxrwxr-x"), so use ls Sample_Yeast_L005_R1.cat_fastqc to see what's in it. Sample_Yeast_L005_R1.cat_fastqc.zip is just a zipped (compressed) version of the whole directory. |
...
Code Block | ||
---|---|---|
| ||
http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/Sample_Yeast_L005_R1.cat_fastqc/fastqc_report.html
|
...
Code Block | ||
---|---|---|
| ||
# setup
cds
mkdir samstat_test
cd samstat_test
cp $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq .
# run the program
$BI/bin/samstat SRR030257_1.fastq
|
...
Code Block | ||
---|---|---|
| ||
http://loving.corral.tacc.utexas.edu/bioiteam/SRR030257_1.fastq.html
|
...
Code Block | ||
---|---|---|
| ||
# setup
cds
mkdir samstat_test2
cd samstat_test2
cp $BI/web/yeast_stuff/yeast_chip_sort.bam .
# run the program
$BI/bin/samstat yeast_chip_sort.bam
|
...
Code Block | ||
---|---|---|
| ||
http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/yeast_chip_sort.bam.html
|
...
Code Block | ||
---|---|---|
| ||
module spider fastx_toolkit
module load fastx_toolkit
|
...
No Format | ||
---|---|---|
| ||
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 > trimmed.fq
|
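To make concrete what trimming to a fixed length does, here is a rough awk stand-in that keeps only the first N characters of the sequence and quality lines (lines 2 and 4 of each record). It illustrates the effect of `-l 50` and is not how fastx_trimmer is implemented. `trim_to_length` is a hypothetical name.

```shell
# Hypothetical stand-in for "fastx_trimmer -l N": reads FASTQ on stdin,
# truncates sequence and quality lines to N characters, writes to stdout.
trim_to_length() {
  awk -v n="$1" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, n); next }  # sequence and quality lines
    { print }                                                     # header and "+" lines pass through
  '
}

# The example record from the top of this page (51 bases), trimmed to 50
printf '%s\n' \
  '@HWI-ST1097:104:D13TNACXX:4:1101:1715:2142 1:N:0:CGATGT' \
  'GCGTTGGTGGCATAGTGGTGAGCATAGCTGCCTTCCAAGCAGTTATGGGAG' \
  '+' \
  '=<@BDDD=A;+2C9F<CB?;CGGA<<ACEE*1?C:D>DE=FC*0BAG?DB6' |
  trim_to_length 50
```

The sequence and quality lines must always be cut to the same length, otherwise downstream tools will reject the record.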
...
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
You could supply the -z option like this:
Or you could gzip the output yourself:
|
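The second option above, compressing the output yourself, follows the usual pattern of piping into `gzip -c`. In the sketch below, `produce_fastq` is a hypothetical placeholder for the trimming pipeline shown earlier, so that the snippet runs even where the toolkit is not installed.

```shell
# Hypothetical placeholder for any command that writes FASTQ to stdout,
# such as the fastx_trimmer pipeline shown earlier.
produce_fastq() {
  printf '@r1\nACGT\n+\nIIII\n'
}

tmp=$(mktemp -d)

# Compress on the fly instead of writing an uncompressed trimmed.fq
produce_fastq | gzip -c > "$tmp/trimmed.fq.gz"

# Check the result without decompressing it on disk
gunzip -c "$tmp/trimmed.fq.gz" | head -n 4
```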
...
Expand | |||||
---|---|---|---|---|---|
| |||||
Type fastx_ and then press Tab to see their names.
|
Adapter trimming
...
Code Block | ||
---|---|---|
| ||
cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
|
Code Block | ||
---|---|---|
| ||
cutadapt -m 22 -O 10 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA
|
...
Expand | |||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||
Please refer to https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors for the Illumina library adapter layout. The top strand, 5' to 3', of a read sequence looks like this.
The -a argument to cutadapt is documented as the "sequence of adapter that was ligated to the 3' end". So we care about the <Read 2 primer> for R1 reads, and the <Read 1 primer> for R2 reads. The "contaminant" for adapter trimming will be the <Read 2 primer> for R1 reads. There is only one Read 2 primer:
The "contaminant" for adapter trimming will be the <Read 1 primer> for R2 reads. However, there are three different Read 1 primers, depending on library construction:
Since R2 reads are the reverse complement of R1 reads, the R2 adapter contaminant will be the reverse complement (RC) of the Read 1 primer used. For ChIP-seq libraries, where reads come from both DNA strands, the TruSeq Read 1 primer is always used.
For RNA-seq libraries, we use the small RNA sequencing primer as the Read 1 primer.
|
...
No Format | ||
---|---|---|
| ||
flexbar -n 1 --adapters adaptors.fna --source example.fastq --target example.ar --format fastq-sanger --adapter-threshold 2 --adapter-min-overlap 6 --adapter-trim-end RIGHT_TAIL
|
Code Block | ||
---|---|---|
| ||
>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>adaptor2
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>adaptor1_RC
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>adaptor2_RC
CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
|
Note that flexbar searches only for the exact sequences given (with options to allow a given number of mismatches), not for their reverse complements, so you must provide those yourself.
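One way to generate the reverse-complement entries rather than typing them by hand is sketched below. It assumes each adaptor sequence sits on a single line and contains only uppercase A, C, G, T, or N. `add_reverse_complements` is a hypothetical helper, not a flexbar feature.

```shell
# Hypothetical helper: for every entry in a FASTA file, print the entry
# followed by a "<name>_RC" entry holding its reverse complement.
add_reverse_complements() {
  awk '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"; comp["N"] = "N" }
    /^>/ { name = substr($0, 2); print; next }   # remember the entry name
    {
      rc = ""
      # Walk the sequence backwards, complementing each base.
      for (i = length($0); i >= 1; i--) rc = rc comp[substr($0, i, 1)]
      print
      print ">" name "_RC"
      print rc
    }' "$1"
}

# Try it on the first adaptor from the file above
tmp=$(mktemp -d)
printf '>adaptor1\nAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT\n' > "$tmp/adaptors.fna"
add_reverse_complements "$tmp/adaptors.fna"
```

The generated `adaptor1_RC` line should match the hand-written entry in the adaptor file shown above, which is a convenient sanity check before running flexbar.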
Trimmomatic
Trimmomatic offers similar options, with the potential benefit that many Illumina adaptor sequences are already "built in". It is available here.