Page Comparison

Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or even if you do), you may notice later that something looks fishy in the the output: for example, many of your reads are not mapping or the ends of many of your reads do not align. Both can give you clues about whether you need to process the reads to improve the quality of data that you are putting into your analysis.

...

Expand

title	Click here to see if you are correct...

From the OPTIONS: section of the idev help output:

-m minutes sets time in minutes (default: 30)

-r reservation_name requests use of a specific reservation

-A account_name sets account name (default: -A none)

So you requested an idev node for 180 minutes, using the reservation named CCBB_Day_1, and asked that it be charged to the account named UT-2015-05-18.

...

Expand

title	Alternative using grep

grep or Global Regular Expression Print can also be used to determine the number of lines which match some criteria. Since we know the 3rd line in the fastq file is a + and a + only, we can look for a line that only has a + in it, and use that number to determine the number of sequence blocks in the file.

Code Block

language	bash
title	grep example

grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_2.fastq

the -c option tells grep to count the lines (rather than printing them all to the screen and tell you how many it found. The characters between the "" is what grep is looking for. The ^ symbol means, look for the beginning of the line, the $ symbol means look for the end of the line. Once again you see this returns 3800180 reads.

While checking the number of reads a file has can solve some of the most basic problems, it doesn't really provide any direct evidence as to the quality of the sequencing data. To get this type of information before starting meaningful analysis other programs must be used.

...

Code Block

title	One possible solution
collapse	true

cutadapt -l 34 -o SRR030257_1.trimmed.fastq SRR030257_1.fastq

The -l 34 option says that base 34 should be the last base (i.e., trim down to 34 bases)
The -o sets the output file, in this case SRR030257_1.trimmed.fastq
Listing the input file without any option in front of it (SRR030257_1.fastq) is a common way to specify input files.

Exercise: compressing the trimmed file

Compressed files are smaller, easier to transfer, and many programs allow for their use directly. How would you tell fastx_trimmercutadapt to compress (gzip) its output file?

code

Expand

title	Hint

Type fastx_trimmer cutadapt -h to see program documentation

and look for information about compressed files

cat SRR030257_1.fastq | fastx_trimmer

Expand

title	Possible solution using the -z option
collapse	true

key portion of help

Above the citation you see a paragraph that starts:

Input may also be in FASTA format. Compressed input and output is supported and auto-detected from the file name (.gz, .xz, .bz2)

So simply by adding .gz to the output file name, cutadapt will compress it after it does the trimming.

Code Block

title	Possible solution using the program directly
collapse	true

cutadapt -l 34 -zo > SRR030257_1.trimmed.fastq.gz SRR030257_1.fastq

Code Block

title	Possible solution using gzip yourself
collapse	true

cat SRR030257_1.fastq | fastx_trimmer -l 34 | gzip > SRR030257_1.trimmed.fastq.gz

Both of the above solutions give the same final product, but are clearly achieved in different ways. This is done to show you that data analysis is a results driven process, if the result is correct, and you know how you got the result it is correct as long as it is reproducablereproducible.

Adapter trimming

As mentioned above, fastx_clipper cutadapt can be used to trim specific sequences, and based on our fastqc analysis, the sequence AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA is significantly overrepresented in our data. How would you use fastx_clippercutadapt to remove those sequences from the fastq file?

Expand

title	Hint

Type fastx_clipper -h to see program documentationAgain, we go back to the program documentation to find what we are looking for: cutadapt -h

Look below the possible solution for more detailed information on what to focus on if you cant find what you are looking for .

Code Block

title	Possible solution
collapse	true

fastx_clipper -i SRR030257_1.trimmed.fastq cutadapt -o SRR030257_1.trimmed.depleted.fastq -a AAAAAAAAAAAAAAAAAAAA -l 34 -n m 16 SRR030257_1.fastq

Command portion	purpose
-i SRR030257_1.trimmed.fastq	use this file as input	-o SRR030257_1.trimmed.depleted.fastq	create this new output file
-a AAAAAAAAAAAAAAAAA	remove bases containing this sequence
-l 34l 34	trim reads to 34 bases
-m 16	discard any read shorter than 34 16 bases after sequence removed	-n	keep reads containing "N" bases in them. Consider how they are treated in downstream applicationsas these are more likely difficult to uniquely align to the genome
SRR030257_1.fastq	use this file as input

From the summary printed to the screen you can see that this removed a little over an additional 2.2M bp of sequence.

A note on versions

In our first tutorial we mentioned how knowing what version of a program you are using can be. When we loaded the the cutadapt module we didn't specify what version to load. Can you figure out what version you used, and what the most recent version of the program there is? .

...

Versions Compared

Old Version 6

New Version 7

Key

Overview

Adapter trimming

A note on versions