Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or even if you do), you may notice later that something looks fishy in the the output: for example, many of your reads are not mapping or the ends of many of your reads do not align. Both can give you clues about whether you need to process the reads to improve the quality of data that you are putting into your analysis.

...

Expand
titleClick here to see if you are correct...

From the OPTIONS: section of the idev help output:

-m     minutes            sets time in minutes (default: 30)

-r     reservation_name   requests use of a specific reservation

-A     account_name       sets account name (default: -A none)

So you requested an idev node for 180 minutes, using the reservation named CCBB_Day_1, and asked that it be charged to the account named UT-2015-05-18.

...

Expand
titleAlternative using grep

grep or Global Regular Expression Print can also be used to determine the number of lines which match some criteria. Since we know the 3rd line in the fastq file is a + and a + only, we can look for a line that only has a + in it, and use that number to determine the number of sequence blocks in the file.


Code Block
languagebash
titlegrep example
grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_2.fastq

the -c option tells grep to count the lines (rather than printing them all to the screen and tell you how many it found. The characters between the "" is what grep is looking for. The ^ symbol means, look for the beginning of the line, the $ symbol means look for the end of the line. Once again you see this returns 3800180 reads.




While checking the number of reads a file has can solve some of the most basic problems, it doesn't really provide any direct evidence as to the quality of the sequencing data. To get this type of information before starting meaningful analysis other programs must be used.

...

Code Block
titleOne possible solution
collapsetrue
cutadapt -l 34 -o SRR030257_1.trimmed.fastq SRR030257_1.fastq
  • The -l 34 option says that base 34 should be the last base (i.e., trim down to 34 bases)
  • The -o sets the output file, in this case SRR030257_1.trimmed.fastq
  • Listing the input file without any option in front of it (SRR030257_1.fastq) is a common way to specify input files.

Exercise: compressing the trimmed file 

Compressed files are smaller, easier to transfer, and many programs allow for their use directly. How would you tell fastx_trimmercutadapt to compress (gzip) its output file?

code
Expand
titleHint

Type fastx_trimmer cutadapt -h to see program documentation

and look for information about compressed files

cat SRR030257_1.fastq | fastx_trimmer
Expand
titlePossible solution using the -z option
collapsetrue
key portion of help

Above the citation you see a paragraph that starts:

Input may also be in FASTA format. Compressed input and output is supported and auto-detected from the file name (.gz, .xz, .bz2)

So simply by adding .gz to the output file name, cutadapt will compress it after it does the trimming.

Code Block
titlePossible solution using the program directly
collapsetrue
cutadapt -l 34 -zo > SRR030257_1.trimmed.fastq.gz SRR030257_1.fastq
Code Block
titlePossible solution using gzip yourself
collapsetrue
cat SRR030257_1.fastq | fastx_trimmer -l 34 | gzip > SRR030257_1.trimmed.fastq.gz

Both of the above solutions give the same final product, but are clearly achieved in different ways. This is done to show you that data analysis is a results driven process, if the result is correct, and you know how you got the result it is correct as long as it is reproducablereproducible.

Adapter trimming

 As mentioned above,  fastx_clipper cutadapt can be used to trim specific sequences, and based on our fastqc analysis, the sequence AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA is significantly overrepresented in our data. How would you use fastx_clippercutadapt to remove those sequences from the fastq file?

 

Expand
titleHint

Type fastx_clipper -h to see program documentationAgain, we go back to the program documentation to find what we are looking for: cutadapt -h

Look below the possible solution for more detailed information on what to focus on if you cant find what you are looking for .

Code Block
titlePossible solution
collapsetrue
fastx_clipper -i SRR030257_1.trimmed.fastq cutadapt -o SRR030257_1.trimmed.depleted.fastq -a AAAAAAAAAAAAAAAAAAAA -l 34 -n m 16 SRR030257_1.fastq

 

Command portionpurpose
-i SRR030257_1.trimmed.fastquse this file as input-o SRR030257_1.trimmed.depleted.fastqcreate this new output file

-a AAAAAAAAAAAAAAAAA

remove bases containing this sequence
-l 34l 34trim reads to 34 bases
-m 16discard any read shorter than 34 16 bases after sequence removed -nkeep reads containing "N" bases in them. Consider how they are treated in downstream applicationsas these are more likely difficult to uniquely align to the genome
SRR030257_1.fastquse this file as input

From the summary printed to the screen you can see that this removed a little over an additional 2.2M bp of sequence.

A note on versions 

In our first tutorial we mentioned how knowing what version of a program you are using can be. When we loaded the the cutadapt module we didn't specify what version to load. Can you figure out what version you used, and what the most recent version of the program there is? .

...