Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or even if you do), you may notice later that something looks fishy in the output: for example, many of your reads are not mapping, or the ends of many of your reads do not align. Both can give you clues about whether you need to process the reads to improve the quality of the data that you are putting into your analysis.

...


From the OPTIONS: section of the idev help output:

-m     minutes            sets time in minutes (default: 30)

-r     reservation_name   requests use of a specific reservation

-A     account_name       sets account name (default: -A none)

So you requested an idev node for 180 minutes, using the reservation named CCBB_Day_1, and asked that it be charged to the account named UT-2015-05-18.

...

Alternative using grep

grep, or Global Regular Expression Print, can also be used to count the lines that match some criterion. Since we know the 3rd line of each 4-line read block in this fastq file is a + and a + only, we can look for lines that contain only a +, and use that count to determine the number of sequence blocks in the file.


Code Block
languagebash
titlegrep example
grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_2.fastq

The -c option tells grep to count the matching lines (rather than printing them all to the screen) and report how many it found. The characters between the "" are the pattern grep is looking for. The ^ symbol matches the beginning of the line and the $ symbol matches the end of the line, so ^+$ matches only lines that consist of a single +. Once again you see this returns 3800180 reads.
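To see why counting + lines gives the number of reads, it can help to try it on a file small enough to check by eye. The sketch below builds a made-up 3-read fastq file (the read names, sequences, and the temporary file are invented for illustration and are not part of the course data) and counts the reads two ways: with grep, and by dividing the total line count from wc -l by 4. The wc -l comparison is an assumption about how you might cross-check the number, not necessarily the method used earlier in this tutorial.

```bash
# Build a tiny 3-read fastq file (made-up reads, for illustration only)
tmp_fastq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' 'IIII' \
  '@read3' 'GGCC' '+' 'IIII' > "$tmp_fastq"

# Method 1: count the lines that contain only a +
grep_count=$(grep -c '^+$' "$tmp_fastq")

# Method 2: count all lines and divide by 4 (each read takes up 4 lines)
line_count=$(wc -l < "$tmp_fastq")
wc_count=$(( line_count / 4 ))

echo "grep: $grep_count reads, wc: $wc_count reads"

# Clean up the temporary file
rm -f "$tmp_fastq"
```

Both methods should agree. If they don't on a real file, the file may be truncated, or its + lines may repeat the read name after the +, in which case ^+$ will not match them.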




While checking the number of reads in a file can solve some of the most basic problems, it doesn't provide any direct evidence as to the quality of the sequencing data. To get this type of information before starting meaningful analysis, other programs must be used.

...

Cutadapt provides a simple command line tool for manipulating fasta and fastq files. The program description on its website provides good details of all the capabilities, with examples for some common tasks. Cutadapt is also available via the TACC module system, allowing us to turn it on when we need it and not worry about it at other times.

Code Block
titlecutadapt module description
module spider cutadapt
module load cutadapt

...