Setup

Logon and

...

idev

First log in to ls6 like you did before. Then start an idev session so that we don't do too much processing on the login nodes.

Code Block
languagebash
titleStart an idev session
idev -p normal -m 60 -A OTH21164 -N 1 -n 68 --reservation=CoreNGSday2

Data staging

Set ourselves up to process some yeast data in $SCRATCH, using some best practices for organizing our workflow.

Code Block
languagebash
titleSet up directory for working with FASTQs
# Create a $SCRATCH area to work on data for this course,
# with a sub-directory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -s -f ~/CoreNGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f ~/CoreNGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -s -f /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

Code Block
languagebash
titlels options to see the size of linked files
# the -l option says "long listing", which shows where the link goes,
# but doesn't show details of the real file
ls -l

# the -L option says to follow the link to the real file, 
#     -l means long listing (includes size) 
# and    -h says "human readable" (e.g. MB, GB)
ls -Llh

...

  • For each base, an integer Phred-type quality score is calculated as score = -10 × log10(probability the base call is wrong); 33 is then added to give a number in the printable ASCII character range.
  • As you can see from the table below: letters are good, numbers are ok, and most special characters are bad (except :;<=>?@).
  • See https://www.asciitable.com
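
The encoding described above can be reversed with a little shell arithmetic. The following is a minimal sketch: the function names phred_score and phred_error_prob are made up for this illustration and are not part of any tool used in the course.

```bash
# Hypothetical helpers (not from any course tool) for Phred+33 decoding.

# Print the integer Phred score for one quality character.
phred_score() {
  local ascii
  # A leading single quote makes printf output the character's ASCII code
  ascii=$(printf '%d' "'$1")
  echo $(( ascii - 33 ))
}

# Print the probability that a base with the given Phred score is wrong.
phred_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_score I         # prints 40
phred_score 5         # prints 20
phred_error_prob 40   # prints 0.0001
phred_error_prob 20   # prints 0.01
```

So a quality character of I means roughly a 1 in 10,000 chance that the base call is wrong, while 5 means about 1 in 100.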

[Image: ASCII table]

See the Wikipedia FASTQ format page for more information.

...

Warning

Both gzip and gunzip are extremely I/O intensive when run on large files.

While TACC has tremendous compute resources and the Lustre parallel file system is great, it has its limitations. It is not difficult to overwhelm the Lustre file system if you gzip or gunzip more than a few files at a time (as few as 3-4)!

The intensity of compression/decompression operations is another reason you should compress your sequencing files once (if they aren't already) then leave them that way.

...

For a really quick peek at the first few lines of your data, there's nothing like the head command. By default head displays the first 10 lines of data from the file you give it or from its standard input. With an argument -NNN (that is a dash followed by some number), it will show that many lines of data.
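
The three behaviors described above can be tried on a throwaway file. This is only a sketch: demo_file is a hypothetical temporary file created for the illustration, not one of the course data files.

```bash
# Build a throwaway 50-line file so the example is self-contained
# (demo_file is a hypothetical name, not one of the course files)
demo_file=$(mktemp)
seq 1 50 > "$demo_file"

# No option: the first 10 lines
head "$demo_file"

# -NNN form: only the first 3 lines
head -3 "$demo_file"

# No file name: head reads its standard input instead
seq 1 50 | head -5
```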

...

The yang to head's yin is tail, which by default displays the last 10 lines of its data, and also accepts the -NNN syntax to show the last NNN lines. (Note that with very large files it may take a while for tail to start producing output because it has to read through the file sequentially to get to the end.)

But what's really cool about tail is its -n +NNN syntax, which displays all the lines starting at line NNN. Note this syntax: the -n option switch is followed by a plus sign ( + ) in front of a number, and the plus sign is what says "starting at this line"! Try these examples:

Code Block
languagebash
titleUsing the tail command
# shows the last 10 lines
tail small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15

zcat and gunzip -c

...

tricks

Ok, now you know how to navigate an un-compressed file using head and tail, more or less. But what if your FASTQ file has been compressed by gzip? You don't want to un-compress the file, remember?
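
One way to peek inside a compressed file without writing an uncompressed copy is to stream it through gunzip -c into head or tail. The sketch below assumes a small stand-in file: demo_dir and demo_gz are hypothetical names created for the illustration, so substitute your own .gz file.

```bash
# Build a small gzipped stand-in for a FASTQ file
# (demo_dir and demo_gz are hypothetical names; substitute your own .gz file)
demo_dir=$(mktemp -d)
demo_gz="$demo_dir/demo.fastq.gz"
seq 1 1000 | gzip > "$demo_gz"

# gunzip -c writes the uncompressed data to standard output,
# leaving the compressed file on disk untouched
gunzip -c "$demo_gz" | head -4

# tail's -n +NNN syntax works on the stream too
gunzip -c "$demo_gz" | tail -n +998
```

Because the data only flows through the pipe, nothing uncompressed is ever stored on the file system.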

...

Here fname is the name I gave the loop variable. Each time through the loop it is assigned a different file name produced by the filename wildcard matching. The actual file is then referenced as $fname inside the loop.

Note the general structure of the for loop. Different portions of the structure can be separated on different lines (like <something> and <something else> below) or put on one line separated with a semicolon ( ; ) like before the do keyword below.

Code Block
languagebash
for <variable name> in <expression>; do 
  <something>
  <something else>
done

Each time through the for loop, the next item in the evaluation expression (here *.gz) is assigned to the formal argument (here fname). Note that items in the evaluation expression are separated by spaces – not Tabs.
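
As a concrete illustration of that structure, here is a minimal sketch that loops over the .gz files in a directory and reports each one's line count. The function name count_gz_lines and the two tiny demo files are assumptions made up for this example.

```bash
# Hypothetical helper: report the line count of every .gz file in a directory
count_gz_lines() {
  local dir="$1"
  local fname nlines
  for fname in "$dir"/*.gz; do
    # $fname holds a different matching file on each pass
    nlines=$(( $(gunzip -c "$fname" | wc -l) ))
    echo "$(basename "$fname"): $nlines lines"
  done
}

# Two tiny gzipped files to loop over
demo_dir=$(mktemp -d)
seq 1 8  | gzip > "$demo_dir/a.fastq.gz"
seq 1 12 | gzip > "$demo_dir/b.fastq.gz"

count_gz_lines "$demo_dir"
# prints:
# a.fastq.gz: 8 lines
# b.fastq.gz: 12 lines
```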

Tip

The bash shell lets you put multiple commands on one line if they are each separated by a semicolon ( ; ). So in the above for loop, you can see that bash considers the do keyword to start a separate command. Two alternate ways of writing the loop are:

Code Block
languagebash
# One line for each clause, no semicolons
for <variable name> in <expression>
do 
  <something>
  <something else>
done


Code Block
languagebash
# All on one line, with semicolons separating clauses
for <variable name> in <expression>; do <something>; <something else>; done
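
To see that these layouts really are interchangeable, here is a small runnable sketch with the placeholders filled in. The variable name num and the list 1 2 3 are arbitrary choices for the illustration.

```bash
# The same loop written three ways; each prints 1, 2, 3 on separate lines.

# Mixed: a semicolon before do, one command per line in the body
for num in 1 2 3; do
  echo "$num"
done

# One line for each clause, no semicolons
for num in 1 2 3
do
  echo "$num"
done

# All on one line, with semicolons separating clauses
for num in 1 2 3; do echo "$num"; done
```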


...