Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleStart an idev session
idev -m 120180 -N 1 -A OTH21164 -r CoreNGS-Tue       # or -A TRA23004 
# -or-
idev -m 90120 -N 1 -A OTH21164 -p development   # or -A TRA23004 

Data staging

Set ourselves up to process some yeast data data in $SCRATCH, using some of best practices for organizing our workflow.

Code Block
languagebash
titleSet up directory for working with FASTQs
# Create a $SCRATCH area to work on data for this course,
# with a sub-directory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -s -f ~/CoreNGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f ~/CoreNGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

Exercise: What character in the quality score string in the FASTQ entry above represents the best base quality? Roughly what is the error probability estimated by the sequencer?

Expand
titleAnswer

J is the best base quality score character (Q=41)

It represents a probability of error of <= 1/10^4.1 or < 1/10,000

About compressed files

...

Tip
titlePathname wildcarding

The asterisk character ( * ) is a pathname wildcard that matches 0 or more characters.

(Read more about pathname wildcards: Pathname wildcards)

Exercise: About how big are the compressed files? The uncompressed files? About what is the compression factor?

...

Warning

Both gzip and gunzip are extremely I/O intensive when run on large files.

While TACC has tremendous compute resources and its specialized parallel file system is great, it has its limitations. It is not difficult to overwhelm the TACC file system if you gzip or gunzip more than a few files at a time – as few as 5-6!6!

See https://docs.tacc.utexas.edu/tutorials/managingio/ for TACC's guidelines for managing I/O.

The intensity of compression/decompression operations is another reason you should compress your sequencing files once (if they aren't already) then leave them that way.

...

One of the challenges of dealing with large data files, whether compressed or not, is finding your way around the data – finding and looking at relevant pieces of it. Except for the smallest of files, you can't open them up in a text editor because those programs read the whole file into memory, so will choke on sequencing data files! Instead we use various techniques to look at pieces of the files at a time. (Read more about commands for Displaying file contents)

The first technique is the use of pagers – we've already seen this with the more command. Review its use now on our small uncompressed file:

...

If you start less with the -N option, it will display line numbers Numbers. Using the -I options allows pattern searches to be case-Insensitive.

Exercise: What line of small.fq contains the read name with grid coordinates 2316:10009:100563?

...

Expand
titleHint

cat --help | more


Expand
titleAnswer


Code Block
languagebash
cat -n small.fq | tail


...

The pipe operator ( | ) connects one program's standard output to the next program's standard input. The power of the Linux command line is due in no small part to the power of piping. (Read more about about Piping and Standard Unix I/O streams)

...

  • When you execute the head -100 small.fq | more command, head starts writing lines of the small.fq file to standard output.
  • Because of the pipe, the output does not go to the Terminal, but is connected to the standard input  of the more command.
  • Instead of reading lines from a file you specify as a command-line argument, more obtains its input from standard input.
  • The more command writes a page of text to standard output, which is displayed on the Terminal.

...

But what's really cool about tail is its -n +NN syntax. This displays all the lines starting at line NN. Note this syntax: the -n option switch follows by a plus sign ( + ) in front of a number – the plus sign is what says "starting at this line"! Try these examples:

...

Code Block
languagebash
titleUsing the tail command
# shows the last 10 lines
tail small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15

Read more about head and tail in Displaying file contents.

zcat and gunzip -c tricks

...

Code Block
languagebash
titleCounting lines with wc -l
zcat Sample_Yeast_L005_R1.cat.fastq.gz | more
zcat Sample_Yeast_L005_R1.cat.fastq.gz | less -N
zcat Sample_Yeast_L005_R1.cat.fastq.gz | head
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8

# include original line numbers
zcat Sample_Yeast_L005_R1.cat.fastq.gz | cat -n | tail -n +901 | head -8

...

Tip

There will be times when you forget to pipe your large zcat or gunzip -c output somewhere – even the experienced among us still make this mistake! This leads to pages and pages of data spewing across your Terminal.

If you're lucky you can kill the output with Ctrl-c. But if that doesn't work (and often it doesn't) just close your Terminal window. This terminates the process on the server (like hanging up the phone), then you just can log back in.

...

Code Block
languagebash
titleSyntax for artithmetic on the command line
echo $((2368720 / 4))

Here's another trick: backticks backtick evaluation. When you enclose a command expression in backtick quotes ( ` ) the enclosed expression is evaluated and its standard output substituted into the string. (Read more about Quoting in the shell)

Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input.

...

Warning
titlebash arithmetic is integer valued only

Note that arithmetic in the bash shell is integer valued only, so don't use it for anything that requires decimal places!

(Read more about Arithemetic in bash)

A better way to do math

Well, doing math in bash is pretty awful – there has to be something better. There is! It's called awk, which is a powerful scripting language that is easily invoked from the command line.

In the code below we pipe the output from wc -l (number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces ( {  } ) for each line of input. Here the input is just one line, with one field – the line count. The awk body just divides the 1st input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced Some Linux commands: awk)

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools 
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz


...

Note that $1 means something different in awk – the 1st whitespace-delimited input field – than it does in bash, where it represents the 1st argument to a script or function (technically, the 1 environment variable). This is an example of where a metacharacter- the dollar sign ( $ ) here – has  a different meaning for two different programs. (Read more about Literal characters and metacharacters)

The bash shell treats dollar sign ( $ ) as an evaluation operator, so will normally attempt to evaluate the environment variable name following the $ and substitute its value in the output (e.g. echo $SCRATCH). But we don't want that evaluation to be applied to the {print $1 / 4} script argument passed to awk; instead we want awk to see the literal string {print $1 / 4} as its script. To achieve this result we surround the script argument with single quotes ( ' ' ), which tells the shell to treat everything enclosed by the quotes as literal text, and not perform any metacharacter evaluationprocessing. (Read more about Quoting in the shell)

...

Each time through the for loop, the next item in the argument list (here the files matching the wildcard glob *.gz) is assigned to the for loop's formal argument (here the variable fname). The actual filename is then referenced as$fname inside the loop. (Read more about about Bash control flow)