...
Line 1 is the unique read name. The format for Illumina reads is as follows, using the read name above:
...
@HWI-ST1097:127:C0W5VACXX:5:1101:4820:2124 1:N:0:CTCAGA
- The line as a whole is unique for this read fragment. However:
- the corresponding R1 and R2 reads have identical machine_id:lane:flowcell_grid_coordinates information. This common part of the name ties the two read ends together.
- the end_number:failed_qc:0:barcode information differs between R1 and R2.
- Most sequencing facilities will not give you qc-failed reads (failed_qc = Y) unless you ask for them.
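The two parts of the read name described above can be pulled apart on the command line. The following is a minimal sketch that uses only `cut` on the example read name; the variable names are illustrative and not part of any tool.

```bash
# Example read name from above (an R1 read)
name='@HWI-ST1097:127:C0W5VACXX:5:1101:4820:2124 1:N:0:CTCAGA'

# Part before the space: shared by the R1 and R2 reads of a fragment
common=$(echo "$name" | cut -d' ' -f1)
# Part after the space: end_number:failed_qc:0:barcode
end_info=$(echo "$name" | cut -d' ' -f2)

# Split the end-specific part on colons
end_number=$(echo "$end_info" | cut -d: -f1)
failed_qc=$(echo "$end_info" | cut -d: -f2)
barcode=$(echo "$end_info" | cut -d: -f4)

echo "common part: $common"
echo "end_number=$end_number failed_qc=$failed_qc barcode=$barcode"
```

Here failed_qc is N, so this read passed the sequencer's quality filter.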
Line 2 is the sequence reported by the machine, starting with the first base of the insert (the 5' adapter has usually been removed by the sequencing facility). These are ACGT or N uppercase characters.
Line 3 always starts with '+' in files from GSAF (it can optionally include a sequence description).
...
Exercise: What character in the quality score string in the FASTQ entry above represents the best base quality? Roughly what is the error probability estimated by the sequencer?
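As a worked illustration of how a single quality character maps to an error probability, here is a small sketch. It assumes the Phred+33 encoding, where the character's ASCII code minus 33 gives the quality score Q and the estimated error probability is 10^(-Q/10). The character `5` is an arbitrary example, not one taken from the entry above.

```bash
# Arbitrary example quality character (an assumption for illustration)
qchar='5'

# A leading single quote makes printf report the character's ASCII code
ascii=$(printf '%d' "'$qchar")

# Phred+33 encoding assumed: quality score is the ASCII code minus 33
q=$((ascii - 33))

# Error probability is 10^(-Q/10); awk handles the floating point math
perr=$(awk -v q="$q" 'BEGIN { printf "%g", 10 ^ (-q / 10) }')

echo "char=$qchar Q=$q error_probability=$perr"
```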
...
The most common compression program used for individual files is gzip and its counterpart gunzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.
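To see what gzip and gunzip do to a file name, the following sketch compresses and then restores a throwaway file in a temporary directory. The file name demo.fastq and its contents are made up for illustration.

```bash
workdir=$(mktemp -d)

# Write a few made-up FASTQ records so there is something to compress
for i in 1 2 3 4 5; do
  printf '@read%s\nACGTACGTAC\n+\nIIIIIIIIII\n' "$i"
done > "$workdir/demo.fastq"

gzip "$workdir/demo.fastq"        # replaces demo.fastq with demo.fastq.gz
after_gzip=$(ls "$workdir")

gunzip "$workdir/demo.fastq.gz"   # restores demo.fastq, removes the .gz
after_gunzip=$(ls "$workdir")

echo "after gzip:   $after_gzip"
echo "after gunzip: $after_gunzip"
rm -r "$workdir"
```

Note that by default both programs replace their input file rather than keeping a second copy.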
...
Code Block | ||||
---|---|---|---|---|
| ||||
ls -lh $CORENGS/yeast_stuff/*L005*.fastq
ls -lh $CORENGS/yeast_stuff/*L005*.fastq.gz |
Tip | ||
---|---|---|
| ||
The asterisk character ( * ) is a pathname wildcard that matches 0 or more characters. Read more about pathname wildcards here: Pathname wildcards and special characters |
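The wildcard behavior can be tried safely on empty files. This is a small sketch in a temporary directory; the file names are hypothetical and only loosely modeled on the ones used in this section.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical, empty files created just to show what each pattern matches
touch Yeast_L005_R1.fastq Yeast_L005_R2.fastq Yeast_L005_R1.fastq.gz Yeast_L006_R1.fastq

# Names containing L005 and ending in .fastq (the .gz file is excluded)
fastq_matches=$(echo *L005*.fastq)
# Names containing L005 and ending in .fastq.gz
gz_matches=$(echo *L005*.fastq.gz)
# The * can also match zero characters, as it does between L005 and _R1 here
zero_match=$(echo Yeast_L005*_R1.fastq)

echo "$fastq_matches"
echo "$gz_matches"
echo "$zero_match"

cd "$OLDPWD"
rm -r "$workdir"
```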
...
Expand | ||
---|---|---|
| ||
The FASTQ files are ~150 MB |
You may be tempted to un-compress your sequencing files in order to manipulate them more directly, but resist that temptation! Nearly all modern bioinformatics tools are able to work on .gz files, and there are tools and techniques for working with the contents of compressed files without ever un-compressing them.
...
Code Block | ||||
---|---|---|---|---|
| ||||
# shows the last 10 lines
tail small.fq
# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more
# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more
# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15 |
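Because each FASTQ record occupies exactly 4 lines, the tail and head combination above can pull out one specific record: record number n starts at line 4*(n-1)+1. The following is a minimal sketch on a throwaway file; the file name tiny.fq and its contents are made up for illustration.

```bash
workdir=$(mktemp -d)
fq="$workdir/tiny.fq"

# Build 5 made-up records named read1 .. read5
for i in 1 2 3 4 5; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$fq"

# Record n starts at line 4*(n-1)+1
n=3
start=$(( 4 * (n - 1) + 1 ))

# Start output at that line, then keep just the 4 lines of the record
record=$(tail -n +"$start" "$fq" | head -n 4)
echo "$record"

first_line=$(echo "$record" | head -n 1)
rm -r "$workdir"
```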
...
Code Block | ||||
---|---|---|---|---|
| ||||
# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | more
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | head
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8
# Note that less will display .gz file contents automatically
less Sample_Yeast_L005_R1.cat.fastq.gz |
...
Code Block | ||||
---|---|---|---|---|
| ||||
zcat Sample_Yeast_L005_R1.cat.fastq.gz | more
zcat Sample_Yeast_L005_R1.cat.fastq.gz | head
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8 |
Tip |
---|
There will be times when you forget to pipe your large zcat or gunzip -c output somewhere. Even the experienced among us still make this mistake! The result is pages and pages of data spewing across your terminal. If you're lucky you can kill the output with Ctrl-c. But if that doesn't work (and often it doesn't), just close your Terminal window. This terminates the process on the server (like hanging up the phone), and then you can just log back in. |
...
Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input – not echo.
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | echo "$((`wc -l` / 4))" |
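The same calculation can be tried on a small throwaway file. This sketch is a variation, not the form used above: it uses `$( )` command substitution, which behaves like backticks but is easier to read when nested, and `gunzip -c` in place of zcat. The file name tiny.fastq.gz and its contents are invented for illustration.

```bash
workdir=$(mktemp -d)

# Write 6 made-up records straight into a gzipped file
for i in 1 2 3 4 5 6; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done | gzip > "$workdir/tiny.fastq.gz"

# Count lines in the compressed file without un-compressing it on disk
nlines=$(gunzip -c "$workdir/tiny.fastq.gz" | wc -l)

# Each FASTQ record is 4 lines, so divide by 4 using shell integer math
nseqs=$(( nlines / 4 ))

echo "$nseqs sequences"
rm -r "$workdir"
```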
...
In the code below we pipe the output from wc -l (the number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces { }) for each line of input. Here the input is just one line, with one field: the line count. The awk body just divides the 1st input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced commands: awk)
...
Code Block | ||||
---|---|---|---|---|
| ||||
for fname in *.gz; do
echo "Processing $fname"
echo "...$fname has $((`zcat $fname | wc -l` / 4)) sequences"
done |
...