...
Once a line has been read, it can be parsed, for example, using cut, as shown below. Other notes:c
- The double quotes around the text that "$line" are important to preserve special characters inside the original line (here tab characters).
- Without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
- Some lines have an empty job name field; we replace job and sample names in this case.
- We assign file descriptor 4 to the file data being read (
4< sampleinfo.txt
after the done keyword), and read from it explicitly (read line <&4 in
the while line).- This avoids conflict with any global redirection of standard output (e.g. from automatic logging).
- This avoids conflict with any global redirection of standard output (e.g. from automatic logging).
...
utility | default delimiter | how to change | example |
---|---|---|---|
cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd |
sort | whitespace (one ore more spaces or Tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd |
awk | whitespace (one ore more spaces or Tabs) Note: some older versions of awk do not treat Tabs as field separators. |
|
cat sampleinfo.txt /etc/passwd | awk -F "\t:" '{print $1,$3 }' |
join | one or more spaces | -t option |
|
perl | whitespace (one ore more spaces or Tabs) when auto-splitting input with -a | -F'/<pattern>/' option | cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";' |
read | whitespace (one ore more spaces or tabs | IFS= option | see example above |
...
- Default field separators
- Tab is the default field separator for cut
- whitespace (one or more spaces or Tabs) is the default field separator for awk
- note that some older versions of awk do not include Tab as a default delimiter
- Re-ordering
- cut cannot re-order fields; awk can
cut -f 3,2
is the same ascut -f 2,3
. - awk does reorder fields, based on the order you specify
- cut cannot re-order fields; awk can
- awk is a full-featured programming language while cut is just a single-purpose utility.
...
- -n tells perl to feed the input to the script one line at a time
- -e introduces the perl script (
- always encode
- a command-line perl script in single quotes to protect it from shell evaluation
- $_ is a built-in variable holding the current line
- ~ is the perl pattern matching operator (=~ says pattern must match; ! ~ says pattern not matching)
- the forward slashes ("/ /") enclose the regex pattern
...
Code Block | ||
---|---|---|
| ||
tree /stor/work/CCBB_Workshops_1/bash_scripting/fastq/ |
There are 4 FASTQ files we want to manipulate. Let
Here's start with a for loop to get their full paths, and just the FASTQ file names without the _a one-liner that isolates just the sample names:
Code Block | ||
---|---|---|
| ||
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
| perl -pe 's|.*/||' | perl -pe 's/_S\d.*//'
|
But what if we want to manipulate each of the 4 FASTQ files? For eample, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
Code Block | ||
---|---|---|
| ||
# This is how to get all 4 full pathnames:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz"
# In a for loop, strip off the directory and the common _R1_001.fastq.gz file suffix
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
file=`basename $path`
pfx=${file%%_R1_001.fastq.gz}
echo "$pfx - $file"
done |
...
Code Block | ||
---|---|---|
| ||
# Shorten the sample prefix some more... for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do file=`basename $path` pfx=${file%%_R1_001.fastq.gz} pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+.*//') echo "$pfx - $file" done |
...
Code Block | ||
---|---|---|
| ||
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat $path | wc -l` / 4 ))
# Better way using awk
zcat $path | wc -l | awk '{print $1/4}' |
...
Code Block | ||
---|---|---|
| ||
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
file=`basename $path`
pfx=${file%%_R1_001.fastq.gz}
pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//')
echo "Counting reads in $file..." 1>&2
nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt
cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz |
...
Code Block | ||
---|---|---|
| ||
cat fastq_stats.txt | awk '
BEGIN{FS="\t"; tot=0; ct=0}
{tot = tot + $1
ct = ct + 1}
END{print "Total of",tot,"sequences in",ct,"files";
printf("Mean reads per file: %d\n", tot/ct)}'
# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476 |
...
Code Block | ||
---|---|---|
| ||
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c
# produces this output:
2 WT-1
2 WT-2 |
...
Code Block | ||
---|---|---|
| ||
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
echo "sample $samp" 1>&2
cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk '
BEGIN{FS="\t"; tot=0}{tot = tot + $1; pfx = $2}
END{print pfx,"has",tot,"total reads"}'
done | tee sample_stats.txt
cat sample_stats.txt
# produces this output:
WT-1_L1 has 2712076 total reads
WT-2_L1 has 4777828 total reads |