...

Once a line has been read, it can be parsed, for example, using cut, as shown below. Other notes:

  • The double quotes around "$line" are important to preserve special characters inside the original line (here, Tab characters).
    • Without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
  • Some lines have an empty job name field; in that case we use the sample name as the job name.
  • We assign file descriptor 4 to the file data being read (4< sampleinfo.txt after the done keyword), and read from it explicitly (read line <&4 in the while statement).
    • This avoids conflict with any global redirection of standard output (e.g. from automatic logging).
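For reference, here is a minimal sketch of the kind of loop these notes describe; the column layout assumed for sampleinfo.txt (Tab-delimited, sample name in field 1, job name in field 2) is just for illustration:

Code Block
languagebash
# read sampleinfo.txt line by line on file descriptor 4
while read line <&4; do
  samp=$( echo "$line" | cut -f 1 )       # quoting "$line" preserves the Tabs
  job=$( echo "$line" | cut -f 2 )
  if [ -z "$job" ]; then job="$samp"; fi  # empty job name: use the sample name
  echo "job: $job  sample: $samp"
done 4< sampleinfo.txt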

...

For each utility: its default field delimiter, how to change it, and an example.

  • cut
    • default delimiter: Tab
    • how to change: -d or --delimiter option
    • example: cut -d ':' -f 1 /etc/passwd
  • sort
    • default delimiter: whitespace (one or more spaces or Tabs)
    • how to change: -t or --field-separator option
    • example: sort -t ':' -k1,1 /etc/passwd
  • awk
    • default delimiter: whitespace (one or more spaces or Tabs)
      • Note: some older versions of awk do not treat Tabs as field separators.
    • how to change: the FS (input field separator) and/or OFS (output field separator) variables in a BEGIN{ } block, or the -F / --field-separator option
    • examples:
      cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" } {print $1,$3}'
      cat sampleinfo.txt /etc/passwd | awk -F '[\t:]' '{print $1,$3}'
  • join
    • default delimiter: one or more spaces
    • how to change: -t option
    • example: join -t $'\t' -j 2 file1 file2
  • perl
    • default delimiter: whitespace (one or more spaces or Tabs), when auto-splitting input with -a
    • how to change: -F'/<pattern>/' option
    • example: cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'
  • read
    • default delimiter: whitespace (one or more spaces or Tabs)
    • how to change: set the IFS shell variable
    • example: see the while read loop example above

...

  • Default field separators
    • Tab is the default field separator for cut
    • whitespace (one or more spaces or Tabs) is the default field separator for awk
      • note that some older versions of awk do not include Tab as a default delimiter
  • Re-ordering
    • cut cannot re-order fields: cut -f 3,2 produces the same output as cut -f 2,3 (see the sketch below)
    • awk can re-order fields, based on the order in which you list them
  • awk is a full-featured programming language while cut is just a single-purpose utility.
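Here is a quick sketch of that difference; the Tab-delimited sampleinfo.txt file and the field numbers are just for illustration:

Code Block
languagebash
# cut always outputs fields in their original file order, however you list them
cut -f 3,2 sampleinfo.txt | head -1       # same output as cut -f 2,3
# awk outputs fields in the order you specify
awk 'BEGIN{ FS=OFS="\t" } {print $3,$2}' sampleinfo.txt | head -1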

...

  • -n tells perl to feed the input to the script one line at a time
  • -e introduces the perl script (always enclose a command-line perl script in single quotes to protect it from shell evaluation)
  • $_ is a built-in variable holding the current line
  • =~ and !~ are the perl pattern matching (binding) operators (=~ says the pattern must match; !~ says the pattern must not match)
  • the forward slashes ("/  /") enclose the regex pattern
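Putting these pieces together, here is a minimal sketch of such a one-liner; the file name and the pattern are just for illustration:

Code Block
languagebash
# print only lines that match the pattern WT
cat sampleinfo.txt | perl -ne 'print if $_ =~ /WT/;'

# negate the test with !~ to print lines that do NOT match
cat sampleinfo.txt | perl -ne 'print if $_ !~ /WT/;'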

...

Code Block
languagebash
tree /stor/work/CCBB_Workshops_1/bash_scripting/fastq/


There are 4 FASTQ files we want to manipulate.

Here's a one-liner that isolates just the sample names:

Code Block
languagebash
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
  | perl -pe 's|.*/||' | perl -pe 's/_S\d.*//'

But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:

Code Block
languagebash
# This is how to get all 4 full pathnames:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz"


# In a for loop, strip off the directory and the common _R1_001.fastq.gz file suffix
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  echo "$pfx - $file"
done

...

Code Block
languagebash
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+.*//')
  echo "$pfx - $file"
done

...

Code Block
languagebash
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat $path | wc -l` / 4 ))


# Better way using awk
zcat $path | wc -l | awk '{print $1/4}'

...

Code Block
languagebash
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//')
  echo "Counting reads in $file..." 1>&2
  nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
  echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt


cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz

...

Code Block
languagebash
cat fastq_stats.txt | awk '
  BEGIN{FS="\t"; tot=0; ct=0}
  {tot = tot + $1
   ct = ct + 1}
  END{print "Total of",tot,"sequences in",ct,"files";
      printf("Mean reads per file: %d\n", tot/ct)}'


# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476

...

Code Block
languagebash
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c


# produces this output:
      2 WT-1
      2 WT-2

...

Code Block
languagebash
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
  echo "sample $samp" 1>&2
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk '
    BEGIN{FS="\t"; tot=0}{tot = tot + $1; pfx = $2}
    END{print pfx,"has",tot,"total reads"}'
done | tee sample_stats.txt


cat sample_stats.txt
# produces this output:
WT-1_L1 has 2712076 total reads
WT-2_L1 has 4777828 total reads