Page Comparison

...

Once a line has been read, it can be parsed, for example, using cut, as shown below. Other notes:c

The double quotes around the text that "$line" are important to preserve special characters inside the original line (here tab characters).
- Without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
Some lines have an empty job name field; we replace job and sample names in this case.
We assign file descriptor 4 to the file data being read (4< sampleinfo.txt after the done keyword), and read from it explicitly (read line <&4 in the while line).
- This avoids conflict with any global redirection of standard output (e.g. from automatic logging).

...

utility	default delimiter	how to change	example
cut	tab	-d or --delimiter option	`cut -d ':' -f 1 /etc/passwd`
sort	whitespace (one ore more spaces or Tabs)	-t or --field-separator option	`sort -t ':' -k1,1 /etc/passwd`
awk	whitespace (one ore more spaces or Tabs) Note: some older versions of awk do not treat Tabs as field separators.	FS (input field separator) and/or OFS (*output field separator) variable in BEGIN{ }* block -F or --field-separator option	`cat sampleinfo.txt \| awk 'BEGIN{ FS=OFS="\t" }` `{print $1,$3}'` `cat sampleinfo.txt /etc/passwd \| awk -F "\t:" '{print $1,$3 }'`
join	one or more spaces	-t option	`join -t $'\t' -j 2 file1 file12`
perl	whitespace (one ore more spaces or Tabs) when auto-splitting input with -a	-F'/<pattern>/' option	`cat sampleinfo.txt \| perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'`
read	whitespace (one ore more spaces or tabs	IFS= option	see example above

...

Default field separators
- Tab is the default field separator for cut
- whitespace (one or more spaces or Tabs) is the default field separator for awk
  - note that some older versions of awk do not include Tab as a default delimiter
Re-ordering
- cut cannot re-order fields; awk can cut -f 3,2 is the same as cut -f 2,3.
- awk does reorder fields, based on the order you specify
awk is a full-featured programming language while cut is just a single-purpose utility.

...

-n tells perl to feed the input to the script one line at a time
-e introduces the perl script (
- always encode
it
- a command-line perl script in single quotes to protect it from shell evaluation
)
$_ is a built-in variable holding the current line
~ is the perl pattern matching operator (=~ says pattern must match; ! ~ says pattern not matching)
the forward slashes ("/ /") enclose the regex pattern

...

Code Block

language	bash

tree /stor/work/CCBB_Workshops_1/bash_scripting/fastq/

Image RemovedThere are 4 FASTQ files we want to manipulate. LetImage Added

Here's start with a for loop to get their full paths, and just the FASTQ file names without the _a one-liner that isolates just the sample names:

Code Block

language	bash

find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
  | perl -pe 's|.*/||' | perl -pe 's/_S\d.*//'

But what if we want to manipulate each of the 4 FASTQ files? For eample, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:

Code Block

language	bash

# This is how to get all 4 full pathnames:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz"


# In a for loop, strip off the directory and the common _R1_001.fastq.gz file suffix
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  echo "$pfx - $file"
done

...

Code Block

language	bash

# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+.*//')
  echo "$pfx - $file"
done

...

Code Block

language	bash

# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat $path | wc -l` / 4 ))


# Better way using awk
zcat $path | wc -l | awk '{print $1/4}'

...

Code Block

language	bash

# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//')
  echo "Counting reads in $file..." 1>&2
  nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
  echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt


cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz

...

Code Block

language	bash

cat fastq_stats.txt | awk '
  BEGIN{FS="\t"; tot=0; ct=0}
  {tot = tot + $1
   ct = ct + 1}
  END{print "Total of",tot,"sequences in",ct,"files";
      printf("Mean reads per file: %d\n", tot/ct)}'


# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476

...

Code Block

language	bash

cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c


# produces this output:
      2 WT-1
      2 WT-2

...

Code Block

language	bash

for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
  echo "sample $samp" 1>&2
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk '
    BEGIN{FS="\t"; tot=0}{tot = tot + $1; pfx = $2}
    END{print pfx,"has",tot,"total reads"}'
done | tee sample_stats.txt


cat sample_stats.txt
# produces this output:
WT-1_L1 has 2712076 total reads
WT-2_L1 has 4777828 total reads

Versions Compared

Old Version 44

New Version 45

Key