...

So we've examined a lot of "framework" issues – argument handling, stream handling, error handling – in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command-line manipulations.

For data, we'll use a couple of files from the automated processing that the GSAF (Genome Sequencing and Analysis Facility) uses to deliver sequencing data to customers. These files have information about sequencing runs (a machine run, with many samples), sequencing jobs (representing a set of customer samples), and samples (a library of DNA molecules to sequence on the machine).

Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples sequenced on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these names appear in a header line
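The two layouts described above can be mocked up with a minimal sketch like the one below. It builds small stand-in files with the same column structure so you can see the difference between a headerless file and one with a header line. The file names (joblist_demo.txt, sampleinfo_demo.txt) and every value in them are invented for illustration; they are not taken from the real GSAF files.

```bash
# Build tiny stand-in files in a scratch directory (all values are hypothetical).
demo_dir=$(mktemp -d)
printf 'JA001\tsampleA\nJA001\tsampleB\nJA002\tsampleC\n' > "$demo_dir/joblist_demo.txt"
printf 'job_name\tjob_id\tsample_name\tsample_id\tdate_string\n' > "$demo_dir/sampleinfo_demo.txt"
printf 'JA001\t101\tsampleA\t9001\t2015-01-05\n' >> "$demo_dir/sampleinfo_demo.txt"
printf 'JA002\t102\tsampleC\t9003\t2015-01-06\n' >> "$demo_dir/sampleinfo_demo.txt"

# The job list has no header, so its first line is already data.
first_pair=$(head -n 1 "$demo_dir/joblist_demo.txt")
echo "$first_pair"

# The sample info file starts with a header; list its column names one per line.
header_columns=$(head -n 1 "$demo_dir/sampleinfo_demo.txt" | tr '\t' '\n')
echo "$header_columns"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Turning tabs into newlines with tr is a quick way to read a wide header, and it makes it easy to work out which column number to pass to cut -f.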

change shell text colors

If your terminal has a dark background, the default shell colors can be hard to read. Execute this line to display directory names in yellow (and put it in your ~/.profile login script):

...

Code Block
languagebash
cd ~/test
# count the distinct values in column 2 of joblist.txt
cut -f 2 joblist.txt | sort | uniq | wc -l
# there are 1244 runs
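The sort step in the pipeline above matters because uniq only collapses duplicate lines that are adjacent. The following is a minimal sketch on an invented two-column file (the values a, b, X, Y, Z are placeholders, not real data) that compares the count with and without sorting, and shows sort -u as a one-step alternative.

```bash
# Four tab-delimited lines; column 2 holds X, Y, X, Z (hypothetical values).
demo_file=$(mktemp)
printf 'a\tX\nb\tY\nc\tX\nd\tZ\n' > "$demo_file"

# Without sort, the two X lines are not adjacent, so uniq keeps both of them.
unsorted_count=$(cut -f 2 "$demo_file" | uniq | wc -l)

# Sorting first makes duplicates adjacent, so uniq can collapse them.
sorted_count=$(cut -f 2 "$demo_file" | sort | uniq | wc -l)

# sort -u sorts and removes duplicates in a single step.
sort_u_count=$(cut -f 2 "$demo_file" | sort -u | wc -l)

echo "unsorted:" $unsorted_count "sorted:" $sorted_count "sort -u:" $sort_u_count
rm -f "$demo_file"
```

Here the unsorted pipeline reports 4 lines while both sorted variants report 3, which is the true number of distinct values.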

...

Job names are in column 1 of the ~/test/sampleinfo.txt file. Here's how to create a histogram of job names showing the count of samples (lines) for each. The -c option to uniq adds a count of unique items, which we can then sort on (numerically) to show the jobs with the most samples first.

Code Block
languagebash
cat sampleinfo.txt | tail -n +2 | cut -f 1 | sort | uniq -c | sort -k1,1nr
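To see what each stage of the histogram pipeline contributes, here is a minimal sketch that runs the same steps on a small invented file. The job names (jobA, jobB, jobC) and sample names are hypothetical, and the demo file has only two columns rather than the five in sampleinfo.txt, since only column 1 is used.

```bash
demo_file=$(mktemp)
# Header line plus six sample rows; only column 1 (job_name) matters here.
printf 'job_name\tsample_name\n' > "$demo_file"
printf 'jobB\ts1\njobA\ts2\njobB\ts3\njobC\ts4\njobB\ts5\njobA\ts6\n' >> "$demo_file"

# tail -n +2    : start output at line 2, skipping the header
# cut -f 1      : keep only the job name column
# sort          : group identical names together so uniq can see them
# uniq -c       : collapse each group and prefix it with its count
# sort -k1,1nr  : order by that count, numerically, largest first
histogram=$(tail -n +2 "$demo_file" | cut -f 1 | sort | uniq -c | sort -k1,1nr)
echo "$histogram"

# The job with the most samples ends up on the first line.
top_job=$(echo "$histogram" | head -n 1 | awk '{print $2}')
echo "top job: $top_job"
rm -f "$demo_file"
```

The amount of leading whitespace that uniq -c puts before each count differs between systems, which is why the sketch pulls out the job name with awk rather than relying on fixed column positions.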

exercise 1

How many unique job names are in the joblist.txt file?

...