...
So we've examined a lot of "framework" issues (argument handling, stream handling, error handling) in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command line manipulations.
For data, we'll use a couple of files from the automated processing that the GSAF (Genome Sequencing and Analysis Facility) uses to deliver sequencing data to customers. These files have information about sequencing runs (a machine run, with many samples), sequencing jobs (each representing a set of customer samples), and samples (a library of DNA molecules to sequence on the machine).
Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).
- joblist.txt - contains job name/sample name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
  - columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string, which are named in a header line
change shell text colors
If your terminal has a dark background, the default shell colors can be hard to read. Execute this line to display directory names in yellow (and add it to your ~/.profile login script).
...
Code Block | ||
---|---|---|
| ||
cd ~/test; cut -f 2 joblist.txt | sort | uniq | wc -l # there are 1244 runs |
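The cut, sort, uniq, wc -l pattern above can be tried on a small made-up file first. This is only a minimal sketch: the file toy_joblist.txt and its contents are hypothetical stand-ins for a two-column, tab-delimited file with no header, not the real joblist.txt.

```shell
# Hypothetical toy data: two tab-delimited columns, no header.
tmpdir=$(mktemp -d)
printf '%s\t%s\n' \
  JA1 runA \
  JA1 runB \
  JA2 runB \
  JA3 runC > "$tmpdir/toy_joblist.txt"

# uniq only collapses ADJACENT duplicate lines, so the values must be sorted first.
n_unique=$(cut -f 2 "$tmpdir/toy_joblist.txt" | sort | uniq | wc -l)

# sort -u sorts and removes duplicates in a single step.
n_unique_alt=$(cut -f 2 "$tmpdir/toy_joblist.txt" | sort -u | wc -l)

echo "unique values in column 2: $n_unique (with sort -u: $n_unique_alt)"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Both pipelines should report the same number. Leaving out the sort step can over-count, because uniq does not notice duplicates that are separated by other lines.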
...
Job names are in column 1 of the ~/test/sampleinfo.txt file. Here's how to create a histogram of job names showing the count of samples (lines) for each. The -c option to uniq adds a count for each unique item, which we can then sort on (numerically) to show the jobs with the most samples first.
Code Block | ||
---|---|---|
| ||
cat sampleinfo.txt | tail -n +2 | cut -f 1 | sort | uniq -c | sort -k1,1nr |
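To see each stage of that histogram pipeline in action, here is a minimal sketch run against a tiny made-up file. The file toy_sampleinfo.txt and every value in it are hypothetical. Only the column layout (a header line followed by job_name, job_id, sample_name, sample_id, date_string) follows the description of sampleinfo.txt.

```shell
# Hypothetical toy data: a header line plus six sample lines, tab-delimited.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\n' \
  job_name job_id sample_name sample_id date_string \
  JA2 2 s1 201 2020-01-01 \
  JA1 1 s2 101 2020-01-01 \
  JA2 2 s3 202 2020-01-01 \
  JA3 3 s4 301 2020-01-02 \
  JA2 2 s5 203 2020-01-02 \
  JA1 1 s6 102 2020-01-02 > "$tmpdir/toy_sampleinfo.txt"

# tail -n +2   : start output at line 2, skipping the header
# cut -f 1     : keep only the job name column
# sort         : group identical job names on adjacent lines
# uniq -c      : collapse each group and prefix it with its line count
# sort -k1,1nr : order by that count, numerically, largest first
histogram=$(tail -n +2 "$tmpdir/toy_sampleinfo.txt" | cut -f 1 | sort | uniq -c | sort -k1,1nr)
echo "$histogram"

# The job with the most samples is on the first line (JA2 here, with 3 samples).
top_job=$(echo "$histogram" | head -n 1 | awk '{print $2}')
echo "job with the most samples: $top_job"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Reading the file directly with tail avoids the extra cat process, but the result is the same as piping cat into tail.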
exercise 1
How many unique job names are in the joblist.txt file?
...