So we've covered a number of "framework" topics – argument handling, stream handling, error handling – in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command line manipulations.
For data, we'll use a couple of files from the GSAF (Genome Sequencing and Analysis Facility) automated pipeline that delivers sequencing data to customers. These files have information about sequencing runs (a machine run, with many samples), sequencing jobs (representing a set of customer samples), and samples (a library of DNA molecules to sequence on the machine).
Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).
- joblist.txt - contains job name/run name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
- columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these names are given in a header line
create multiple symbolic links
When dealing with large data files, which are sometimes scattered across many directories, it is often convenient to create symbolic links to those files in the directory where you plan to work with them. A common way to make a symbolic link uses ln, e.g.:
mkdir ~/test; cd ~/test
ln -s -f ~/workshop/data/sampleinfo.txt
ls -l
Multiple files can be linked by providing multiple file name arguments and using the -t (target) option to specify the directory in which all the links should be created.
cd; rm -f test/*.*
ln -s -f -t test ~/workshop/data/*.txt
ls -l test
What about the case where the files you want are scattered in sub-directories? Here's a solution using find and xargs:
- find returns a list of matching file paths on its standard output
- the paths are piped to the standard input of xargs
- xargs reads the data on its standard input and calls the specified command (here ln) with those items appended as command-line arguments.
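The steps above can be sketched end-to-end. The /tmp/demo paths here are throwaway stand-ins for the class's ~/workshop/data and ~/test directories, so the example can be run anywhere:

```shell
# Demo setup: scatter a few .txt files in sub-directories
# (stands in for the scattered data files in the class)
mkdir -p /tmp/demo/data/run1 /tmp/demo/data/run2 /tmp/demo/test
touch /tmp/demo/data/run1/a.txt /tmp/demo/data/run2/b.txt

# find emits the matching paths; xargs calls ln with them as
# arguments, creating links in the -t target directory.
# -print0 / -0 keep odd characters in file names from breaking things.
find /tmp/demo/data -name "*.txt" -print0 | xargs -0 ln -s -f -t /tmp/demo/test
ls -l /tmp/demo/test
```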
sort & uniq tricks
The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
cd ~/test
cut -f 2 joblist.txt | sort | uniq | wc -l   # there are 1244 runs
Job names are in column 1 of the ~/test/sampleinfo.txt file. Here's how to create a histogram of job names showing the count of samples (lines) for each. The -c option to uniq adds a count of each unique item, which we can then sort on numerically (largest first) to show the jobs with the most samples at the top.
cat sampleinfo.txt | tail -n +2 | cut -f 1 | sort | uniq -c | sort -k1,1nr
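On a throwaway list of job names, the same uniq -c pattern looks like this:

```shell
# Toy input: six lines, three distinct job names.
# sort groups duplicates, uniq -c prefixes each with its count,
# and the final sort orders the counts numerically, largest first.
printf 'jobA\njobB\njobA\njobA\njobC\njobB\n' \
    | sort | uniq -c | sort -k1,1nr
# → 3 jobA, 2 jobB, 1 jobC (highest count first)
```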
exercise 1
How many unique job names are in the joblist.txt file?
Are all the job/run pairs unique?
Which run has the most jobs?
field delimiter issues
Different utilities assume different default field delimiters, and use different options to change them:

utility | default delimiter | how to change | example
---|---|---|---
cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd
sort | whitespace | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd
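One gotcha: to give sort a literal tab as its -t delimiter in bash, the $'\t' quoting form is handy. A quick sketch on a throwaway file:

```shell
# Make a small tab-delimited file: name<TAB>count
printf 'foo\t3\nbar\t10\nbaz\t2\n' > /tmp/counts.txt

# Sort numerically on field 2; $'\t' passes a literal tab to -t.
# Without -n, "10" would sort before "2" (text order).
sort -t $'\t' -k2,2n /tmp/counts.txt
# → baz 2, foo 3, bar 10
```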