Files and File Systems

First, let's review Intro Unix: Files and File Systems and its most important takeaways.

Working with remote files

scp (secure copy)

The cp command only copies files/directories within the local host's file systems. The scp command is similar to cp, but lets you securely copy files from one machine to another. Like cp, scp has a -r (recursive) option for copying directories.

scp usage is similar to cp in that it copies from a <source> to a <destination>, but remote machine addressing is used to qualify either the <source> or the <destination> (typically one of the two, not both). Remote machine addressing looks like this: <user_account>@<hostname>:<source_or_destination>

Examples: 

Open a new Terminal (Mac) or Command Prompt (Windows) window on your local computer (not logged in to your student account), and try the following, using your studentNN account and GSAF pod host. Note that you will always be prompted for your credentials on the remote host when you execute an scp command.

scp a single file
# On your local computer - not gsafcomp01 or gsafcomp02

# copy "haiku.txt" from your remote student Home directory to your current local directory
scp student01@gsafcomp01.ccbb.utexas.edu:~/haiku.txt . 

# copy "haiku.txt", now in your local current directory, to your remote student 
# Home directory with the name "haiku2.txt"
scp ./haiku.txt student01@gsafcomp01.ccbb.utexas.edu:~/haiku2.txt
scp a directory
# On your local computer - not gsafcomp01 or gsafcomp02

# copy the "docs" directory and its contents from your remote student Home directory 
# to a local sub-directory called "local_docs"
scp -r student01@gsafcomp01.ccbb.utexas.edu:~/docs/ ./local_docs/

# copy the "local_docs" sub-directory in your local current directory, to your 
#  remote student Home directory with the name "remote_docs"
scp -r ./local_docs/ student01@gsafcomp01.ccbb.utexas.edu:~/remote_docs/

wget (web get)

The wget <url> command lets you retrieve the contents of a valid Internet URL (e.g. http, https, ftp).

  • By default the downloaded file will be stored in the directory where you execute wget
    • with a filename based on the last component of the URL
  • The -O <path> option specifies the file or pathname where the URL data should be written.

Example:

# Make a new "wget" directory in your student Home directory and change into it
mkdir -p ~/wget; cd ~/wget

# download a Gencode statistics file using default output file naming
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt

# if you execute the same wget again, and the output file already exists
# wget will create a new one with a numeric extension
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt.1

# download the same Gencode statistics file to a different local filename
wget -O gencode_stats.txt "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l gencode_stats.txt

The find command

The find command is a powerful – and of course complex! – way of looking for files in a nested directory hierarchy. The general form I use is:

  • find <in_directory> [ operators ] -name <expression> [  tests ]
    • looks for files matching <expression> in <in_directory> and its sub-directories
    • <expression> can be a double-quoted string including pathname wildcards (e.g. "[a-g]*.txt")
    • there are tons of operators and tests:
      • -type f (file) and -type d (directory) are useful tests
      • -maxdepth NN is a useful operator to limit the depth of recursion.
    • returns a list of matching pathnames, prefixed with <in_directory>, one per output line.

Examples:

cd
find . -name "*.txt" -type f     # find all .txt files in the Home directory
find . -name "*docs*" -type d    # find all directories with "docs" in the directory name

Exercise 2-1

The /stor/work/CBRS_unix/fastq/ directory contains sequencing data from a GSAF Job. You can view its directory structure with the tree command (e.g. tree /stor/work/CBRS_unix/fastq/).

Use find to find all fastq.gz files in /stor/work/CBRS_unix/fastq/.

 Answer...

find /stor/work/CBRS_unix/fastq/ -name "*.fastq.gz" -type f
returns 4 file paths

How many fastq.gz files in /stor/work/CBRS_unix/fastq/ were run in sequencer lane L001?

 Answer...

find /stor/work/CBRS_unix/fastq/ -name "*L001*fastq.gz" -type f  | wc -l
reports 2 file paths

How many sample directories in /stor/work/CBRS_unix/fastq/ were run on July 10, 2020?

 Answer...

find /stor/work/CBRS_unix/fastq/ -name "*2020-07-10*" -type d  | wc -l
reports 2 directory paths


About compressed files

Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.
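
For example, tar's z option routes the archive through gzip. A quick sketch, assuming you have a "docs" sub-directory to bundle:

# create (c) a gzip-compressed (z) archive file (f) from the "docs" directory
tar czf docs.tar.gz docs/

# list (t) the archive's contents without extracting
tar tzf docs.tar.gz

# extract (x) the archive's contents
tar xzf docs.tar.gz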

Let's see how that works by using a small FASTQ file (~/data/fastq/small.fq) that contains NGS read data, where each sequence is represented by 4 lines.

cd ~/data/fastq       # change into your ~/data/fastq directory
ls -lh small.fq       # small.fq is 66K (~66,000) bytes long
wc -l small.fq        # small.fq is 1000 lines long

By default, when you call gzip on a file it compresses it in place, creating a file with the same name plus a .gz extension.

gzip small.fq         # compress the small.fq file in place, producing small.fq.gz file
ls -lh small.fq.gz    # small.fq.gz is only 15K bytes -- 4x smaller!

The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.

gunzip small.fq.gz    # decompress the small.fq.gz file in place, producing small.fq file
# or
gzip -d small.fq.gz

Both gzip and gunzip also have -c or --stdout options that tell the command to write to standard output, keeping the original files unchanged.

cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

gzip -c small.fq > sm2.fq.gz  # compress the "small.fq" into a new file called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq  # decompress "sm2.fq.gz" into a new "sm3.fq" file

Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.

cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

cat small.fq | gzip > small.fq.gz

The good news is that most bioinformatics programs can accept data in compressed gzipped format. But how do you view these compressed files?

  • The less pager accepts gzipped files as input
  • The zcat command is like cat, but works on gzipped files

Here are some ways to work with a compressed file:

cd                                      # make sure you're in your Home directory
cat jabberwocky.txt | gzip > jabber.gz  # make a compressed copy of the "jabberwocky.txt" file
less jabber.gz                          # use 'less' to view the compressed "jabber.gz" file (q to exit)

zcat jabber.gz | wc -l                       # count lines in the compressed "jabber.gz" file
zcat jabber.gz | tail -4                     # view the last 4 lines of the "jabber.gz" file
zcat jabber.gz | cat -n                      # view "jabber.gz" text with line numbers (no zcat -n option)
zcat jabber.gz | cat -n | tail -n +6 | head -4  # display lines 6 - 9 of "jabber.gz" text

Exercise 2-2

Display lines 6 - 9 of the compressed "jabber.gz" text

 Hint...

zcat, cat -n, and tail/head or head/tail

 Hint...

zcat jabber.gz | cat -n | tail -n +6 | head -4
- or -
zcat jabber.gz | cat -n | head -10 | tail -4

Working with 3rd party program I/O

Recall the three standard Unix streams: each has a number, a name, and redirection syntax:

  • standard input is stream 0
    • a command's standard input can be supplied from a file with the < operator, or from a pipe
  • standard output is stream 1
    • redirect standard output to a file with the > or 1> operator
      • a single > or 1> overwrites any existing data in the target file
      • a double >> or 1>> appends to any existing data in the target file
  • standard error is stream 2
    • redirect standard error to a file with the 2> operator
      • a single 2> overwrites any existing data in the target file
      • a double 2>> appends to any existing data in the target file
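
For example, using the ls command (the file and directory names here are just illustrations):

# overwrite listing.txt with ls's standard output
ls ~/data > listing.txt

# append a second listing to the same file
ls ~/wget >> listing.txt

# capture the error text a failing command writes to standard error
ls /no/such/directory 2> errors.log

# capture both streams, each to its own file
ls ~/data /no/such/directory > listing.txt 2> errors.log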

We also saw that 3rd party bioinformatics tools are often written as a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and samtools and bedtools tool suites. To see their menu of sub-commands, you usually just need to enter the top-level command, or <command> --help. Similarly, sub-command usage is usually available as <command> <sub-command> or <command> <sub-command> --help.
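
For example, with bwa:

bwa        # entering the top-level command alone lists its sub-commands
bwa aln    # entering a sub-command alone displays that sub-command's usage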

3rd party tools and standard streams

Many tools write their main output to standard output by default but have options to write it to a file instead.

Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file).

Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you'd usually specify a file. That standard input placeholder is usually a single dash ( - ) but can also be a reserved word such as stdin.
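
For example, here is a sketch of how these conventions combine in a typical pipeline, assuming bwa and a recent samtools are installed (the reference prefix and file names are just illustrations, and it uses the bwa mem sub-command rather than the bwa aln sub-command discussed below):

# bwa mem writes SAM alignments to standard output and progress to standard error;
# samtools view reads from standard input via the '-' placeholder and
# converts the alignments to BAM (-b), which we redirect to a file
bwa mem ./refs/hg38 my_fastq.fq 2> aln.log | samtools view -b - > my_fastq.bam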

Now let's see how these concepts fit together when running 3rd party tools.

Exercise 2-3 bwa aln

Where does the bwa aln sub-command write its output?

 Answer...

The bwa aln usage

Usage:   bwa aln [options] <prefix> <in.fq>

does not specify an output file, so it must write its alignment information to standard output.

How can this be changed?

 Answer...

The bwa aln options usage says:

      -f FILE   file to write output to instead of stdout

bwa aln also writes diagnostic progress as it runs, to standard error. Show how you would invoke bwa aln to capture both its alignment output and its progress diagnostics. Use input from a my_fastq.fq file and ./refs/hg38 as the <prefix>.

 Answers...

Redirecting the output to a file:
bwa aln ./refs/hg38 my_fastq.fq > my_fastq.aln  2>my_fastq.aln.log

Using the -f option:
bwa aln -f my_fastq.aln ./refs/hg38 my_fastq.fq  2>my_fastq.aln.log




Exercise 2-4 cutadapt

The cutadapt adapter trimming command reads NGS sequences from a FASTQ file, and writes adapter-trimmed reads to a FASTQ file. Find its usage.

 Answer...

cutadapt --help | more

Note that it also points you to https://cutadapt.readthedocs.io/ for full documentation.

Usage:

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

Where does cutadapt write its output by default? How can that be changed?

 Answer...

The cutadapt usage says that output is written to a file using the -o option

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

But the brackets around [-o output.fastq] suggest this is optional. Reading a bit further we see:

...                                                  Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.
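
Putting that together, here is a minimal sketch, assuming cutadapt is installed (the adapter sequence and file names are just illustrations):

# name the input file and send trimmed reads to the -o output file
# (cutadapt then prints its summary report to standard output)
cutadapt -a AGATCGGAAGAGC -o trimmed.fq small.fq > cutadapt.report

# or read from standard input via the '-' placeholder; without -o,
# trimmed reads go to standard output and the report to standard error
zcat small.fq.gz | cutadapt -a AGATCGGAAGAGC - > trimmed.fq 2> cutadapt.report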


Now consider fastx_trimmer, another read-trimming tool, from the FASTX-Toolkit. Where does it read its input from by default? How can that be changed?

 Answer...

The fastx_trimmer options usage says:

    [-i INFILE]  = FASTA/Q input file. default is STDIN.
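
Here is a minimal sketch of both input styles, assuming fastx_trimmer is installed (the -f/-l trim positions and file names are just illustrations; some FASTQ files also need -Q 33 to declare their quality-score encoding):

# read from a file with -i, write trimmed reads to the -o output file
# (-f 1 -l 50 keeps bases 1 through 50 of each read)
fastx_trimmer -f 1 -l 50 -i small.fq -o small_trimmed.fq

# or rely on the defaults: read standard input, write standard output
cat small.fq | fastx_trimmer -f 1 -l 50 > small_trimmed.fq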


