Table of Contents

Files and File Systems

First, let's review Intro Unix: Files and File Systems and its most important takeaways.

Working with remote files

scp (secure copy)

The cp command only copies files and directories within the local host's file systems. The scp command is similar to cp, but lets you securely copy files from one machine to another. And like cp, scp has a -r (recursive) option to copy directories.

scp usage is similar to cp in that it copies from a <source> to a <destination>, but it uses remote machine addressing to qualify either the <source> or the <destination> (but not both). Remote machine addressing looks like this: <user_account>@<hostname>:<source_or_destination>

Examples: 

Open a new Terminal (Mac) or Command Prompt (Windows) window on your local computer (not logged in to your student account), and try the following, using your studentNN account and GSAF pod host. Note that you will always be prompted for your credentials on the remote host when you execute an scp command.

Code Block
languagebash
titlescp a single file
# On your local computer - not gsafcomp01 or gsafcomp02

# copy "haiku.txt" from your remote student Home directory to your current local directory
scp student01@gsafcomp01.ccbb.utexas.edu:~/haiku.txt . 

# copy "haiku.txt", now in your local current directory, to your remote student 
# Home directory with the name "haiku2.txt"
scp ./haiku.txt student01@gsafcomp01.ccbb.utexas.edu:~/haiku2.txt


Code Block
languagebash
titlescp a directory
# On your local computer - not gsafcomp01 or gsafcomp02

# copy the "docs" directory and its contents from your remote student Home directory 
# to a local sub-directory called "local_docs"
scp -r student01@gsafcomp01.ccbb.utexas.edu:~/docs/ ./local_docs/

# copy the "local_docs" sub-directory in your local current directory, to your 
#  remote student Home directory with the name "remote_docs"
scp -r ./local_docs/ student01@gsafcomp01.ccbb.utexas.edu:~/remote_docs/

wget (web get)

The wget <url> command lets you retrieve the contents of a valid Internet URL (e.g. http, https, ftp).

  • By default the downloaded file will be stored in the directory where you execute wget
    • with a filename based on the last component of the URL
  • The -O <path> option specifies the file or pathname where the URL data should be written.

Example:

Code Block
languagebash
# Make a new "wget" directory in your student Home directory and change into it
mkdir ~/wget; cd ~/wget

# download a Gencode statistics file using default output file naming
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt

# if you execute the same wget again, and the output file already exists
# wget will create a new one with a numeric extension
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt.1

# download the same Gencode statistics file to a different local filename
wget -O gencode_stats.txt "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l gencode_stats.txt

The find command

TBD

About compressed files

Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing whole directories.
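For directories, a quick sketch of tar usage looks like this (the docs directory and archive names below are just placeholders):

Code Block
languagebash
titletar a directory (a sketch)
# a sketch -- assumes a "docs" directory exists in the current directory
tar czf docs.tar.gz docs/     # create (c) a gzip-compressed (z) archive file (f) of "docs"
tar tzf docs.tar.gz           # list (t) the archive's contents without extracting
tar xzf docs.tar.gz -C /tmp   # extract (x) the archive into the /tmp directory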

Let's see how gzip works by using a small FASTQ file (~/data/fastq/small.fq) that contains NGS read data, where each sequence is represented by 4 lines.

Code Block
languagebash
cd ~/data/fastq       # change into your ~/data/fastq directory
ls -lh small.fq       # small.fq is 66K (~66,000) bytes long
wc -l small.fq        # small.fq is 1000 lines long

By default, when you call gzip on a file it compresses it in place, replacing the original with a file of the same name plus a .gz extension.

Code Block
languagebash
gzip small.fq         # compress the small.fq file in place, producing small.fq.gz file
ls -lh small.fq.gz    # small.fq.gz is only 15K bytes -- 4x smaller!

The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.

Code Block
languagebash
gunzip small.fq.gz    # decompress the small.fq.gz file in place, producing small.fq file
# or
gzip -d small.fq.gz

Both gzip and gunzip also have -c or --stdout options that tell the command to write to standard output, keeping the original files unchanged.

Code Block
languagebash
cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

gzip -c small.fq > sm2.fq.gz  # compress the "small.fq" into a new file called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq  # decompress "sm2.fq.gz" into a new "sm3.fq" file

Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.

Code Block
languagebash
cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

cat small.fq | gzip > small.fq.gz

The good news is that most bioinformatics programs can accept data in gzip-compressed format. But how do you view these compressed files?

  • The less pager accepts gzipped files as input
  • The zcat command is like cat, but works on gzipped files

Here are some ways to work with a compressed file:

Code Block
languagebash
cd                                      # make sure you're in your Home directory
cat jabberwocky.txt | gzip > jabber.gz  # make a compressed copy of the "jabberwocky.txt" file
less jabber.gz                          # use 'less' to view the compressed "jabber.gz" file (q to exit)

zcat jabber.gz | wc -l                       # count lines in the compressed "jabber.gz" file
zcat jabber.gz | tail -4                     # view the last 4 lines of the "jabber.gz" file
zcat jabber.gz | cat -n                      # view "jabber.gz" text with line numbers (no zcat -n option)
zcat jabber.gz | cat -n | tail -n +6 | head -4  # display lines 6 - 9 of "jabber.gz" text

Exercise 1-1

Display lines 6 - 9 of the compressed "jabber.gz" text

Expand
titleHint...

zcat, cat -n, then tail/head or head/tail


Expand
titleAnswer...

zcat jabber.gz | cat -n | tail -n +6 | head -4
- or -
zcat jabber.gz | cat -n | head -9 | tail -4

Working with 3rd party program I/O

Recall the three standard Unix streams: each has a number, a name, and redirection syntax (a short sketch follows the list below):

  • standard input is stream 0
    • redirect a file to standard input with the < operator
  • standard output is stream 1
    • redirect standard output to a file with the > or 1> operator
      • a single > or 1> overwrites any existing data in the target file
      • a double >> or 1>> appends to any existing data in the target file
  • standard error is stream 2
    • redirect standard error to a file with the 2> operator
      • a single 2> overwrites any existing data in the target file
      • a double 2>> appends to any existing data in the target file
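Here is a minimal sketch of these operators in action (the listing.txt and errors.log file names are just placeholders):

Code Block
languagebash
titleredirection sketch
# a minimal sketch -- "listing.txt" and "errors.log" are placeholder file names
ls -l /etc  > listing.txt  2> errors.log     # overwrite both target files
ls -l /etc >> listing.txt 2>> errors.log     # append to both target files instead
wc -l < listing.txt                          # feed a file to a command's standard input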

We also saw that 3rd party bioinformatics tools are often written as a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and samtools and bedtools tool suites. To see their menu of sub-commands, you usually just need to enter the top-level command, or <command> --help. Similarly, sub-command usage is usually available as <command> <sub-command> or <command> <sub-command> --help.
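For example (a sketch; it assumes bwa and samtools are installed on your PATH, and the exact help text varies by version):

Code Block
languagebash
titleviewing sub-command menus and usage (a sketch)
# a sketch -- assumes bwa and samtools are on your PATH; help text varies by version
bwa               # the top-level command alone lists its sub-commands
bwa aln           # a sub-command alone shows that sub-command's usage
samtools --help   # --help also lists the samtools sub-commands
samtools view     # shows usage for the samtools "view" sub-command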

Tip
title3rd party tools and standard streams

Many tools write their main output to standard output by default but have options to write it to a file instead.

Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file).

Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you'd usually specify a file. That standard input placeholder is usually a single dash ( - ) but can also be a reserved word such as stdin.

Now let's see how these concepts fit together when running 3rd party tools.

Exercise 1-1 bwa aln

Where does the bwa aln sub-command write its output?

Expand
titleAnswer...

The bwa aln usage

Usage:   bwa aln [options] <prefix> <in.fq>

does not specify an output file, so it must write its alignment information to standard output.

How can this be changed?

Expand
titleAnswer...

The bwa aln options usage says:

      -f FILE   file to write output to instead of stdout

bwa aln also writes diagnostic progress to standard error as it runs. Show how you would invoke bwa aln to capture both its alignment output and its progress diagnostics. Use input from a my_fastq.fq file and ./refs/hg38 as the <prefix>.

Expand
titleAnswers...

Redirecting the output to a file:
bwa aln ./refs/hg38 my_fastq.fq > my_fastq.aln  2>my_fastq.aln.log

Using the -f option:
bwa aln -f my_fastq.aln ./refs/hg38 my_fastq.fq  2>my_fastq.aln.log




Exercise 1-2 cutadapt

The cutadapt adapter trimming command reads NGS sequences from a FASTQ file, and writes adapter-trimmed reads to a FASTQ file. Find its usage.

Expand
titleAnswer...

cutadapt --help | more

Note that it also points you to https://cutadapt.readthedocs.io/ for full documentation.

Usage:

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

Where does cutadapt write its output by default? How can that be changed?

Expand
titleAnswer...

The cutadapt usage says that output is written to a file using the -o option:

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

But the brackets around [-o output.fastq] suggest this is optional. Reading a bit further we see:

...                                                  Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.


Where does cutadapt read its input from by default? How can that be changed?

Expand
titleAnswer...

The cutadapt usage shows that input comes from the positional input.fastq argument:

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

and, as noted above, the file name '-' can be given instead to read from standard input.
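Putting these together, here is one sketch of feeding cutadapt from standard input and capturing both of its output streams (the adapter sequence and file names below are just placeholders):

Code Block
languagebash
titlecutadapt with standard input (a sketch)
# a sketch -- the adapter sequence and the file names are placeholders
# trimmed reads go to standard output; the report and progress go to standard error
cat my_fastq.fq | cutadapt -a AGATCGGAAGAGC - > trimmed.fq 2> cutadapt.log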