Files and File systems
First, let's review Intro Unix: Files and File Systems. The most important takeaways are:
- Understanding the tree-like structure of directories and files in the file system hierarchy
- Absolute paths start with a slash ( / ), the root of the file system hierarchy
- More at: Intro Unix: Files and File Systems: The File System hierarchy
- Knowing how to navigate the file system using the cd (change directory) command, Tab key completion, and relative path syntax:
- use the dot ( . ) metacharacter for the current directory
- use the dot-dot ( .. ) metacharacters for the parent directory
- More at:
- Selecting multiple files using pathname wildcards (a.k.a. "globbing")
- asterisk ( * ) to match any length of characters
- brackets ( [ ] ) match any character between the brackets, including hyphen ( - ) delimited character ranges such as [A-G]
- More at: Intro Unix: Files and File Systems: Pathname wildcards (globbing)
- A basic understanding of file attributes such as
- file type (file, directory)
- owner and group
- permissions (read, write, execute) for the owner, group and everyone
- More at: Intro Unix: Files and File Systems: File attributes
- Familiarity with basic file manipulation commands (mkdir, cp, mv, rm)
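The wildcard and relative path rules reviewed above can be tried safely in a scratch directory. This is only a minimal sketch: the directory and file names (Apple.txt, Grape.txt, Zebra.txt, notes.md, sub) are made up for illustration and are not part of the class data.

```bash
# Work in a throw-away directory so nothing in your Home directory is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir sub
touch Apple.txt Grape.txt Zebra.txt notes.md

# asterisk matches any run of characters: every name ending in ".txt"
all_txt=$(echo *.txt)
echo "$all_txt"

# brackets with a range: names whose first character is A through G
a_to_g=$(echo [A-G]*.txt)
echo "$a_to_g"

# relative paths: change into "sub", then reach the parent directory with ".."
cd sub
from_sub=$(echo ../[A-G]*.txt)
echo "$from_sub"
cd ..
```

Using echo with a wildcard is a harmless way to preview what a pattern matches before handing it to cp, mv or rm.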
Working with remote files
scp (secure copy)
The cp command only copies files/directories within the local host's file systems. The scp command is similar to cp, but lets you securely copy files from one machine to another. Like cp, scp has a -r (recursive) option to copy directories.
scp usage is similar to cp in that it copies from a <source> to a <destination>, but uses remote machine addressing to qualify either the <source> or the <destination> but not both.
Remote machine addressing looks like this: <user_account>@<hostname>:<source_or_destination>
Examples:
Open a new Terminal (Mac) or Command Prompt (Windows) window on your local computer (not logged in to your student account), and try the following, using your studentNN account and GSAF pod host. Note that you will always be prompted for your credentials on the remote host when you execute an scp command.
To copy a remote file:
```bash
# On your local computer - not gsafcomp01 or gsafcomp02
# Be sure to use your assigned student account and hostname

# copy "haiku.txt" from your remote student Home directory to your current local directory
scp student01@gsafcomp01.ccbb.utexas.edu:~/haiku.txt .

# copy "haiku.txt", now in your local current directory, to your remote student
# Home directory with the name "haiku2.txt"
scp ./haiku.txt student01@gsafcomp01.ccbb.utexas.edu:~/haiku2.txt
```
To copy a remote directory:
```bash
# On your local computer - not gsafcomp01 or gsafcomp02
# Be sure to use your assigned student account and hostname

# copy the "docs" directory and its contents from your remote student Home directory
# to a local sub-directory called "local_docs"
scp -r student01@gsafcomp01.ccbb.utexas.edu:~/docs/ ./local_docs/

# copy the "local_docs" sub-directory in your local current directory, to your
# remote student Home directory with the name "remote_docs"
scp -r ./local_docs/ student01@gsafcomp01.ccbb.utexas.edu:~/remote_docs/
```
Tip |
---|
When transferring files between your computer and a remote server, you always need to execute the command on your local computer. This is because your personal computer does not have an entry in the global hostname database, whereas the remote computer does. The global Domain Name System (DNS) database maps full host names to their IP (Internet Protocol) addresses. Computers that can be accessed from anywhere on the Internet have their host names registered in DNS. |
wget (web get)
The wget <url> command lets you retrieve the contents of a valid Internet URL (e.g. http, https, ftp).
...
```bash
# Make a new "wget" directory in your student Home directory and change into it
mkdir -p ~/wget; cd ~/wget

# download a Gencode statistics file using default output file naming
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt

# if you execute the same wget again, and the output file already exists
# wget will create a new one with a numeric extension
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt.1*

# download the same Gencode statistics file to a different local filename
wget -O gencode_stats.txt "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l gencode_stats.txt
```
...
- looks for files matching <expression> in <in_directory> and its sub-directories
- <expression> can be a double-quoted string including pathname wildcards (e.g. "[a-g]*.txt")
- there are tons of operators and tests:
- -type f (file) and -type d (directory) are useful tests
- -maxdepth NN is a useful operator to limit the depth of recursion.
- returns a list of matching pathnames, relative to <in_directory>, one per output line.
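The tests and operators listed above can be explored on a small scratch tree. This is a sketch under stated assumptions: every file and directory name below is hypothetical, and the output is piped through sort only so the order is predictable.

```bash
# Build a small scratch tree (all names here are made up for illustration)
find_dir=$(mktemp -d)
mkdir -p "$find_dir/top/deep"
touch "$find_dir/apple.txt" "$find_dir/top/banana.txt" \
      "$find_dir/top/zebra.txt" "$find_dir/top/deep/cherry.txt"
cd "$find_dir"

# regular files at any depth whose names match the double-quoted wildcard
found_files=$(find . -name "[a-g]*.txt" -type f | LC_ALL=C sort)
echo "$found_files"

# directories only (the starting directory "." is included)
found_dirs=$(find . -type d | LC_ALL=C sort)
echo "$found_dirs"

# -maxdepth 2 stops the search before it reaches ./top/deep/cherry.txt
shallow=$(find . -maxdepth 2 -name "*.txt" -type f | LC_ALL=C sort)
echo "$shallow"
```

Quoting the wildcard matters: without the double quotes the shell would expand the pattern itself before find ever sees it.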
...
Expand | ||
---|---|---|
| ||
find /stor/work/CBRS_unix/fastq/ -name "*2020-07-10*" -type d | wc -l |
...
- ln -s <path> says to create a symbolic link (symlink) to the specified file (or directory) in the current directory
- always use the -s option to avoid creating a hard link, which behaves quite differently
- the default link name corresponds to the last name component in <path>
- you can name the link file differently by supplying an optional link_file_name.
- it is best to change into (cd) the directory where you want the link before executing ln -s
- a symbolic link can be deleted without affecting the linked-to file
- the -f (force) option says to overwrite any existing file or symbolic link with the same name
Examples:
...
- The 10-character permissions field ( lrwxrwxrwx ) has an l in the left-most file type position, indicating this is a symbolic link.
- The symlink itself is colored differently (in cyan)
- There are two extra fields after the symlink name
- field 10 is an arrow ( -> ) pointing to field 11
- field 11 is the path of the linked-to file ("../haiku.txt")
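A listing like the one described above can be reproduced in a scratch directory. This is a minimal sketch: the directory layout and the poem.txt link name are assumptions made for illustration, and the haiku.txt created here is just a one-line placeholder.

```bash
link_dir=$(mktemp -d)
cd "$link_dir"
echo "an old pond" > haiku.txt       # placeholder for the linked-to file
mkdir links
cd links                             # change into the directory where the links should live

# the link name defaults to the last component of the path: "haiku.txt"
ln -s ../haiku.txt

# supply an explicit link name instead
ln -s ../haiku.txt poem.txt

ls -l                                # both entries start with "l" and end with "-> ../haiku.txt"
link_target=$(readlink poem.txt)     # readlink prints the linked-to path
link_text=$(cat poem.txt)            # reading the link reads the linked-to file

# deleting a symlink leaves the linked-to file in place
rm poem.txt
ls -l ../haiku.txt
```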
...
- find returns a list of matching file paths on its standard output
- ln wants its files listed as arguments, not on standard input
- so the paths are piped to the standard input of xargs
- xargs takes the data on its standard input and calls the specified command (here ln -sf -t .) with that data as the command's argument list.
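The find, xargs and ln steps above can be sketched end to end on a scratch tree. The directory and file names are hypothetical, and the -t (target directory) option assumes GNU ln, as found on Linux systems.

```bash
xargs_dir=$(mktemp -d)
mkdir -p "$xargs_dir/data/run1" "$xargs_dir/data/run2" "$xargs_dir/links"
touch "$xargs_dir/data/run1/a.fastq" \
      "$xargs_dir/data/run2/b.fastq" \
      "$xargs_dir/data/run2/notes.txt"

cd "$xargs_dir/links"                # the directory where the symlinks should be created

# find writes one matching path per line on its standard output;
# xargs appends those paths as arguments to "ln -sf -t ."
find "$xargs_dir/data" -name "*.fastq" -type f | xargs ln -sf -t .

ls -l
link_count=$(find . -type l | wc -l | tr -d ' ')
echo "symlinks created: $link_count"
```

If file names may contain spaces, pairing find's -print0 with xargs -0 keeps each path intact as a single argument.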
...
```bash
# copy a small.fq file into a new ~/gzips directory
cd; mkdir -p ~/gzips
cp -p /stor/work/CCBB_Workshops_1/misc_data/fastq/small.fq ~/gzips/
cd ~/gzips
ls -lh            # small.fq is 66K (~66,000) bytes long
wc -l small.fq    # small.fq is 1000 lines long
```
...
The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.
```bash
# decompress the small.fq.gz file in place, producing small.fq file
gunzip small.fq.gz
# or gzip -d small.fq.gz
```
Both gzip and gunzip also have -c or --stdout options that tell the command to write on standard output, keeping the original files unchanged.
```bash
cd ~/gzips                     # change into your ~/gzips directory
ls small.fq                    # make sure you have an uncompressed "small.fq" file
gzip -c small.fq > sm2.fq.gz   # compress the "small.fq" into a new file called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq   # decompress "sm2.fq.gz" into a new "sm3.fq" file
ls -lh
```
Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.
```bash
cd ~/gzips                        # change into your ~/gzips directory
ls small.fq                       # make sure you have an uncompressed "small.fq" file
cat small.fq | gzip > sm4.fq.gz   # compress standard input into a new "sm4.fq.gz" file
```
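One way to convince yourself that piping through gzip and gunzip is lossless is a round trip on a throw-away file. This is a sketch only: reads.txt and roundtrip.txt are made-up names, not class files.

```bash
gz_dir=$(mktemp -d)
cd "$gz_dir"
printf 'line %s\n' 1 2 3 4 5 > reads.txt    # a small 5-line stand-in file

# standard input -> standard output: compress, then decompress, using pipes only
cat reads.txt | gzip > reads.txt.gz
cat reads.txt.gz | gunzip > roundtrip.txt

# cmp is silent and returns success when the two files are identical
cmp reads.txt roundtrip.txt && echo "round trip OK"

# the original file is untouched because gzip never saw its name
ls -l
```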
The good news is that most bioinformatics programs can accept data in compressed gzipped format. But how do you view these compressed files?
...
```bash
cd ~/gzips
cat ../jabberwocky.txt | gzip > jabber.gz   # make a compressed copy of "jabberwocky.txt"
less jabber.gz                              # use 'less' to view the compressed "jabber.gz" file (type 'q' to exit)
zcat jabber.gz | wc -l                      # count lines in the compressed "jabber.gz" file
zcat jabber.gz | tail -4                    # view the last 4 lines of the "jabber.gz" file
zcat jabber.gz | cat -n                     # view "jabber.gz" text with line numbers (zcat does not have an -n option)
```
Exercise 2-2
Display lines 7 - 9 of the compressed "jabber.gz" text
Expand | ||
---|---|---|
| ||
zcat jabber.gz | cat -n | tail -n +7 | head -3 |
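The same line-range logic can be checked on a file whose contents make the answer obvious. In this sketch numbers.gz is a hypothetical stand-in for "jabber.gz" in which each line holds its own line number; the sed form is offered as an alternative, not as the exercise's intended answer.

```bash
range_dir=$(mktemp -d)
cd "$range_dir"
seq 1 12 | gzip > numbers.gz     # 12 lines containing 1 through 12

# tail -n +7 starts output at line 7; head -3 keeps the first 3 of those lines
with_tail=$(zcat numbers.gz | tail -n +7 | head -3)

# sed -n '7,9p' prints only lines 7 through 9
with_sed=$(zcat numbers.gz | sed -n '7,9p')

echo "$with_tail"
echo "$with_sed"
```

Both pipelines read the decompressed text from standard input, so the compressed file is never modified.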
Working with 3rd party program I/O
...
3rd party tool files and streams
In Intro Unix: The Bash shell and commands: Getting help we saw that third party bioinformatics tools are often written to perform sub-command processing; that is, they have a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and the samtools and bedtools tool suites.
...
Tip | ||
---|---|---|
| ||
Many tools write their main output to standard output by default but have options to write it to a file instead. Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file). Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you'd usually specify a file. That standard input placeholder is usually a single dash ( - ) but can also be a reserved word such as stdin. |
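The three stream conventions in this tip can be sketched without installing any bioinformatics software. Here fake_tool is a hypothetical shell function standing in for a real tool, and all file names are made up; cat is used to show the dash placeholder because it is one standard command that accepts it.

```bash
io_dir=$(mktemp -d)
cd "$io_dir"

# fake_tool is a made-up stand-in for a 3rd party tool:
# results go to standard output, progress messages go to standard error
fake_tool() {
    echo "processing..." >&2
    tr 'acgt' 'ACGT'             # reads standard input, writes standard output
    echo "done" >&2
}

# capture the main output and the diagnostics in separate files
echo "acgtacgt" | fake_tool > result.txt 2> run.log

# the single dash placeholder: cat reads standard input where "-" appears
printf 'header\n' > header.txt
echo "body" | cat header.txt - > combined.txt

cat result.txt run.log combined.txt
```

Keeping diagnostics in their own log file means the main output file contains only data that a downstream program can parse.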
Now let's see how these concepts fit together when running 3rd party tools.
...
Exercise 2-3 bwa mem
Display the bwa mem sub-command usage using the more pager
Expand | ||
---|---|---|
| ||
Just typing bwa mem | more doesn't use the more pager! That's because bwa writes its usage information to standard error, not to standard output. So you have to use the funky 2>&1 syntax before piping to more: bwa mem 2>&1 | more |
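The effect of 2>&1 on a pipe can be seen with a stand-in command. In this sketch show_usage is a hypothetical function that, like bwa, prints its usage text on standard error; wc -l counts how many lines actually travel through the pipe.

```bash
# show_usage is a made-up stand-in for a tool that writes usage to standard error
show_usage() {
    echo "Usage: tool [options] <idxbase> <in.fq>" >&2
    echo "  -t INT   number of threads" >&2
    echo "  -o FILE  output file" >&2
}

# a pipe carries only standard output, so wc receives nothing
# (the usage text still appears on the terminal, bypassing the pipe)
without_merge=$(show_usage | wc -l | tr -d ' ')

# 2>&1 redirects standard error into standard output before the pipe
with_merge=$(show_usage 2>&1 | wc -l | tr -d ' ')

echo "lines through the pipe without 2>&1: $without_merge"
echo "lines through the pipe with 2>&1:    $with_merge"
```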
Where does the bwa mem sub-command write its output?
Expand | ||
---|---|---|
| ||
The bwa mem usage says:
This does not specify an output file, so it must write its alignment information to standard output. |
...
How can this be changed?
Expand | ||
---|---|---|
| ||
The bwa mem options usage says:
|
bwa mem also writes diagnostic progress as it runs, to standard error.
Expand | ||
---|---|---|
| ||
The bwa mem options usage says:
|
...
|
Show how you would invoke bwa mem to capture both its alignment output and its progress diagnostics. Use input from a my_fastq.fq file and ./refs/hg38 as the <idxbase>.
...
The cutadapt adapter trimming command reads NGS sequences from a FASTQ file, and writes adapter-trimmed reads to a FASTQ file. Find its usage.
Expand | ||
---|---|---|
| ||
cutadapt # overview; tells you to run cutadapt --help for details
Note that it also points you to https://cutadapt.readthedocs.io/ for full documentation.
|
Where does cutadapt write its output by default? How can that be changed?
Expand | ||
---|---|---|
| ||
The cutadapt usage says that output can be written to a file using the -o option
But the brackets around [-o output.fastq] suggest this is optional. Reading a bit further we see:
This suggests output can be specified in 2 ways:
|
Where does cutadapt read its input from by default? How can that be changed? Can the input FASTQ be in compressed format?
Expand | ||
---|---|---|
| ||
The cutadapt usage says an input.fastq file is a required argument:
But again, reading a bit further we see:
This says that the input.fastq file can be provided in one of three compression formats. And the usage also suggests input can be specified in 2 ways:
|
Where does cutadapt write its diagnostic output by default? How can that be changed?
Expand | ||
---|---|---|
| ||
The cutadapt usage doesn't say anything directly about diagnostics:
But again, reading in the Output: options section:
Careful reading of this suggests that: When
|