Overview:
This portion of the class is devoted to making sure we are all starting from the same point on lonestar. This tutorial was developed as a combined version of multiple other tutorials which were previously given credit here. Anyone wishing to use this tutorial is welcome to do so.
This is probably the longest tutorial in the entire class. It is designed to take between 1/2 and 3/4 of the first class. Do not stress if you feel people are moving through it faster than you are, or if you do not get it done before the next presentation. There will be links back to this tutorial from other tutorials as needed, and by the 2nd half of Wednesday's class when we start with the specialized tutorials, you can circle back to this tutorial as well.
Objectives:
- Familiarize yourself with the way course material will be presented.
- Log into lonestar5.
- Change your lonestar profile to the course specific format.
- Refresh understanding of basic linux commands with some course organization.
- Review use of the nano text editor program, and become familiar with several other text editor programs.
Example things you will encounter in the course:
As this is the first real tutorial you are encountering in this course, some housekeeping matters to familiarize you with how information will be presented.
Code blocks
There will be 4 types of code blocks used throughout this class. Text inside of code blocks represents at least 1 possible correct answer, and should either be typed EXACTLY into the terminal window as it is, or copy-pasted. The notable exception is that text between <> symbols represents something that you need to replace before sending it to the terminal. Yes, the <> marks themselves also need to be replaced. We try to put informative text within the brackets so you know what to replace it with. If you are ever unsure of what to replace the <> text with, just ask.
- Visible
- These are code blocks that you would have no idea what to type without help. (like when a new command is being introduced)
- These will typically be associated with longer/more detailed text above the text box explaining things.
An example code block showing you the command you need to type into the prompt to list what directory you are currently in:
pwd
- Hinted
- These are code blocks that you can probably figure out what to type with a hint that goes beyond what the tutorial is requesting. Access the hint by clicking the triangle or hint hyperlink text.
- These exist to force you to think about what command you need, and hopefully make some connections to help you remember what you will need to type in the future.
- These should all come with additional explanation as to what is going on.
- Rather than just expanding these by reflex, I strongly suggest seeing if you can figure out what the command will be, and checking your work
Example:
- Hidden:
- These code blocks represent things that you should have seen several times already, or things that can be succinctly explained.
Example:
Speed bump:
- This combines the previous 2 types to deliberately slow you down and be cumbersome.
- If you find yourself consistently wrong about what eventually shows up in the text box, slow down, step back, think about what's going on, and consider asking a question.
- These should only come after you have seen the same (or very similar) commands in the other formats previously
Example:
Warnings
Why do the tutorials have warnings?
Warnings exist for 2 reasons:
- Something you are about to do can have negative impact on you
- You saw an example of this talking about paying attention to warnings when using ssh to access new remote computers
- Something you are about to do can have negative impacts on others
- this will be related mostly to the use of "idev" sessions beginning tomorrow.
Info boxes
These are used to give more general background about things
These are somewhat new this year, and feedback on them is welcomed.
Tip boxes
Things I wish I knew sooner
As an example: On the command line, you can use the tab key to try to autofill the "rest" of whatever you are typing, whether it is the name of a directory, a long file name, or even a command. Hitting tab twice will list all possible matches to whatever you have already typed when there are multiple different possibilities.
Tutorial:
Logging into lonestar5
Hopefully you were able to log into ls5 last week as part of the pre-class assignment. If not make sure the instructor is aware as there are additional elements that still need to be addressed (potentially adding you to the project allocation and definitely being added to the reservation that we will use starting tomorrow).
When prompted enter your password, and digital security code from the app, and answer "yes" to the security question if you see one.
As a reminder, the ssh command, and instructions for launching the programs that give you a prompt to type it into, were provided as part of the pre-class assignment. Convenient links in case you need them or want to refresh your memory:
Setting up your lonestar profile
There are many flavors of Linux/Unix shells. The default for TACC's Linux (and most other Linux distributions) is bash (the "Bourne Again Shell"), which we will use throughout.
Whenever you login via an interactive shell as you did above, a well-known script is executed by the shell to establish your favorite environment settings. I've set up a common profile for you to start with that will help you know where you are in the file system and make it easier to access some of our shared resources. If you already have a profile set up on lonestar that you like, we want to make sure that we don't destroy it but it is critical to make sure that we change it temporarily so everyone is working from the same place through the class. Use the ls command to check if you have a profile already set up in your home directory.
If you already have a .profile or .bashrc file, use the mv command to change the name to something descriptive (for example ".profile_pre_GVA_backup"). Otherwise continue to creating new files.
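The backup step can be practiced safely before touching your real files. The following is a minimal sketch that works in a throwaway directory; the demo_home directory and the sample file contents are assumptions for illustration only. On lonestar you would run the same ls and mv commands in your home directory.

```shell
# Practice the backup step in a throwaway directory (hypothetical stand-in for $HOME)
demo_home=$(mktemp -d)
echo "# my old settings" > "$demo_home/.profile"   # pretend an existing profile is present

ls -a "$demo_home"                                  # -a is needed to see names that start with a "."

# Rename each profile file only if it exists, so nothing is overwritten or lost
for f in .profile .bashrc; do
    if [ -e "$demo_home/$f" ]; then
        mv "$demo_home/$f" "$demo_home/${f}_pre_GVA_backup"
    fi
done

backup_listing=$(ls -a "$demo_home")                # the renamed file is listed, the old name is gone
echo "$backup_listing"
rm -rf "$demo_home"                                 # remove the practice directory
```

Checking that a file exists before renaming it avoids an error message when only one of the two profile files is present.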
A warning about deleting files
Most of us are used to having an 'undo' button, trash/recycling collection of deleted files, or warnings when we tell a computer to do something that can't be undone. The command line offers none of these options. In extreme situations on TACC, you can use the help desk ticket system to recover a deleted file, but there is no guarantee files can be recovered under normal circumstances (we will cover exceptions to this later).
The specific warning right now is that if you have an existing profile, and have not done the above commands correctly, you will not be able to recover your existing profile. Thus this is a great opportunity to interact with your instructor and make 100% sure the above steps have been correctly performed. Type ls -al onto the command line and then share your screen on zoom if you are not sure.
Now that we have backed up your profiles so you won't lose any previous settings, you can copy our predefined GVA2020.bashrc file from the /corral-repl/utexas/BioITeam/scripts/ folder to your $HOME folder as .bashrc, and the predefined GVA2020.profile from the same location as .profile. Then use the chmod command to change the permissions of both files to read, write, and execute for the user only.
The chmod 700 <FILE> command marks the file as readable/writable/executable only by you. The .bashrc script file will not be executed unless it has these permissions settings.
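To see what chmod 700 does to a file's permissions, here is a small sketch that can be run anywhere. The file name example.bashrc is a hypothetical stand-in created in a temporary directory, not one of the course files.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/example.bashrc"          # hypothetical file standing in for .bashrc

# 7 = read(4) + write(2) + execute(1) for you; 0 for your group; 0 for everyone else
chmod 700 "$demo_dir/example.bashrc"

# The first 10 characters of a long listing show the file type and permissions
perms=$(ls -l "$demo_dir/example.bashrc" | cut -c1-10)
echo "$perms"                             # -rwx------
rm -rf "$demo_dir"
```

The three digits map to user, group, and others in that order, which is why only the first set of rwx letters is filled in.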
Understanding why some files start with a "."
In the above code box, you see that the file names start with a ".". When a filename starts with a ".", it conveys a special meaning to the operating system/command line. Specifically, it prevents that file from being displayed when you use the ls command unless you specifically ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.
Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:
ls # ignore everything that comes after the # mark. There is a problem on this wiki page, but things after a # won't affect commands
ls -a
ls -a -1
ls -a -l
Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually when you have multiple options supplied in this manner you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.
ls -a -l
ls -al
While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.
For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.
flag | association |
---|---|
-a | "all" files |
-l | "long" listing of file information |
-1 | 1 column |
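The behavior of dot-files and of combined options can be checked with a short experiment. This is a sketch in a temporary directory; the file names visible.txt and .hidden_dotfile are made up for the demonstration.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden_dotfile"

plain=$(ls "$demo_dir")              # dot-files are left out
all=$(ls -a "$demo_dir")             # dot-files (plus . and ..) are included
separate=$(ls -a -l "$demo_dir")     # options given one at a time
combined=$(ls -al "$demo_dir")       # the same options behind a single dash

echo "$plain"
echo "$all"
if [ "$separate" = "$combined" ]; then
    echo "ls -a -l and ls -al give identical output"
fi
rm -rf "$demo_dir"
```

Be careful to distinguish the letter l ("long") from the number 1 ("1 column") when typing these options, as they look alike in many fonts.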
Getting back to your profile... Since .bashrc is executed when you login, to ensure it is set up properly you should first logout:
then log back in:
If everything is working correctly you should now see this as your prompt:
tacc:~$
If you see anything besides "tacc:~$", get my attention and be ready to share your screen rather than continuing forward.
Setting up other shortcuts:
In order to make navigating to the different file systems on lonestar a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into lonestar with a terminal, from your home directory.
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam
Several people report seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied."
This is being investigated, but is not expected to impact today's tutorial.
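If symbolic links are new to you, the following sketch shows how ln -s behaves using made-up names in a temporary directory. The directory far_away_storage is a hypothetical stand-in for a location such as $SCRATCH, and shortcut is the link that points to it.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/far_away_storage"                        # hypothetical stand-in for a location like $SCRATCH
echo "hello" > "$demo_dir/far_away_storage/note.txt"

# ln -s <existing target> <name of the new link>
ln -s "$demo_dir/far_away_storage" "$demo_dir/shortcut"

ls -l "$demo_dir"                                         # the link is listed with an arrow pointing at its target

is_link=no
if [ -L "$demo_dir/shortcut" ]; then is_link=yes; fi      # -L tests whether a name is a symbolic link
via_link=$(cat "$demo_dir/shortcut/note.txt")             # files are reachable through the link
echo "$is_link $via_link"
rm -rf "$demo_dir"
```

The link only stores the location of the target, so removing the link does not remove the files it points to.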
Understanding what your .bashrc file actually does.
Editing files
There are a number of options for editing files at TACC. These fall into three categories:
- Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first local text editor. If you are already familiar with one of the other programs you are welcome to continue using it.
- Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
- Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.
We'll go over nano together in class, but you may find these other options more useful for your day-to-day work so feel free to go over these sections in your free time to familiarize yourself with their workings to see if one is better for you.
As we will be using nano throughout the class, it is a good idea to review some of the basics. nano is a very simple editor available on most Linux systems. If you are able to use ssh, you can use nano. To invoke it, just type:
nano
You'll see a short menu of operations at the bottom of the terminal window. The most important are:
- ctl-o - write out the file
- ctl-x - exit nano
You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:
- ctl-a - go to start of line
- ctl-e - go to end of line
Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in our commands files, and if you copy paste code into a nano editor.
What can you do to see contents of a file without opening it for editing?
Command | useful for | bad if |
---|---|---|
head | seeing the first lines of a file (10 by default) | file is binary |
tail | seeing the last lines of a file (10 by default) | file is binary |
cat | print all lines of a file to the screen | the file is big and/or binary |
less | opens the entire file in a separate program but does not allow editing | if you are going to type a new command based on the content, or forget the q key exits the view, or file is binary |
more | prints 1 page worth of a file to the screen, can hold enter key down to see next line repeatedly. Contents will remain when you scroll back up. | you forget that the q key is what stops looking at the file, or file is binary |
Note that all of the above state that it is bad to view binary files. Binary files exist for computers to read, not humans, and are thus best ignored. We'll go over this in more detail as well as some conversion steps when we deal with .sam and .bam files later in the course.
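A quick way to get a feel for these commands is to try them on a file whose contents you already know. This sketch builds a 100-line practice file of numbers in a temporary location; nothing here depends on course data.

```shell
demo_file=$(mktemp)
seq 1 100 > "$demo_file"                # a 100-line file containing the numbers 1 to 100

first_lines=$(head "$demo_file")        # the first 10 lines by default
last_two=$(tail -n 2 "$demo_file")      # -n changes how many lines are shown
line_count=$(wc -l < "$demo_file")      # count the lines before deciding whether cat is a good idea

echo "$first_lines"
echo "$last_two"
echo "$line_count"
rm -f "$demo_file"
```

Checking the size of a file with wc -l first is a cheap way to avoid flooding your screen with cat.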
How should we name files and folders?
In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. After that there are some tips:
- The most important thing to get used to is the convention of using . or _ or capitalizing the first letter in each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for mac and windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named I, and a third named Team. Now this is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is telling it to do what you want it to do".
- Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks, months, or years.
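The effect of spaces in names is easy to verify for yourself. The sketch below runs mkdir with and without quotes inside a temporary directory; the unquoted and quoted folder names are only there to keep the two experiments apart.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/unquoted" "$demo_dir/quoted"

# Without quotes, each space starts a new argument: 3 separate folders are made
( cd "$demo_dir/unquoted" && mkdir Bio I Team )

# With quotes, the spaces become part of a single folder name
( cd "$demo_dir/quoted" && mkdir "Bio I Team" )

unquoted_count=$(ls "$demo_dir/unquoted" | wc -l)
quoted_count=$(ls "$demo_dir/quoted" | wc -l)
echo "$unquoted_count $quoted_count"
rm -rf "$demo_dir"
```

Quoting works, but you then have to remember the quotes every time you refer to that folder, which is why names without spaces are less trouble.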
Stringing commands together and controlling their output
In a linux shell, it is often useful to take the output of one command and save it to a new file rather than having it print to the screen. Linux uses a familiar metaphor for this: "pipes". The linux operating system expects some "standard input pipe" and gives output back through a "standard output pipe". These are called "stdin" and "stdout" in linux. There's also a special "stderr" for errors; we'll ignore that for now. Usually, your shell is filling the operating system's stdin with stuff you type - the commands with options. The shell passes responses back from those commands to stdout, which the shell usually dumps to your screen. The ability to switch stdin and stdout around is one of the key reasons linux has existed for decades and beat out many other operating systems. Let's start making use of this. Change to the scratch directory, make a new folder called "piping", and put a list of the full contents of the $BI folder into a new file called whatisHere.
cds
mkdir piping
ls -1 $BI > whatisHere
cat whatisHere
When you execute the ls -1 $BI > whatisHere command, you should have noticed nothing happened on the screen, and when you cat the whatisHere file, you should notice the output you would have expected from the ls -1 $BI command. Often it is useful to chain commands together using the output of the first command as the input of the second command. Commands are chained together using the "|" character (shift + \, above the return key). Use a pipe and redirection to put the first 2 lines of the $BI directory contents into the whatisHere file.
Again, you should see your answer only showing up after the cat command. Note that by using a single > you are overwriting the existing contents. This is now your second warning that there is no warning when a file is about to be overwritten or deleted; also remember linux doesn't have the "undo" feature or trash/recycle bin functionality you may be used to from mac/windows. We will make use of the redirect output (stdout) character (>) and the "pass output along as input" character (|) throughout the course. Not all shells are equal - the bash shell lets you redirect stdout with either > or 1>; stderr can be redirected with 2>; you can redirect both stdout and stderr using &>. If these don't work, use google to try to figure it out. The web site stackoverflow is a usually trustworthy and well annotated site for OS and shell help.
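The redirection and pipe characters described above can be tried out on a small practice file. This is a sketch using made-up file names (fruit.txt, top2.txt, errors.txt) in a temporary directory rather than the $BI folder, so it can be run anywhere.

```shell
demo_dir=$(mktemp -d)
printf 'banana\napple\ncherry\n' > "$demo_dir/fruit.txt"        # > creates the file (or overwrites it)

# | feeds the stdout of sort into the stdin of head; > then saves what head prints
sort "$demo_dir/fruit.txt" | head -n 2 > "$demo_dir/top2.txt"
echo "date" >> "$demo_dir/top2.txt"                             # >> appends instead of overwriting

# stderr is a separate stream: the error message goes into errors.txt, not onto the screen
ls "$demo_dir/no_such_file" 2> "$demo_dir/errors.txt" || true

top2=$(cat "$demo_dir/top2.txt")
error_size=$(wc -c < "$demo_dir/errors.txt")
echo "$top2"
rm -rf "$demo_dir"
```

Had the second command used > instead of >>, the sorted lines would have been silently replaced by the single word, which is exactly the kind of loss the warning above is about.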
Understanding TACC
Now that we've been using lonestar for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.
Diagram of Lonestar5 directories: What connects to what, how fast, and for how long.
Lonestar is a collection of 1,252 computers, each with 24 cores, connected to three file servers, each with unique characteristics. You need to understand the file servers to know how to use them effectively.
Property | $HOME | $WORK | $SCRATCH |
---|---|---|---|
Purged? | No | No | Files can be purged if not accessed for 10 days. |
Backed Up? | Yes | No | No |
Capacity | 5GB | 1TB | Basically infinite. |
Commands to Access | cdh or cd $HOME/ | cdw or cd $WORK/ | cds or cd $SCRATCH/ |
Purpose | Store Executables | Store Files and Programs | Run Jobs |
Executables that aren't available on TACC through the "module" command should be stored in $HOME.
If you plan to be using a set of files frequently or would like to save the results of a job, they should be stored in $WORK.
If you're going to run a job, it's a good idea to keep your input files in a directory in $WORK and copy them to a directory in $SCRATCH where you plan to run your job.
cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/
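The staging pattern in the command above can be rehearsed without using any real data. In this sketch the demo_work and demo_scratch directories are hypothetical stand-ins for $WORK and $SCRATCH, and the sample file names are invented.

```shell
demo_work=$(mktemp -d)        # hypothetical stand-in for $WORK
demo_scratch=$(mktemp -d)     # hypothetical stand-in for $SCRATCH

mkdir -p "$demo_work/my_fastq_data" "$demo_scratch/my_project"
touch "$demo_work/my_fastq_data/sample1.fastq" \
      "$demo_work/my_fastq_data/sample2.fastq" \
      "$demo_work/my_fastq_data/notes.txt"

# Only names ending in fastq match the wildcard, so notes.txt is not copied
cp "$demo_work"/my_fastq_data/*fastq "$demo_scratch/my_project/"

staged=$(ls "$demo_scratch/my_project")
originals=$(ls "$demo_work/my_fastq_data" | wc -l)    # cp leaves the originals in place
echo "$staged"
rm -rf "$demo_work" "$demo_scratch"
```

Because cp copies rather than moves, the originals stay in the stand-in for $WORK even if the scratch copy is later purged.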
Understanding "jobs" and compute nodes.
When you log into lonestar using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone that is logged into lonestar (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, or the more commands you send at once the slower the head node will work for you and everybody else. Get enough people running large jobs on the head node all at once (say several class rooms full of summer school students) and lonestar can actually crash leaving nobody able to execute commands or even log in for minutes -> hours -> perhaps even days if something goes really wrong. To try to avoid crashes, TACC tries to monitor things and proactively stop things before they get too out of hand. If you guess wrong on if something is safe to run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages will lead to revoked TACC access and emails where you have to explain what you are doing to TACC and your PI and how you are going to fix it and avoid it in the future.
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ... Please do not run scripts or programs that require more than a few minutes of CPU time on the login nodes. Your current running process below has been killed and must be submitted to the queues, for usage policy see http://www.tacc.utexas.edu/user-services/usage-policies/ If you have any questions regarding this, please submit a consulting ticket.
So you may be asking yourself what the point of using lonestar is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are 1,252 compute nodes that can only be accessed by a single person for a specified amount of time. These compute nodes are divided into different queues called: normal, development, largemem, etc. Access to nodes (regardless of what queue they are in) is controlled by a "Queue Manager" program. You can personify the Queue Manager program as: Heimdall in Thor, a more polite version of Gandalf in Lord of the Rings when dealing with the balrog, the troll from the Billy Goats Gruff tale, or any other "gatekeeper" type. Regardless of how nerdy your personification choice is, the Queue Manager has an interesting caveat: you can only interact with it using the sbatch command. "sbatch <filename.slurm>" tells the Queue Manager to run a set job based on information in filename.slurm (i.e. how many nodes you need, how long you need them for, how to charge your allocation, etc). The Queue Manager doesn't care WHAT you are running, only HOW to find what you are running (which is specified by a setenv CONTROL_FILE commands line in your filename.slurm file). The WHAT is then handled by the file "commands" which contains what you would normally type into the command line to make things happen.
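To make the split between HOW and WHAT more concrete, the sketch below writes a stripped-down, hand-written example of what a .slurm file can look like and then inspects it. This is an illustration only: the job name, task count, and allocation placeholder are assumptions, and a real file generated for this course will contain additional lines. Nothing is submitted to the queue.

```shell
demo_dir=$(mktemp -d)

# Write a simplified, hypothetical .slurm file (the quoted 'EOF' keeps the text exactly as typed)
cat > "$demo_dir/filename.slurm" <<'EOF'
#!/bin/csh
#SBATCH -J my_job_name        # job name
#SBATCH -N 1                  # number of nodes requested
#SBATCH -n 12                 # total number of tasks
#SBATCH -p development        # queue to submit to
#SBATCH -t 00:02:00           # time requested, hh:mm:ss
#SBATCH -A <allocation>       # allocation to charge
setenv CONTROL_FILE commands  # the file listing WHAT to run
EOF

# The #SBATCH lines are the HOW; the CONTROL_FILE line points at the WHAT
directive_count=$(grep -c '^#SBATCH' "$demo_dir/filename.slurm")
control_line=$(grep 'CONTROL_FILE' "$demo_dir/filename.slurm")
echo "$directive_count"
echo "$control_line"
rm -rf "$demo_dir"
```

Reading a .slurm file this way, directives first and control file second, is a quick check that a job is asking for what you intended before you submit it.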
Further sbatch reading
To make things easier on all of us, there is a script called launcher_creator.py that you can use to automatically generate a .slurm file. This can all be summarized in the following figure:
Using launcher_creator.py
The BioITeam created a Python script called launcher_creator.py that makes creating a .slurm file a breeze. Before learning to work with interactive compute nodes during the class, we will show you how you will most often do your analysis. Run the launcher_creator.py script with the -h option to show the help message so we can see what other options the script takes:
Short option | Long option | Required | Description |
---|---|---|---|
-n | name | Yes | The name of the job. |
-t | time | Yes | Time allotment for job, format must be hh:mm:ss. |
-j | job | j and/or b must be used | Filename of list of commands to be distributed to nodes. |
-b | bash commands | j and/or b must be used | Optional string of Bash commands to execute before commands are distributed. |
-q | queue | Default: Development | The queue to submit to, like 'normal' or 'largemem', etc. |
-a | allocation | | The allocation you want to charge the run to. |
-m | modules | Optional | String of module management commands, semicolon separated, i.e. "module load bowtie2; module load fastqc" |
-w | wayness | Optional | The number of jobs in a job list you want to give to each node. (Default is 12 for Lonestar, 16 for Stampede.) |
-N | number of nodes | Optional | Specifies a certain number of nodes to use. You probably don't need this option, as the launcher calculates how many nodes you need based on the job list (or Bash command string) you submit. It sometimes comes in handy when writing pipelines. |
-e | | Optional | Your email address if you want to receive an email from Lonestar when your job starts and ends. |
-l | launcher | Optional | Filename of the launcher. (Default is |
-s | stdout | Optional | Setting this flag outputs the name of the launcher to stdout. |
The lines highlighted in green and yellow are what you should focus on:
- Job: Not required by the script, but likely always will be. This file has a list of commands that you want to run, and is probably the biggest part of what makes tacc great... letting you run a large number of commands all at once rather than having to run them 1 at a time on your own computer.
- Time and name: They are required so obviously important for that reason. Also names are important as mentioned above naming everything 'tacc_job' will make your life much more confusing at some point.
- Wayness: 12 is the default, 48 is the max on lonestar.
- This is a balance of memory and to a lesser extent speed against how long you want to wait for your job to run and SU cost.
- Imagine you have 96 commands you want to run. 12, 24, and 48 as the "w" option will require 8, 4, and 2 remote computers to run at the same time, 2 computers are typically available sooner than a set of 4 and 4 are available sooner than a set of 8. So your job is likely to spend less time 'in the queue' with larger numbers.
- We'll go over "SUs" near the end of the course, but for now it is sufficient to say they are a currency for tacc, and that smaller numbers will cost more SUs
- For bacterial work, 48 is a much better choice in nearly all situations, unless you are experiencing problems or know you have a massive amount of data.
- The downside is that if you pick a number that is too large, your commands may have errors and not actually produce results, requiring you to start over with longer requests or smaller "w" values.
- queue: Development and normal are the 2 you are likely to deal with most often. "Normal" is a better choice unless you are developing new code or have odd turn around time requirements.
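The relationship between wayness and the number of nodes in the 96-command example is a rounded-up division, which can be sketched with shell arithmetic. The variable names are made up for this illustration; launcher_creator.py does this calculation for you.

```shell
commands=96
nodes_needed=""
for wayness in 12 24 48; do
    # Round up: a partly filled node still has to be requested as a whole node
    nodes=$(( (commands + wayness - 1) / wayness ))
    echo "wayness $wayness -> $nodes nodes"
    nodes_needed="$nodes_needed $nodes"
done
```

Changing commands to a number that does not divide evenly, such as 100, shows why the rounding up matters.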
Running a job
If a comment has been made over zoom that we'll be starting the second presentation soon
Consider skipping the rest of the tutorials on this page, and jump over to this tutorial on transferring files between tacc and your local computer. You will have plenty of time to come back to this tutorial on Wednesday/Thursday/Friday when we begin working with the specialized tutorials, as well as running jobs on tacc being covered in its own tutorial on the last half of Friday's course.
Now that we have an understanding of what the different parts of running a job are, let's actually run a job. Move to your scratch directory, make a new folder called "my_first_job" (Remember not to use spaces in file/folder names), make a new file called "commands" inside of that directory using nano, and put 4-12 lines with 1 command on each line in that file, being sure to remember to redirect the output to 1 or more files.
# remember that things after the # sign are ignored by bash
# lines in code blocks often will scroll to the right if you have a narrow browser window
cds # move to your scratch directory
mkdir my_first_job # make a new folder called "my_first_job"
cd my_first_job # move into the new folder to make it easier to create a file there
nano commands
cat commands > commands.out # this will print the contents of the file you are currently editing to a new file called commands.out
date > date.out # this will create a file with today's date in it
pwd > current_directory.out # this will create a file with the current directory in it
echo "my name is <YOURNAME>" >> name.out # Note that this time we used the append symbol >> not the write symbol > as we plan to put multiple things into the same file. Be sure to replace the <> signs with your name
echo "This is the final result of my first script. It worked how I thought it would, or hopefully have the resources to figure out why it didn't" >> name.out # this will add another line of text to the name.out file.
# feel free to add up to 7 more lines to your commands file here using the cat/ls/pwd/mkdir/other commands that you know.
# beware using cd commands here as it will change your directory as if you were doing it on an interactive node and may cause you to reference files that don't exist
Best practice consideration for working with nano
In the next code box you see the top line is commented out but says to hit 'ctrl-o' 'ctrl-x' to write and exit nano.
Since files that you open with nano are able to be edited immediately, it is a good idea to get in the habit of only saving files when you explicitly know you meant to edit them with the ctrl-o command (control + o) and then when you hit ctrl-x (control + x) nano exits gracefully.
Conversely, if you open a file with nano with the intent of just looking at it or decide not to make any changes, or want to get rid of all your changes, you can hit ctrl-x and exit nano without saving the changes.
If you instead choose to exit nano with ctrl-x and then select 'save' you risk building a habit of always saving when you exit and thus may introduce edits to your files you didn't mean to.
# write and exit nano now: ctrl-o ctrl-x
launcher_creator.py -n "my_first_job" -j commands -t 00:02:00 -a "UT-2015-05-18" # this will create a my_first_job.slurm file that will run for 2 minutes
sbatch my_first_job.slurm # this will actually submit the job to the Queue Manager and if everything has gone right, it will be added to the development queue.
Interrogating the launcher queue
Here are some of the common commands that you can run and what they will do or tell you:
Command | Purpose | Output(s) |
---|---|---|
showq -u | Shows only your jobs | Shows all of your currently submitted jobs, a state of: "qw" means it is still queued and has not run yet "r" means it is currently running |
scancel <job-ID> | Delete a submitted job before it is finished running note: you can only get the job-ID by using showq -u | There is no confirmation here, so be sure you are deleting the correct job. There is nothing worse than deleting a job that has sat a long time by accident because you forgot something on a job you just submitted. |
showq | You are a nosy person and want to see everyone that has submitted a job | Typically a huge list of jobs, and not actually informative |
If the queue is moving very quickly you may not see much output, but don't worry, there will be plenty of opportunity once you are working on your own data.
Evaluating your first job submission
Based on our example you may have expected 4 new files to have been created during the job submission, but you will also find 3 extra files as follows: <job_name>.e(job-ID), <job_name>.pe(job-ID), and <job_name>.o(job-ID). When things have worked well, these files are typically ignored. When your job fails, these files offer insight into the why so you can fix things and resubmit.
Many times while working with NGS data you will find yourself with intermediate files. Two of the more difficult challenges of analysis can be trying to decide what files you want to keep, and remembering what each intermediate file represents. Your commands files can serve as a quick reminder of what you did so you can always go back and reproduce the data. Using arbitrary endings (.out in this case) can serve as a way to remind you what type of file you are looking at. Since we've learned that the scratch directory is not backed up and is purged, see if you can turn your 4 intermediate files into a single final file using the cat command, and copy the new final file, the .slurm file you created, and the 3 extra files to your work directories. This way you should be able to come back and regenerate all the intermediate files if needed, and also see your final product.
# remember that things after the # sign are ignored by bash
cat *.out > first_job_submission.final.output # Remember that the * wildcard will take things in alpha order, if you want you can list each file separately to control what order they go into the new file.
mkdir $WORK/GVA_2020
mkdir $WORK/GVA_2020/Day1
mkdir $WORK/GVA_2020/Day1/first_tacc_job # each directory must be made in order to avoid getting a no such file or directory error
cp first_job_submission.final.output $WORK/GVA_2020/Day1/first_tacc_job
cp *.slurm $WORK/GVA_2020/Day1/first_tacc_job
cp *<job-ID> $WORK/GVA_2020/Day1/first_tacc_job # your job-id is the string of numbers following the .o and .e filenames
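Two details of the commands above can be explored in a practice directory: the alphabetical order in which the * wildcard picks up files, and an optional shortcut, mkdir -p, which creates any missing parent directories in a single step. This sketch uses made-up file names and a temporary directory in place of $WORK.

```shell
demo_dir=$(mktemp -d)
echo "b" > "$demo_dir/second.out"
echo "a" > "$demo_dir/first.out"

# The shell expands *.out alphabetically, so first.out is read before second.out.
# The combined file does not end in .out, so the wildcard does not match it.
cat "$demo_dir"/*.out > "$demo_dir/combined.final.output"

# mkdir -p makes GVA_2020, then Day1, then first_tacc_job if they do not already exist
mkdir -p "$demo_dir/GVA_2020/Day1/first_tacc_job"
cp "$demo_dir/combined.final.output" "$demo_dir/GVA_2020/Day1/first_tacc_job/"

combined=$(cat "$demo_dir/GVA_2020/Day1/first_tacc_job/combined.final.output")
echo "$combined"
rm -rf "$demo_dir"
```

Making each directory one at a time, as in the course commands, gives the same result; mkdir -p simply saves typing once you are comfortable with the structure.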
Transferring files to and from lonestar with scp
In most years, a small tutorial on transferring files between lonestar and your local computer is included here. Given this year's reliance on zoom, and the number of students who have a hard time with the scp command, this tutorial has been moved to its own tutorial page so that it can be referenced more easily when files are to be transferred in future tutorials. Consider moving beyond the cookbook tutorial that is provided and, instead of transferring the README file from the BioITeam, transfer the first_job_submission.final.output file you created above. Once done with the transfer tutorial come back to this page to install a few extra programs and learn about the module system.
scp tutorial page.
Moving beyond the preinstalled commands on TACC
If (or when) you looked at what our edits to the .bashrc file did, you would have seen that section 1 has a series of "module load XXXX" commands, and a promise to talk more about them later. I'm sure you will be thrilled to learn that now is that time... As a "classically trained wet-lab biologist", one of the most difficult things I have experienced in computational analysis has been installing new programs to improve my analysis. Programs and their installation instructions tend (or appear) to be written by computational biologists in what at times feels like a foreign language, particularly when things start going wrong. Luckily TACC (and the BioITeam) help get around a large number of these problems by preinstalling many programs, if you know where to look.
After explaining the module system which we will use extensively throughout the course, we'll install 3 separate programs that we may use later in the class via 3 different means. This is an incomplete list of ways to install new programs to use, but is meant to be a good working example that you can adapt to install other programs in your future work. If you choose to do one of the optional tutorials that involve the programs installed here the program installation will be covered in more detail at that time.
TACC modules
Modules are programs or sets of programs that have been set up to run on TACC. They make managing your computational environment very easy. All you have to do is load the modules that you need and a lot of the advanced wizardry needed to set up the linux environment has already been done for you. New commands just appear.
To see all modules available in the current context, type:
module avail
Remember you can hit the "q" key to exit out of the "more" system, or just keep hitting return to see all of the modules available. The "module avail" command is not the most useful of commands if you have some idea of what you are looking for. For example, imagine you want to align a few million next generation sequencing reads to a genome, but you don't know what your options are. You can use the following command to get a list of programs that may be useful:
module keyword alignment
Note that this may not be an inclusive list as it requires the name of the program, or its description to contain the word "alignment". Looking through the results you may notice some of the programs you already know and use for aligning 2 sequences to each other such as blast and clustalw. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results you will see that one of the new results is:
bowtie: bowtie/2.3.4 Memory-efficient short read (NGS) aligner
This may sound much better, but you still only have limited information about it. To learn more about a particular program, try the following 2 commands:
module spider bowtie
module spider bowtie/2.3.4
In the first case, we see information about what versions of bowtie lonestar has available for us, but really that is just the same information as we had from our previous search. This can be particularly useful when you know what program you want to use but don't know what versions are available. In the second case we now have more detailed information about the particular version of interest including websites we can go to to learn more about the program itself.
Once you have identified the module that you want to use, you load it using the following command:
module load bowtie/2.3.4
Using the version numbers for module commands
While not strictly necessary, using the "/2.3.4" text is a very good habit to get into as it controls what version is to be loaded. In this case the "2.3.4" version is the only (and thus default) version, and "module load bowtie" will behave identically to "module load bowtie/2.3.4", but that will not always be the case, particularly if in the future TACC installs a new version of bowtie.
While it is tempting to only use "module load name" without the version numbers, using the version numbers can help keep track of what versions were used for referencing in your future publications, and make it easier to identify what went wrong when scripts that have been working for months or years suddenly stop working (i.e., TACC changed the default version of a program you are using).
Since the module load command doesn't give any output, it is often useful to check what modules you have loaded with either of the following commands:
module list
module list bowtie
The first example will list all currently loaded modules while the second will only list modules containing bowtie in the name. If you see that you have loaded the wrong version of something, a module is conflicting with another, or you just don't feel like having it turned on anymore, use the following command:
module unload bowtie
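Since keeping track of versions was mentioned above as a reason to use version numbers, one possible habit is to save the module listing next to your results. The following is a minimal sketch, not a TACC-recommended procedure: the log file name is hypothetical, and the check for the module command is only there so the sketch also runs on a computer without a module system. On systems that use Lmod the listing is normally printed to stderr, which is why it is redirected.

```shell
# Hypothetical log file name; pick something that matches your project
log_file="modules_used.log"

{
  echo "Modules loaded on $(date)"
  if type module >/dev/null 2>&1; then
    # The listing normally goes to stderr, so send stderr into the log as well
    module list 2>&1
  else
    echo "module command not found (this computer has no module system)"
  fi
} > "$log_file"

cat "$log_file"
```

Copying a log like this to your $WORK directory along with your final output gives you a record of the versions used.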
You will notice when you type "module list" that you have several different modules loaded already. These come from both TACC defaults (TACC, linux, etc), and several that are used so commonly both in this class and by biologists that it becomes cumbersome to type "module load python3" all the time, and therefore we just have them turned on by default by putting them in our profile to load on startup. As you advance in your own data analysis you may start to find yourself constantly loading modules as well. When you become tired of doing this (or see jobs fail to run because the modules that load on the compute nodes are based on your .bashrc file plus commands given to each node), you may want to add additional modules to your .bashrc file. This can be done using the "nano .bashrc" command from your home directory.
Downloading from the web directly to tacc
This is about using the wget command. wget, which stands for "Web get", is a simple way of downloading a file from a web address to your current directory. Typically this makes use of the "Copy Link Address" option when you right click on a link in a web browser that would otherwise start a download to your computer.
Here we will install the trimmomatic read trimming tool. As we will mention in our next tutorial, trimmomatic is a very robust trimming tool that can be integrated into a standard analysis pipeline, and one of the optional tutorials will go over its use.
In a new web browser window/tab, navigate to the trimmomatic home page. Trimmomatic is far above average as far as programs go: most will not have a user manual, may not have been updated since originally published, etc. This is what makes it such a good tool. For now we will focus on the Downloading Trimmomatic section; right click on the 'binary' link for version 0.39 and copy that link address.
Which to choose binary files or uncompiled source code
Use the wget command to download the link you just copied to a new folder named src in your $WORK directory.
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
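One way to set up the src directory without seeing the "already exists" message is sketched below. This is only one possible answer. The fallback to a temporary directory when $WORK is not set is an assumption added so the sketch can be tried on a computer other than TACC, and the wget line is left as a comment because the link has to be the one you copied from the trimmomatic page.

```shell
# $WORK is defined on TACC systems; the temporary-directory fallback is only
# an assumption so the sketch can also be tried elsewhere.
work_dir="${WORK:-$(mktemp -d)}"

# Create src only when it is missing, which avoids the "already exists" message
[ -d "$work_dir/src" ] || mkdir "$work_dir/src"
cd "$work_dir/src"
pwd

# Replace the placeholder (including the <> marks) with the link you copied:
# wget <copied-link-address>
```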
You should see a download bar showing you the file has begun downloading. When it is complete, the ls command will show you a new compressed file named "Trimmomatic-0.39.zip". Next we need to uncompress this file, and copy the executable file to a location already in our $PATH variable.
unzip Trimmomatic-0.39.zip
cd Trimmomatic-0.39
cp trimmomatic-0.39.jar $HOME/local/bin
If you don't see the zip file or are unable to cd into the 0.39 directory after unzipping it let the instructor know.
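The copy step above relies on $HOME/local/bin being listed in your $PATH variable. A minimal sketch for checking this is shown below. The function name in_path is made up for this example and is not a standard command.

```shell
# in_path is a hypothetical helper: it succeeds when the directory given as
# its first argument is one of the colon-separated entries in $PATH.
in_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if in_path "$HOME/local/bin"; then
  echo "$HOME/local/bin is in your PATH"
else
  echo "$HOME/local/bin is NOT in your PATH"
fi
```

Wrapping $PATH in extra colons before matching means a directory is only reported when the whole entry matches, not when it is part of a longer path.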
As this tutorial is focused only on downloading interesting programs you may read about, this is the final step. If you do the optional trimmomatic tutorial later in the course, we'll go over some of the nuances of trimmomatic and shortcuts to make it easier to use. This page will be updated to include a link to said tutorial later in the week.
wget alternative
It is always possible to download such files directly to your computer using a web browser and then use the scp command to transfer them to TACC. The wget command can help you avoid these intermediate steps and is more convenient most of the time, unless you want to install the program on both your laptop and TACC, and have the same operating system on both.
Github
This is about using the git clone command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website.
Here we will clone the github repository for breseq, which is developed by the Barrick lab here at UT and is used to comprehensively analyze haploid microbial genomes to identify all variants present. In some of the initial tutorials everyone will use a version of breseq that is available through the BioITeam. In the optional tutorials you may compile your own copy of breseq from this github project to underscore why binary files are typically preferred, or as a way of easily staying up to date on new developments with the program itself.
Initially cloning a github repository is exceptionally similar to using the wget command to download the repository: it involves typing 'git clone' followed by a web address where the repository is stored. As we did for installing trimmomatic with wget, we'll clone the repository into a 'src' directory inside of $WORK.
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
In a web browser navigate to github and search for 'breseq' in the top right corner of the page. The top result will be for barricklab/breseq; click the green box for 'clone or download' and either control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.
You will see several download indicators increase to 100%, and when you get your command prompt back the ls command will show a new folder named 'breseq' containing a set of files. If you don't see said directory, or can't cd into that directory, let the instructor know.
As with Trimmomatic, these files will require additional work that is somewhat specific to the program and therefore beyond the scope of this tutorial. A link to the advanced tutorials for getting your own copy of breseq up and running will be added later in the week.
pip
This is about using the pip3 install command. pip is the standard package manager for the common programming language python. When labs put together new analysis programs/packages, increasingly they try to make these programs available for others to use via pip. Using pip3 rather than just pip specifies that packages are installed for python version 3.
Here we will install the multiqc analysis program, which compiles reports from a program called fastqc about the quality of fastq files from multiple different samples at one time. In the later portion of the class you may choose to work with this program to get a better overall view of multiple fastq files all at once rather than clicking through individual reports.
pip3 install --user multiqc
*note that the "--user" option in the above code is required while working on LS5 because individual users do not have access to core systems. If you have python3 on your personal computer and wanted to install multiqc (or any other package available through pip) you would typically omit the "--user" flag.
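If you are curious where the "--user" option places things, python itself can report the location. This is a small sketch that assumes python3 is available on the command line. On Linux systems the reported folder is normally .local inside your home directory, with command-line tools going into its bin subfolder.

```shell
# Ask python where per-user installs go. The command exits with a non-zero
# status when user installs are disabled (for example inside a virtual
# environment), so the status is deliberately ignored here.
user_base=$(python3 -m site --user-base) || true

echo "pip3 install --user normally places command-line tools in: $user_base/bin"
```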
Installation may take a minute or two depending on your internet connection and you will see several progress bars. Eventually you should see a line that starts with "Successfully installed" and then a long list of packages including multiqc-1.9. The additional packages listed are packages that multiqc will use to generate its figures.
which multiqc
multiqc
The first line should return something that starts with /home1/, then has a number and your user id, followed by /.local/bin/multiqc. The second line should tell you that there is an error because you didn't provide an argument for the analysis directory, and also that you are using multiqc version 1.9. If you see other results, try this more complicated installation and recheck the installation:
If you still see something else, let the instructor know.
This concludes the linux and lonestar refresher/introduction tutorial.
Genome Variant Analysis Course 2020 home.