...

This is probably the longest tutorial in the entire class. It is designed to take between 1/2 and 3/4 of the first class. Do not stress if you feel people are moving through it faster than you are, or if you do not get it done before the next presentation. There will be links back to this tutorial from other tutorials as needed, and by the 2nd half of Wednesday's class when we start with the specialized tutorials, you can circle back to this tutorial as well. 

Objectives:

  1. Familiarize yourself with the way course material will be presented.
  2. Log into lonestar5.
  3. Change your lonestar profile to the course specific format.
  4. Refresh understanding of basic linux commands with some course organization.
  5. Review use of the nano text editor program, and become familiar with several other text editor programs.

...

There will be 4 types of code blocks used throughout this class. Text inside code blocks represents at least 1 possible correct answer, and should either be typed EXACTLY into the terminal window as shown, or copied and pasted. There is one notable exception: text between <> symbols represents something that you need to replace before sending it to the terminal. Yes, the <> marks themselves also need to be replaced. We try to put informative text within the brackets so you know what to replace it with. If you are ever unsure of what to replace the <> text with, just ask.

  1. Visible
    1. These are code blocks that you would have no idea what to type without help. (like when a new command is being introduced)
    2. These will typically be associated with longer/more detailed text above the text box explaining things.
    3. An example code block showing you the command you need to type into the prompt to list what directory you are currently in:

      Code Block
      languagebash
      pwd
  2. Hinted
    1. These are code blocks that you can probably figure out what to type with a hint that goes beyond what the tutorial is requesting. Access the hint by clicking the triangle or hint hyperlink text.
    2. These exist to force you to think about what command you need, and hopefully make some connections to help you remember what you will need to type in the future.
    3. These should all come with additional explanation as to what is going on.
    4. Rather than just expanding these by reflex, I strongly suggest seeing if you can figure out what the command will be, and then checking your work.
    5. Example:

      Expand
      titlewhat command would you use to Print your current Working Directory

      In this example the letters P W and D are all capitalized to try to help you focus on the command itself

      Code Block
      languagebash
      pwd 
  3. Hidden:
    1. These code blocks represent things that you should have seen several times already, or things that can be succinctly explained.
    2. Example:

      Code Block
      languagebash
      titleuse the pwd command to print your current working directory
      collapsetrue
      pwd
  4. Speed bump:

    1. This combines the previous 2 types to deliberately slow you down and be cumbersome. 
    2. If you find yourself consistently wrong about what eventually shows up in the text box, slow down, step back, think about what's going on, and consider asking a question.
    3. These should only come after you have seen the same (or very similar) commands in the other formats previously
    4. Example:

      Expand
      titleprint your current working directory

      Remember, the command you need is "pwd".

      Code Block
      languagebash
      titleThis command needs no options
      collapsetrue
      pwd

...

Info
titleUnderstanding why some files start with a "."

In the above code box, you see that the names start with a "." character. When a filename starts with a ".", it conveys a special meaning to the operating system/command line. Specifically, it prevents that file from being displayed when you use the ls command unless you specifically ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.

Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:

Code Block
languagebash
titleStandard output
ls              # ignore everything that comes after the # mark. There is a problem on this wiki page, but things after a # won't affect commands
Code Block
languagebash
titleStandard output plus hidden files
ls -a
Code Block
languagebash
titleStandard output plus hidden files in a single column
ls -a -1
Code Block
languagebash
titleStandard output plus hidden files in a single column with additional information
ls -a -l

Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually when you have multiple options supplied in this manner you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.

Code Block
languagebash
titleStandard output plus hidden files in a single column
ls -a -1

ls -a1          # the last character is the digit 1 (one column), not the letter l

While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.

For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.

flag | association
-a | "all" files
-l | "long" listing of file information
-1 | 1 column
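
To check for yourself that combining single-letter options behaves the same as giving them separately, here is a minimal sketch that works in a throwaway directory. The directory and file names are made up for illustration and are not part of the course material.

```bash
# Work in a temporary directory so nothing in your home directory is touched.
# The file names below are hypothetical examples.
demo_dir=$(mktemp -d)
touch "$demo_dir/visible_file" "$demo_dir/.hidden_file"

separate=$(ls -a -1 "$demo_dir")   # options given one at a time
combined=$(ls -a1 "$demo_dir")     # same options combined behind a single dash

if [ "$separate" = "$combined" ]; then
    echo "ls -a -1 and ls -a1 give the same output"
fi

# Without -a, the dot-file is left out of the listing
plain=$(ls "$demo_dir")
echo "plain ls shows: $plain"

rm -r "$demo_dir"
```

The same comparison should hold for other combinations of the flags in the table.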

...

Code Block
languagebash
titleHow to leave Lonestar by logout or exit from a remote connection
collapsetrue
logout
# or
exit

then log back in:

Code Block
languagebash
titleGo log back in to Lonestar
collapsetrue
ssh <username>@ls5.tacc.utexas.edu

If everything is working correctly you should now see a prompt like this:

No Format
tacc:~$
Warning

If you see anything besides "tacc:~$", get my attention and be ready to share your screen rather than continuing forward.

...

Code Block
titleCreating a shortcut to the main Lonestar working directories
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam

Several people report seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied." This is being investigated, but is not expected to impact today's tutorial.
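
One behavior of ln that may be relevant to the message above: if the link name already exists and points to a directory, ln places the new link inside that directory instead of replacing it. The following is only a sketch using made-up names in a temporary location, and it is not a confirmed diagnosis of the course setup.

```bash
sandbox=$(mktemp -d)           # throwaway location; all names here are hypothetical
mkdir "$sandbox/shared_data"   # stands in for a shared directory such as $BI
cd "$sandbox"

# First run: creates a link called "shortcut" pointing at the directory
ln -s "$sandbox/shared_data" shortcut
first_target=$(readlink shortcut)
echo "after first run:  shortcut -> $first_target"

# Second run of the same command: "shortcut" now exists and resolves to a
# directory, so ln creates the new link INSIDE it, named after the target
ln -s "$sandbox/shared_data" shortcut
nested=$(ls shortcut/)
echo "after second run: shortcut/ contains $nested"

cd - > /dev/null
rm -r "$sandbox"
```

If something similar happened with the BioITeam link, the second attempt would try to write inside the shared directory, which could explain a "Permission denied" message. Treat that as a guess and still report the message if you see it.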

  • Understanding what your .bashrc file actually does.

Expand
titleWhile interesting and useful information to have, understanding it is not critical to variant analysis. I suggest you look through this information after you complete the rest of the tutorial, in your free time, or when you need to modify your profile or bashrc files in the future.
Info

Let's look at what your .bashrc profile actually does. Use the cat command to print contents of the .bashrc file to the screen.

Code Block
languagebash
titlePrint the contents of the .bashrc file to the screen
collapsetrue
cat .bashrc

This will print several lines of text to the terminal window. Let's look at what some of these lines do with a little more information:

  • lines that start with #

    • If a line begins with a # symbol, it is "commented out". Anything after a # symbol will not be executed by any program. Programmers commonly make use of this behavior to leave notes for others, or even themselves at a later date, as to what particular lines of a script are actually doing.
  • Section 1 has multiple lines involving "module load <NAME>"

    • This loads different modules by default. We have included ones that we will use throughout the course and that you will commonly make use of. After we review the use of the nano text editor we'll go into more depth with TACC modules. But for now trust us when we say that not having to load a bunch of modules every time you log into TACC is a good thing.

  • Section 2 has multiple lines starting with "export"

    • The export lines define shell variables, for example BI and PATH. You've already seen how using $BI can come in handy for accessing our shared course directory. As for PATH, that is a well-known environment variable that defines a set of directories where the shell will look when you type in a program's name. Our shared profile adds the common course directories that we copied at the start of this tutorial and your local ~/local/bin directory (which does not exist yet) to the location list. You can see the entire list of locations by doing this:

      Code Block
      languagebash
      titleHow to see where the bash shell looks for programs
      echo $PATH

      As you can see, there are a lot of locations on the path. That's because when you load modules at TACC (see above), that mechanism makes the programs available to you by putting their installation directories on your $PATH.

  • umask 002

    • The umask command is used to set the default permissions of newly created files and directories, limiting the need to use the chmod command. umask functions as the inverse of chmod, meaning that it subtracts its values from the default permissions (777 for directories, 666 for files). In this case the command umask 002 is the equivalent of the command chmod 775 for directories, and chmod 664 for files. In summary, having this command in your .bashrc gives read and write access on all new files you create to both you and your group, while giving read-only access to everyone else.
  • PS1='tacc:\w$ '

    • The PS1='tacc:\w$ ' line is a special setting that tells the shell to display the current directory as part of its prompt. It saves you typing pwd all the time to see where you are in the directory hierarchy. Try using the mkdir command to make a new directory called tmp and change into that directory to see what it does to your prompt.

      Code Block
      languagebash
      titleSee how your prompt reflects your current directory
      collapsetrue
      mkdir tmp
      cd tmp
    • Your prompt should have changed from "tacc:~$" to "tacc:~/tmp$". Your prompt now tells you that you are in the tmp subdirectory of your home directory (~). See if you can figure out how to return to your home directory without expanding the code block. Expand the following code block to see the different ways of returning to your home directory.

      Code Block
      languagebash
      titleHow to return to your home directory
      collapsetrue
      cd
      cdh
      cd $HOME
      cd ~
      cd -

      The last example in the above code block will return you to your previous directory. In this case, that means the home directory, but it can be very useful in other situations when you change directories to do something in 1 place then need to hop back to where you were, or if you mistakenly leave a directory.
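
Going back to the umask 002 line described above, the sketch below shows its effect on a newly created file and directory. It assumes a Linux system with GNU stat (as on TACC), and the file and directory names are made up. Everything runs inside a command substitution, which is a subshell, so your own umask and current directory are left unchanged.

```bash
# $( ... ) runs its contents in a subshell, so the umask and cd below
# do not affect the shell you are typing in.
perms=$(
    umask 002
    scratch_dir=$(mktemp -d)     # hypothetical throwaway location
    cd "$scratch_dir"
    touch new_file
    mkdir new_dir
    # stat -c '%a %n' prints the octal permissions followed by the name
    stat -c '%a %n' new_file new_dir
    cd /
    rm -r "$scratch_dir"
)
echo "$perms"
```

The file should report 664 and the directory 775, matching the chmod equivalents given above.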

...

  • Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first text editor. If you are already familiar with one of the other programs you are welcome to continue using it.
  • Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
  • Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.

...

Expand
MacFuse/MacFusion/TextWrangler for Mac

Want your Lonestar5 files to appear like any other place on your hard drive? You can do this using MacFuse/MacFusion on a Mac.

Want to edit files on TACC without having to use nano? You might want to use TextWrangler, a text editor that can edit files over ssh.

Editing Text Files on TACC: TextWrangler

TextWrangler is a recommended FreeWare text editor for MacOS X. (It even keeps with the theme TACC has going with naming its clusters!) You can use it to directly edit text files on Lonestar with OSXFuse/MacFusion using a nice GUI. It is a much more powerful text editor than TextEdit, and won't trip you up by wrapping lines etc., if you learn to use it.

Even if you cannot install OSXFuse/MacFusion, TextWrangler allows you to edit a remote file via SSH. To do this:

  1. Select File > Open from FTP/SFTP Server...
  2. Type ls5.tacc.utexas.edu, your username, and your password into the appropriate boxes.
  3. Check the SFTP box.
  4. Click connect.
  5. You will now have a file browser window. You can create new files and edit existing files on Lonestar, but won't be able to drag-and-drop copy files.

Tip: Files beginning with a dot (.) like .bashrc are "hidden" and won't show up when you are navigating in Finder (if using OSXFuse/MacFusion). There is a way to turn on showing these files in Finder, but it can get annoying because they will show up everywhere. If you use the TextWrangler "open" command to open a file, there is a box that you can check to show these files.

Connecting to TACC Like a Hard Drive: MacFuse/MacFusion

Here are the steps for an installation:

  1. Download and install FUSE for OS X.
    • Check the option to install the "compatibility layer"
  2. Download MacFusion.
    • Move the app that gets downloaded to your Applications folder
  3. Restart your computer.
  4. Open the MacFusion application.
  5. Click the + menu in the window and select SSHFS. Enter your login information for lonestar. Choose connect. The remote file system will appear in Finder (depending on your settings it may be on the desktop or inside the computer shortcut in the side of a Finder window). You can also click on the mounted volume within MacFusion and choose "Reveal" from the gear menu.

Copying Files To and From TACC: SFTP Clients

If you can't get OSXFuse/MacFusion to work, you can still copy files back and forth between your computer and TACC using a secure FTP (SFTP) client. Some examples of free programs for Mac are:

...

  1. The most important thing to get used to is the convention of using _ or capitalizing the first letter of each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for mac and windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named I, and a third named Team. Now this is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is telling it to do what you want it to do". 
  2. Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks-months-years.
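
The "Bio I Team" scenario in point 1 can be tried safely with a short sketch like the one below. It runs in a temporary directory, and all of the names are only illustrations.

```bash
playground=$(mktemp -d)   # throwaway directory; all names here are illustrative
cd "$playground"

mkdir Bio I Team          # unquoted: bash passes three separate arguments
unquoted_count=$(ls -1 | wc -l)
echo "the unquoted command created $unquoted_count directories"

mkdir "Bio I Team"        # quoted: one directory whose name contains spaces
mkdir Bio_I_Team          # underscore convention: no quoting needed, now or later
total_count=$(ls -1 | wc -l)
echo "the directory now holds $total_count entries"

cd - > /dev/null
rm -r "$playground"
```

Quoting works, but you would have to remember the quotes every single time you refer to that folder, which is why the underscore or capitalization conventions are easier to live with.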

...

When you execute the ls -1 > whatisHere command, you should have noticed that nothing happened on the screen, and when you cat the whatisHere file, you should see the output you would have expected from the ls -1 command. Often it is useful to chain commands together, using the output of the first command as the input of the second command. Commands are chained together using the "|" character (shift \ above the return key). Use a pipe together with redirection to put the first 2 lines of the $BI directory contents into the whatisHere file.
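
As a generic illustration of redirection and chaining (deliberately not the answer for $BI), the sketch below uses a temporary directory with made-up file names. The head command with -n 2 is one way of keeping only the first 2 lines of whatever is piped into it.

```bash
workdir=$(mktemp -d)      # hypothetical scratch location for this demo
mkdir "$workdir/data"
touch "$workdir/data/sample_A.txt" "$workdir/data/sample_B.txt" "$workdir/data/sample_C.txt"

# ">" sends the output to a file, so nothing is printed on the screen
ls -1 "$workdir/data" > "$workdir/listing_all"

# "|" hands the output of ls to head, which keeps only the first 2 lines,
# and ">" then writes those 2 lines to a file
ls -1 "$workdir/data" | head -n 2 > "$workdir/listing_first2"

all_lines=$(wc -l < "$workdir/listing_all")
first2=$(cat "$workdir/listing_first2")
echo "listing_all has $all_lines lines; listing_first2 contains:"
echo "$first2"

rm -r "$workdir"
```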

...

Expand
titleDo you think 'whatisHere' is a good name for a file, considering the information above about giving files/folders names that will make sense later?

Obviously not. "Here" is ambiguous, and "whatis" doesn't immediately tell you that it's actually a list of directory contents.

Code Block
languagebash
titleExample move (mv) commands to rename the file something better
# these are alternative names for the same file; run only one of the following
mv whatisHere BioIteam_contents
mv whatisHere BioIteam_directory_contents
mv whatisHere BioIteam_directory_contents_2020-06

These are what I would consider good, better, and best improvements. Yes, the last one is particularly long, but it is almost guaranteed that you will know exactly what that file is no matter when you next see it.


  • Understanding TACC

Now that we've been using lonestar for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.

...

Code Block
languagebash
titleExample command for copying data from a $WORK directory to $SCRATCH. This command is only an example of something you may use in the future. As you likely do not have any fastq files on $WORK in a folder titled 'my_fastq_data', running this command now would be expected to give a message stating no such file or directory.
 cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/

...

  1. Job: Not required by the script, but likely always will be used. This file has a list of commands that you want to run, and is probably the biggest part of what makes tacc great: it lets you run a large number of commands all at once rather than having to run them 1 at a time on your own computer.
  2. Time and name: They are required, so obviously important for that reason. Names are also important for the reasons mentioned above; naming everything 'tacc_job' will make your life much more confusing at some point.
  3. Wayness: 12 is the default, 48 is the max on lonestar. 
    1. This is a balance of memory and to a lesser extent speed against how long you want to wait for your job to run and SU cost. 
    2. Imagine you have 96 commands you want to run. 12, 24, and 48 as the "w" option will require 8, 4, and 2 remote computers to run at the same time. 2 computers are typically available sooner than a set of 4, and 4 are available sooner than a set of 8. So your job is likely to spend less time 'in the queue' with larger numbers.
    3. We'll go over "SUs" near the end of the course, but for now it is sufficient to say they are a currency for tacc, and that smaller numbers will cost more SUs.
    4. For bacterial work, 48 is a much better choice in nearly all situations, unless you are experiencing problems or know you have a massive amount of data.
    5. The downside is that if you pick a number that is too large, your commands may have errors and not actually produce results, requiring you to start over with longer time requests or smaller "w" values.
  4. Queue: Development and normal are the 2 you are likely to deal with most often. "Normal" is a better choice unless you are developing new code or have odd turnaround time requirements.
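
The arithmetic behind the 96-command example in the wayness discussion can be sketched as follows. The function name nodes_needed is made up for this illustration, and the sketch assumes, as the example does, that the "w" value is the number of commands run at the same time on each computer.

```bash
# nodes_needed is a hypothetical helper, not part of launcher_creator.py.
# It divides the number of commands by the wayness and rounds up, so that
# a partly filled computer is still counted.
nodes_needed() {
    local commands=$1
    local wayness=$2
    echo $(( (commands + wayness - 1) / wayness ))
}

for w in 12 24 48; do
    echo "w=$w needs $(nodes_needed 96 "$w") computers for 96 commands"
done
```

With 96 commands this reproduces the 8, 4, and 2 computers mentioned above. Rounding up matters when the number of commands is not an exact multiple of the wayness.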

Running a job

Tip
titleIf a comment has been made over zoom that we'll be starting the second presentation soon

Consider skipping the rest of the tutorials on this page, and jumping over to this tutorial on transferring files between tacc and your local computer. You will have plenty of time to come back to this tutorial on Wednesday/Thursday/Friday when we begin working with the specialized tutorials, and running jobs on tacc is also covered in its own tutorial in the last half of Friday's class.

The file transfer tutorial can be found here.

...

Code Block
languagebash
titlehow to make a sample commands file
linenumberstrue
# remember that things after the # sign are ignored by bash 
# lines in code blocks like this often will scroll to the right if you have a narrow browser window
cds  # move to your scratch directory
mkdir my_first_job  # make a new folder called "my_first_job"
cd my_first_job  # move into the new folder to make it easier to create a file there
nano commands  
Code Block
languagebash
titleThe following lines should be typed into the nano editor so they will be saved to the new file "commands"
linenumberstrue
cat commands > commands.out  # this will print the contents of the file you are currently editing to a new file called commands.out
date > date.out  # this will create a file containing today's date and the current time
pwd > current_directory.out  # this will create a file with the current directory in it
echo "my name is <YOURNAME>" >> name.out  # Note that this time we used the append symbol >> not the write symbol > as we plan to put multiple things into the same file. Be sure to replace the <> signs with your name
echo "This is the final result of my first script. It worked how I thought it would, or hopefully have the resources to figure out why it didn't" >> name.out  # this will add another line of text to the name.out file.
# feel free to add up to 7 more lines to your commands file here using the cat/ls/pwd/mkdir/other commands that you know.
# beware using cd commands here as it will change your directory as if you were doing it on an interactive node and may cause you to reference files that don't exist
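
The difference between the write symbol > and the append symbol >> used in the commands file can be seen with a small sketch like this one. The file names are illustrative only, and everything happens in a temporary directory.

```bash
outdir=$(mktemp -d)               # throwaway directory; file names are illustrative

echo "first line"  >  "$outdir/overwrite.out"
echo "second line" >  "$outdir/overwrite.out"   # ">" replaces the earlier contents

echo "first line"  >> "$outdir/append.out"
echo "second line" >> "$outdir/append.out"      # ">>" adds to the end of the file

overwrite_lines=$(wc -l < "$outdir/overwrite.out")
append_lines=$(wc -l < "$outdir/append.out")
echo "overwrite.out: $overwrite_lines line(s), append.out: $append_lines line(s)"

rm -r "$outdir"
```

This is why name.out in the commands file uses >> for both echo lines: with > the second echo would wipe out the first.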

...

titleBest practice consideration for working with nano

In the next code box you see the top line is commented out but says to hit 'ctrl-o' 'ctrl-x' to write and exit nano.

Since files that you open with nano can be edited immediately, it is a good idea to get in the habit of saving with the ctrl-o command (control + o) only when you know you meant to edit the file. When you then hit ctrl-x (control + x), nano exits gracefully.

Conversely, if you open a file with nano with the intent of just looking at it or decide not to make any changes, or want to get rid of all your changes, you can hit ctrl-x and exit nano without saving the changes.

If you instead choose to exit nano with ctrl-x and then select 'save' you risk building a habit of always saving when you exit and thus may introduce edits to your files you didn't mean to.

Code Block
languagebash
titleSave the changes you made to the commands file, and submit your first job
linenumberstrue
# write and exit nano now ctrl-o ctrl-x
launcher_creator.py -n "my_first_job" -j commands -t 00:02:00 -a "UT-2015-05-18" # this will create a my_first_job.slurm file that will run for 2 minutes
sbatch my_first_job.slurm  # this will actually submit the job to the Queue Manager and if everything has gone right, it will be added to the development queue.

Interrogating the launcher queue

Here are some of the common commands that you can run and what they will do or tell you:

Command | Purpose | Output(s)
showq -u | Shows only your jobs | Shows all of your currently submitted jobs. A state of "qw" means it is still queued and has not run yet; "r" means it is currently running.
scancel <job-ID> | Delete a submitted job before it is finished running. Note: you can only get the job-ID by using showq -u | There is no confirmation here, so be sure you are deleting the correct job. There is nothing worse than accidentally deleting a job that has sat in the queue a long time because you forgot something on a job you just submitted.
showq | You are a nosy person and want to see everyone that has submitted a job | Typically a huge list of jobs, and not actually informative

If the queue is moving very quickly you may not see much output, but don't worry, there will be plenty of opportunity once you are working on your own data.

...

If (or when) you looked at what our edits to the .bashrc file did, you would have seen that section 1 has a series of "module load XXXX" commands, and a promise to talk more about them later. I'm sure you will be thrilled to learn that now is that time... As a "classically trained wet-lab biologist" one of the most difficult things I have experienced in computational analysis has been in installing new programs to improve my analysis. Programs and their installation instructions tend (or appear) to be written by computational biologists in what at times feels like a foreign language, particularly when things start going wrong. Luckily TACC (and the BioITeam) help get around a large number of these problems by preinstalling many programs if you know where to look.

After explaining the module system, which we will use extensively throughout the course, we'll install 3 separate programs that we may use later in the class via 3 different means. This is an incomplete list of ways to install new programs, but is meant to be a good working example that you can adapt to install other programs in your future work. If you choose to do one of the optional tutorials that involve the programs installed here, the program installation will be covered in more detail at that time.

TACC modules

Modules are programs or sets of programs that have been set up to run on TACC. They make managing your computational environment very easy. All you have to do is load the modules that you need and a lot of the advanced wizardry needed to set up the linux environment has already been done for you. New commands just appear.

To see all modules available in the current context, type:

Code Block
languagebash
titlelist all modules
 module avail

Remember you can hit the "q" key to exit out of the "more" system, or just keep hitting return to see all of the modules available. The "module avail" command is not the most useful of commands if you have some idea of what you are looking for. For example imagine you want to align a few million next generation sequencing reads to a genome, but you don't know what your options are. You can use the following command to get a list of programs that may be useful:

Code Block
languagebash
titleList all modules containing a particular term
module keyword alignment

Note that this may not be an inclusive list as it requires the name of the program, or its description to contain the word "alignment". Looking through the results you may notice some of the programs you already know and use for aligning 2 sequences to each other such as blast and clustalw. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results you will see that one of the new results is:

Code Block
bowtie: bowtie/2.3.4
	Memory-efficient short read (NGS) aligner

 This may sound much better, but you still only have limited information about it. To learn more about a particular program, try the following 2 commands:

Code Block
languagebash
titleGet more information on particular module
module spider bowtie

module spider bowtie/2.3.4

In the first case, we see information about what versions of bowtie lonestar has available for us, but really that is just the same information as we had from our previous search. This can be particularly useful when you know what program you want to use but don't know what versions are available. In the second case we now have more detailed information about the particular version of interest including websites we can go to to learn more about the program itself.

Once you have identified the module that you want to use, you load it using the following command:

Code Block
 module load bowtie/2.3.4
Tip
titleUsing the version numbers for module commands

While not strictly necessary, using the "/2.3.4" text is a very good habit to get into as it controls what version is to be loaded. In this case the "2.3.4" version is the only (and thus default version) and module load bowtie will behave identically to module load bowtie/2.3.4 but that will not always be the case, particularly if in the future TACC installs a new version of bowtie.

While it is tempting to only use "module load name" without the version numbers, using the version numbers can help keep track of what versions were used for referencing in your future publications, and make it easier to identify what went wrong when scripts that have been working for months or years suddenly stop working (i.e., TACC changed the default version of a program you are using).

Since the module load command doesn't give any output, it is often useful to check what modules you have loaded with either of the following commands:

Code Block
module list
module list bowtie

The first example will list all currently loaded modules while the second will only list modules containing bowtie in the name. If you see that you have loaded the wrong version of something, a module is conflicting with another, or you just don't feel like having it turned on anymore, use the following command:

Code Block
 module unload bowtie

You will notice when you type module list that you have several different modules loaded already. These come from both TACC defaults (TACC, linux, etc), and several that are used so commonly both in this class and by biologists that it becomes cumbersome to type "module load python2" all the time, and therefore we just have them turned on by default by putting them in our profile to load on startup. As you advance in your own data analysis you may start to find yourself constantly loading modules as well. When you become tired of doing this (or see jobs fail to run because the modules that load on the compute nodes are based on your .bashrc file plus commands given to each node), you may want to add additional modules to your .bashrc file. This can be done using the "nano .bashrc" command from your home directory.

Downloading from the web directly to tacc

This is about using the wget command. wget, which stands for "web get", is a simple way of downloading a file from a web address to your current directory. Typically this makes use of the "Copy Link Address" option when you right click on a link in a web browser that would otherwise start a download to your computer. 

Here we will install the trimmomatic read trimming tool. As we will mention in our next tutorial, trimmomatic is a very robust trimming tool that can be integrated into a standard analysis pipeline, and one of the optional tutorials will go over its use.

In a new web browser window/tab, navigate to the trimmomatic home page. Trimmomatic is far above average as far as programs go; most will not have a user manual, may not have been updated since originally published, etc. This is what makes it such a good tool. For now we will focus on the Downloading Trimmomatic section; right click on the 'binary' link for version 0.39 and copy that link address.

Info
titleWhich to choose binary files or uncompiled source code
The binary files will be what you want 100 out of 100 times, likely until you begin working with a specific program that you identify bugs in, submit them to the developers, they actually respond (most programs are not in active development), they try to address them, and begin asking you to try using the compiled version to check different scenarios. 

Use the wget command to download the link you just copied to a new folder named src in your $WORK directorymany programs if you know where to look.

After explaining the module system which we will use extensively throughout the course, we'll install 3 separate programs that we may use later in the class via 3 different means. This is an incomplete list of ways to install new programs to use, but is meant to be a good working example that you can adapt to install other programs in your future work. If you choose to do one of the optional tutorials that involve the programs installed here the program installation will be covered in more detail at that time.

TACC modules

Modules are programs or sets of programs that have been set up to run on TACC. They make managing your computational environment very easy. All you have to do is load the modules that you need and a lot of the advanced wizardry needed to set up the linux environment has already been done for you. New commands just appear.

To see all modules available in the current context, type:

Code Block
languagebash
titlelist all modules
module avail

Remember you can hit the "q" key to exit the "more" pager, or just keep hitting return to page through all of the available modules. When you only have a general idea of what you are looking for, "module avail" is not the most useful command. For example, imagine you want to align a few million next generation sequencing reads to a genome, but you don't know what your options are. You can use the following command to get a list of programs that may be useful:

Code Block
languagebash
titleList all modules containing a particular term
module keyword alignment

Note that this may not be an exhaustive list, as it requires the name of the program or its description to contain the word "alignment". Looking through the results you may notice some of the programs you already know and use for aligning 2 sequences to each other, such as blast and clustalw. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results you will see that one of the new results is:

Code Block
bowtie: bowtie/2.3.4
	Memory-efficient short read (NGS) aligner

This may sound much better, but you still only have limited information about it. To learn more about a particular program, try the following 2 commands:

Code Block
languagebash
titleGet more information on particular module
module spider bowtie

module spider bowtie/2.3.4

In the first case, we see which versions of bowtie lonestar has available for us, which here is just the same information we had from our previous search. This form of the command is particularly useful when you know what program you want to use but don't know what versions are available. In the second case we get more detailed information about the particular version of interest, including websites we can go to to learn more about the program itself.

Once you have identified the module that you want to use, you load it using the following command:

Code Block
module load bowtie/2.3.4
Tip
titleUsing the version numbers for module commands

While not strictly necessary, using the "/2.3.4" text is a very good habit to get into as it controls what version is to be loaded. In this case the "2.3.4" version is the only (and thus the default) version, so module load bowtie will behave identically to module load bowtie/2.3.4. That will not always be the case, particularly if in the future TACC installs a new version of bowtie.

While it is tempting to only use "module load name" without the version numbers, using the version numbers can help keep track of what versions were used for referencing in your future publications, and make it easier to identify what went wrong when scripts that have been working for months or years suddenly stop working (i.e., TACC changed the default version of a program you are using).
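One way to keep track of versions for later reference is to save the output of module list next to your results. The following is a minimal sketch, not part of the course material: record_modules and the default file name modules_used.log are hypothetical names chosen for illustration, and the fallback branch exists only so that the sketch also runs on a machine without the module system.

```bash
# Hypothetical helper: save the list of currently loaded modules to a log file.
record_modules() {
    local log_file="${1:-modules_used.log}"
    if type module >/dev/null 2>&1; then
        # Capture both output streams so nothing is missed,
        # whichever stream the module system prints to.
        module list > "$log_file" 2>&1
    else
        echo "module command not available on this system" > "$log_file"
    fi
}

# Example use, run from the directory holding your analysis:
#   record_modules modules_used.log
```

Keeping such a log alongside each analysis makes it easier to look up which versions were loaded when a script last worked.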


Since the module load command doesn't give any output, it is often useful to check what modules you have loaded with either of the following commands:

Code Block
module list
module list bowtie

The first example will list all currently loaded modules, while the second will only list loaded modules containing bowtie in the name. If you see that you have loaded the wrong version of something, a module is conflicting with another, or you just don't feel like having it turned on anymore, use the following command:

Code Block
module unload bowtie

You will notice when you type module list that you have several different modules loaded already. These come both from TACC defaults (TACC, linux, etc) and from modules that are used so commonly, both in this class and by biologists, that it becomes cumbersome to type "module load python3" all the time; we therefore have them turned on by default by putting them in our profile to load on startup. As you advance in your own data analysis you may find yourself constantly loading modules as well. When you become tired of doing this (or see jobs fail to run, because the modules that load on the compute nodes are based on your .bashrc file plus the commands given to each node), you may want to add additional modules to your .bashrc file. This can be done using the "nano .bashrc" command from your home directory.
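As an alternative to editing the file by hand in nano, the same change can be sketched as a small shell function that adds a module load line only if it is not already there. This is an illustrative sketch under assumptions: add_module_to_bashrc is a hypothetical helper name, and it simply appends to the end of the file. Your .bashrc may have a preferred place for module commands, so check the comments in the file before relying on this.

```bash
# Hypothetical helper: append "module load <name>" to a startup file only once.
add_module_to_bashrc() {
    local module_name="$1"
    local rc_file="${2:-$HOME/.bashrc}"   # second argument is optional
    local line="module load ${module_name}"

    touch "$rc_file"                      # make sure the file exists
    # -x matches the whole line, -F treats the text literally, -q stays quiet
    if ! grep -qxF "$line" "$rc_file"; then
        echo "$line" >> "$rc_file"
    fi
}

# Example use:
#   add_module_to_bashrc bowtie/2.3.4
```

Running the function a second time with the same module leaves the file unchanged, so repeated use does not pile up duplicate lines.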

Downloading from the web directly to TACC

This is about using the wget command. wget, which stands for "web get", is a simple way of downloading a file from a web address to your current directory. Typically this makes use of the "Copy Link Address" option when you right click on a link in a web browser that would otherwise start a download to your computer.

Here we will install the trimmomatic read trimming tool. As we will mention in our next tutorial, trimmomatic is a very robust trimming tool that can be integrated into a standard analysis pipeline, and one of the optional tutorials will go over its use.

In a new web browser window/tab, navigate to the trimmomatic home page. Trimmomatic is far above average as programs go: most programs do not have a user manual, may not have been updated since they were originally published, etc. This is what makes it such a good tool. For now we will focus on the Downloading Trimmomatic section; right click on the 'binary' link for version 0.39 and copy that link address.

Info
titleWhich to choose: binary files or uncompiled source code
The binary files will be what you want essentially every time. The likely exception comes only once you are working with a specific program, identify bugs in it, submit them to the developers, the developers actually respond (most programs are not in active development) and try to address them, and they then ask you to try a version compiled from the source code to check different scenarios.

Use the wget command to download the link you just copied to a new folder named src in your $WORK directory.

Code Block
languagebash
titleUsing the mkdir command to create a folder named 'src' inside of your $WORK directory
collapsetrue
cd $WORK
mkdir src
cd src

If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus cannot be created.
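If you would rather not see that message at all, the -p option of mkdir creates the directory only when it is missing and stays silent otherwise. A minimal sketch: on TACC the $WORK variable is already set for you, and the fallback to a temporary directory below is an assumption added only so that the example also runs on other machines.

```bash
# $WORK is provided by TACC; the temporary-directory fallback is only for
# trying this sketch on a machine where $WORK is not defined.
WORK="${WORK:-$(mktemp -d)}"

mkdir -p "$WORK/src"   # creates src if it is missing
mkdir -p "$WORK/src"   # running it again produces no error message
cd "$WORK/src"
pwd                    # confirm where you ended up
```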

Code Block
languagebash
titleThe wget command is very simple. It has 2 parts: 1. the command 'wget', and 2. the location of the file you want to download.
collapsetrue
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

You should see a download bar showing you that the file has begun downloading. When it is complete, the ls command will show you a new compressed file named "Trimmomatic-0.39.zip". Next we need to uncompress this file and copy the executable file to a location already in our $PATH variable.

Code Block
languagebash
unzip Trimmomatic-0.39.zip
cd Trimmomatic-0.39
cp trimmomatic-0.39.jar $HOME/local/bin

If you don't see the zip file, or are unable to cd into the 0.39 directory after unzipping it, let the instructor know.
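The copy step above assumes that $HOME/local/bin is already listed in your $PATH variable. If you want to check that assumption yourself, the following is a minimal sketch; dir_in_path is a hypothetical helper name, not a standard command.

```bash
# Hypothetical helper: succeed if the given directory is listed in $PATH.
dir_in_path() {
    # Wrapping both sides in ":" makes sure we only match whole entries.
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if dir_in_path "$HOME/local/bin"; then
    echo "$HOME/local/bin is in your PATH"
else
    echo "$HOME/local/bin is NOT in your PATH"
fi
```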

As this tutorial is focused only on downloading interesting programs you may read about, this is the final step. If you do the optional trimmomatic tutorial later in the course, we'll go over some of the nuances of trimmomatic and shortcuts to make it easier to use.

wget alternative

You can always download such files to your own computer using a web browser and then use the scp command to transfer them to TACC. The wget command lets you skip these intermediate steps and is more convenient most of the time, unless you want to install the program on both your laptop and TACC and have the same operating system on both.

Github

This is about using the git clone command. Git is a version control tool often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website.

Here we will clone the github repository for breseq, which is developed by the Barrick lab here at UT and is used to comprehensively analyze haploid microbial genomes to identify all variants present. In some of the initial tutorials everyone will use a version of breseq that is available through the BioITeam; in the optional tutorials you may compile your own copy of breseq from this github project, either to underscore why binary files are typically preferred or as a way of easily staying up to date on new developments with the program itself.

Initially cloning a github repository is very similar to using the wget command to download a file: it involves typing 'git clone' followed by the web address where the repository is stored. As we did when installing trimmomatic with wget, we'll clone the repository into a 'src' directory inside of $WORK.

Code Block
languagebash
titleUsing the mkdir command to create a folder named 'src' inside of your $WORK directory
collapsetrue
cd $WORK
mkdir src
cd src

If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus cannot be created.

In a web browser, navigate to github and search for 'breseq' in the top right corner of the page. The top result will be for barricklab/breseq; click the green box for 'clone or download' and either press control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.

Code Block
languagebash
titleOnce you have copied the address and are in the $WORK/src directory clone the repository with 'git clone'
collapsetrue
git clone https://github.com/barricklab/breseq.git

You will see several download indicators increase to 100%, and when you get your command prompt back the ls command will show a new folder named 'breseq' containing a set of files. If you don't see said directory, or can't cd into that directory, let the instructor know.

As with Trimmomatic, these files will require additional work that is specific to the program and therefore beyond the scope of this tutorial. A link to the advanced tutorials for getting your own copy of breseq up and running will be added later in the week.
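A quick way to confirm that the clone worked is to look for the hidden .git folder that git creates inside every cloned repository. The following is a minimal sketch; check_clone is a hypothetical helper name used here only for illustration.

```bash
# Hypothetical helper: report whether a directory looks like a cloned repository.
check_clone() {
    local repo_dir="$1"
    if [ -d "$repo_dir/.git" ]; then
        echo "ok: $repo_dir looks like a git repository"
    else
        echo "problem: $repo_dir is missing or is not a git repository"
        return 1
    fi
}

# Example use, from inside $WORK/src after cloning:
#   check_clone breseq
```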

pip

This is about using the pip3 install command. pip is the standard package manager for the common programming language python. When labs put together new analysis programs/packages, increasingly they try to make these programs available for others to use via pip. Using pip3 rather than just pip makes sure the package is installed for python version 3.

Here we will install the multiqc analysis program which compiles reports from a program called fastqc about the quality of fastq files from multiple different samples at one time. In the later portion of the class you may choose to work with this program to get a better overall view of multiple fastq files all at once rather than clicking through individual reports.

...

If you still see something else, let the instructor know.

The multiqc tutorial can be found here.

This concludes the linux and lonestar refresher/introduction tutorial.

Genome Variant Analysis Course 2020 home.

...