Linux and stampede2 Setup -- GVA2023


Overview:

This portion of the class is devoted to making sure we are all starting from the same point on stampede2. This tutorial was developed as a combined version of multiple other tutorials which were previously given credit here. Anyone wishing to use this tutorial is welcome.

This is probably the longest tutorial in the entire class. It is designed to take between 1/2 and 3/4 of the first class. Do not stress if you feel people are moving through it faster than you are, or if you do not get it done before the next presentation. There will be links back to this tutorial from other tutorials as needed, and by the 2nd half of Wednesday's class when we start with the specialized tutorials, you can circle back to this tutorial as well. 

Class Structure

As mentioned in the email that went out last week, the course is being offered in a hybrid format with some participants being in person and some attending only on zoom. Everyone is welcome to attend in either format on any day. If you have any questions please don't hesitate to reach out. Each day during class, I'll walk around the room while you work on the tutorials and look at various screens to see if I notice any issues individuals are running into, but please just get my attention if you know you are running into a problem. For those on zoom I'll have an ear bud in my ear, so the easiest way to get my attention will be to just unmute and say something, but I will also circle by my computer and check if anyone has sent a chat message about an issue.

Objectives:

  1. Familiarize yourself with the way course material will be presented.
  2. Log into stampede2.
  3. Change your stampede2 profile to the course specific format.
  4. Refresh understanding of basic linux commands with some course organization.
  5. Review use of the nano text editor program, and become familiar with several other text editor programs.

Example things you will encounter in the course:

As this is the first real tutorial you are encountering in this course, here are some housekeeping matters to familiarize you with how information will be presented.

  • Code blocks

There will be 4 types of code blocks used throughout this class. Text inside of code blocks represents at least 1 possible correct answer, and should either be typed EXACTLY into the terminal window as it is, or copy pasted. The notable exception is that text between <> symbols represents something that you need to replace before sending it to the terminal. Yes, the <> marks themselves also need to be replaced. We try to put informative text within the brackets so you know what to replace it with. If you are ever unsure of what to replace the <> text with, just ask.

  1. Visible
    1. These are code blocks that you would likely have no idea what to type without help. This is common when a new command is being introduced, or when things you might be able to guess at are wrong in some way.
    2. These will typically be associated with longer/more detailed text above the text box explaining things.
    3. An example code block showing you the command you need to type into the prompt to list what directory you are currently in:

      pwd
  2. Hinted
    1. These are code blocks that you can probably figure out what to type with a hint that goes beyond what the tutorial is requesting. Access the hint by clicking the triangle or hint hyperlink text.
    2. These exist to force you to think about what command you need, and hopefully make some connections to help you remember what you will need to type in the future.
    3. These should all come with additional explanation as to what is going on.
    4. Rather than just expanding these by reflex, I strongly suggest seeing if you can figure out what the command will be, and then checking your work.
    5. Example:

       what command would you use to Print your current Working Directory

      In this example the letters P W and D are all capitalized to try to help you focus on the command itself

      pwd 
  3. Hidden:
    1. These code blocks represent things that you should have seen or used several times already, or things that can be succinctly explained.
    2. Example:

      use the pwd command to print your current working directory
      pwd
  4. Speed bump:

    1. This combines the previous 2 types to deliberately slow you down and be cumbersome. 
    2. If you find yourself consistently wrong about what eventually shows up in the text box, slow down, step back, think about what's going on, and consider asking a question.
    3. These should only come after you have seen the same (or very similar) commands in the other formats previously
    4. Example:

       print your current working directory

      Remember, the command you need is "pwd".

      This command needs no options
      pwd
  • Warnings

Why do the tutorials have warnings?

Warnings exist for 2 reasons:

  1. Something you are about to do can have negative impact on you
    1. You saw an example of this talking about paying attention to warnings when using ssh to access new remote computers
  2. Something you are about to do can have negative impacts on others
    1. This will be related mostly to the use of "idev" sessions beginning tomorrow.
  • Info boxes

These are used to give more general background about things

These were introduced in the last few years, but despite requests in post-class surveys, not much feedback was provided about them. If you find them useful (or have ideas of how they might be more useful) please remember to mention them in the post-class survey. At the very least, the hope is that they help organize information. The information in these boxes is not needed to complete the tutorials.

  • Tip boxes

Things I wish I knew sooner

Two examples that will help you throughout the course:

  1. On the command line, you can use the tab key to try to autofill the "rest" of whatever you are typing, whether it is the name of the directory, a long file, or even a command. Hitting tab twice will list all possible matches to whatever you have already typed when there are multiple different possibilities. The more you use this, the fewer typos you will have as a typo can't autofill.
  2. You can use the up and down arrows to scroll through your previously typed commands. This can be especially helpful when you have typed a long command and get an error because of a typo: rather than retyping the entire thing and risking a new typo, you can just hit the up arrow and correct the error.


Tutorial:

  • Logging into stampede2

I think everyone was able to log into stampede2 last week as part of the pre-class assignment. If not make sure the instructor is aware as there are additional elements that still need to be addressed (potentially adding you to the project allocation and definitely being added to the reservation that we will use starting tomorrow). 


log into stampede2 with the ssh command
ssh <username>@stampede2.tacc.utexas.edu

When prompted, enter your password and the digital security code from the app, and answer "yes" to the security question if you see one. If you have logged in previously you will not see such a question prompt.

Logging into remote computers

You are blindly told to enter yes here, only because you are given a command above to copy which will take you to a remote computer system that I know to be safe, and as this is an introductory class, it is likely you have not logged into it before. If you have previously logged into this remote computer from the local computer you are sitting at, you will not be issued a security warning prompt.

The same will be true the first time you log into any of the other TACC resources, or any other remote computer. This means that it should be rare that you encounter such a prompt, and rarer still that you are surprised to find one. If you ever see a security warning when logging into somewhere that you use commonly, you should answer no and try to figure out why you were warned. If you are not surprised to encounter it, have figured out why you encountered it, or understand the risks, type "yes" to bypass the security check.



As a reminder, the ssh command, and how to launch the programs that give you a prompt to type it into, were provided as part of the pre-class assignment. Convenient links in case you need them or want to refresh your memory:


  • Setting up your stampede2 profile

There are many flavors of Linux/Unix shells. The default for TACC's Linux (and most other Linuxes) is bash (the Bourne Again SHell), which we will use throughout. I am not aware of any others being used by biologists, so this is likely just something you will always default to.

Whenever you login via an interactive shell as you did above, a well-known script is executed by the shell to establish your favorite environment settings. I've set up a common profile for you to start with that will help you know where you are in the file system and make it easier to access some of our shared resources. If you already have a profile set up on stampede2 that you like, we want to make sure that we don't destroy it but it is critical to make sure that we change it temporarily so everyone is working from the same place through the class. Use the ls command to check if you have a profile already set up in your home directory.

Use ls to check if a particular file exists
cdh
ls .profile
ls .bashrc


If you already have a .profile or .bashrc file, use the mv command to change the name to something descriptive (for example "profile_pre_GVA_backup"). Otherwise continue on to creating new files.

Use mv to change your .profile file to a backup copy
mv .profile profile_pre_GVA_backup
mv .bashrc bashrc_pre_GVA_backup
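The backup step above can also be written so that it only moves files that actually exist and never overwrites an earlier backup. The following is a minimal sketch, not the course's required method: it runs inside a throwaway directory (the demo_home variable is a hypothetical stand-in for your home directory) so that nothing in your real $HOME is touched.

```shell
# Sketch only: runs in a throwaway directory so your real $HOME is untouched.
demo_home="$(mktemp -d)"          # hypothetical stand-in for your home directory
cd "$demo_home" || exit 1
echo "old settings" > .profile    # pretend an earlier .profile exists (no .bashrc here)

for f in .profile .bashrc; do
    backup="${f#.}_pre_GVA_backup"        # .profile -> profile_pre_GVA_backup
    if [ -e "$f" ]; then
        mv -n "$f" "$backup"              # -n: never overwrite an existing backup
        echo "backed up $f to $backup"
    else
        echo "no $f found, nothing to back up"
    fi
done

ls -a    # confirm what is in the directory now
```

Because the check happens before the move, running the loop when no profile exists does nothing instead of printing an error.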

A warning about deleting files

Most of us are used to having an 'undo' button, trash/recycling collection of deleted files, or warnings when we tell a computer to do something that can't be undone. The command line offers none of these options. In extreme situations on TACC, you can use the help desk ticket system to recover a deleted file, but there is no guarantee files can be recovered under normal circumstances (we will cover exceptions to this later).

The specific warning right now is that if you have an existing profile, and have not done the above commands correctly, you will not be able to recover your existing profile. Thus this is a great opportunity to interact with your instructor and make 100% sure the above steps have been performed correctly. Type ls -al on the command line and check that your backup files are listed.


Now that we have backed up your profiles so you won't lose any previous settings, copy our predefined GVA.bashrc file from the /corral-repl/utexas/BioITeam/gva_course/ folder to your $HOME folder as .bashrc, and the predefined GVA.profile from the same location as .profile. Then use the chmod command to change the permissions to read, write, and execute for the user only.

Copy the course provided .profile file and change its name and permissions
cp /corral-repl/utexas/BioITeam/gva_course/GVA.bashrc .bashrc
cp /corral-repl/utexas/BioITeam/gva_course/GVA.profile .profile
chmod 700 .bashrc
chmod 700 .profile


Future reference regarding bashrc and profile files

If these files are updated for future classes, these existing versions that you are working with now will be copied to the same location but listed as "GVA2023" instead of just GVA. This is unlikely to be relevant but if you are working with this 12+ months from now be aware.



The chmod 700 <FILE> command marks the file as readable/writable/executable only by you. The shell has to be able to read the .bashrc script file in order to run it when you log in, and these permission settings guarantee that it can.
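To see what chmod 700 does without touching your real files, here is a small sketch on a scratch file. The file name example_bashrc is hypothetical, and the stat -c option is assumed to be the GNU version found on Linux systems such as stampede2.

```shell
# Sketch only: works on a scratch copy, not your real .bashrc.
demo_dir="$(mktemp -d)"
demo_file="$demo_dir/example_bashrc"    # hypothetical file name
echo "# example settings" > "$demo_file"

chmod 700 "$demo_file"                  # 7 = read+write+execute for you, 0 = nothing for group, 0 = nothing for others
perms="$(stat -c '%a' "$demo_file")"    # GNU stat: print the permissions as octal digits
echo "octal permissions: $perms"
ls -l "$demo_file"                      # the same permissions shown in letter form
```

Each digit is the sum of read (4), write (2), and execute (1), which is why 7 means all three.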

Understanding why some files start with a "."

In the above code box, you see that the file names start with a "." character. When a filename starts with a ".", it conveys a special meaning to the operating system/command line. Specifically, it prevents that file from being displayed when you use the ls command unless you specifically ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.
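A quick way to convince yourself of this is to create one ordinary file and one dot-file in an empty directory and list it both ways. This is only a sketch; the file names are hypothetical and the directory is a temporary one.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
touch visible.txt .hidden_example      # hypothetical file names

plain="$(ls)"        # dot-files are left out
all="$(ls -a)"       # -a includes dot-files, plus the . and .. entries
echo "--- ls ---"
echo "$plain"
echo "--- ls -a ---"
echo "$all"
```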

Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:

Standard output
ls              # ignore everything that comes after the # mark. There is a problem on this wiki page, but things after a # won't affect commands
Standard output plus hidden files
ls -a
Standard output plus hidden files in a single column
ls -a -1
Standard output plus hidden files in a single column with additional information
ls -a -l

Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually when you have multiple options supplied in this manner you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.

Standard output plus hidden files with additional information
ls -a -l

ls -al

While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.
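If you would rather let the computer do the comparison, the sketch below captures the output of the separated and combined forms and checks that they match. It runs in a temporary directory with hypothetical file names so the listing is small.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
touch notes.txt .hidden_example      # hypothetical file names

separate="$(ls -a -l)"               # options given one at a time
combined="$(ls -al)"                 # the same options behind a single dash
if [ "$separate" = "$combined" ]; then
    echo "ls -a -l and ls -al print the same listing"
fi
ls -alh   # a third option, -h, tacked onto the same dash
```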

For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.

flag | association
-a   | "all" files
-l   | "long" listing of file information
-1   | 1 column
-h   | human readable


Getting back to your profile... Since .bashrc is executed when you login, to ensure it is set up properly you should first logout:

How to leave stampede2 by logout or exit from a remote connection
logout
# or
exit

then log back in:

Go log back in to stampede2
ssh <username>@stampede2.tacc.utexas.edu

If everything is working correctly you should now see this as your prompt:  

tacc:~$

It is also likely or expected that upon logging in you see the following:

The following have been reloaded with a version change:
  1) impi/18.0.2 => impi/17.0.3     2) intel/18.0.2 => intel/17.0.4     3) python2/2.7.15 => python2/2.7.14

These messages have to do with some of the core compilers and associated tools on TACC. You could use the module spider commands detailed below to find out more information about any of these modules and track down why such changes might be being made, but they are not concerning.


If you see anything besides "tacc:~$" as your prompt, get my attention rather than continuing forward as something has gone wrong.



  • Setting up other shortcuts:

In order to make navigating to the different file systems on stampede2 a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into stampede2 with a terminal, from your home directory.

Creating a shortcut to the main Stampede2 working directories
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam

In previous years, several people have reported seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied." This seems to be related to different project allocations. I do not think it will be an issue for anyone this year.
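The sketch below shows the same kind of link being made in a temporary directory, with a small guard so that running it twice is harmless. The make_link function and the demo_home and demo_target directories are hypothetical and are not part of the course setup. One situation that can produce a message shaped like the one above is running ln -s a second time when the link already exists: ln then treats the existing link as a directory and tries to create a new link inside it.

```shell
demo_home="$(mktemp -d)"                 # hypothetical stand-in for $HOME
demo_target="$(mktemp -d)"               # hypothetical stand-in for $SCRATCH
cd "$demo_home" || exit 1

make_link() {
    # make_link TARGET NAME: create the link only if NAME is not already taken
    if [ -e "$2" ] || [ -L "$2" ]; then
        echo "$2 already exists, leaving it alone"
    else
        ln -s "$1" "$2"
    fi
}

make_link "$demo_target" scratch
make_link "$demo_target" scratch         # second call is a harmless no-op
readlink scratch                         # prints the location the link points to
```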

  • Understanding what your .bashrc file actually does.

 While this is interesting and useful information to have, understanding it is not critical to variant analysis. I suggest you look through this information after you complete the rest of the tutorial, in your free time, or when you need to modify your profile or bashrc files in the future.

Let's look at what your .bashrc profile actually does. Use the cat command to print contents of the .bashrc file to the screen.

Print the contents of the .bashrc file to the screen
cat .bashrc

This will print several lines of text to the terminal window. Let's look at what some of these lines do with a little more information:

  • lines that start with #

    • Any line that begins with a # symbol is "commented out". Anything after a # symbol will not be executed by any program. Programmers commonly make use of this behavior to leave notes for others, or even for themselves at a later date, as to what particular lines of a script are actually doing.
  • Section 1 has multiple lines involving "module load <NAME>"

    • This loads different modules by default. We have included basic ones that will help with basic TACC things. After we review the use of the nano text editor we'll go into more depth with TACC modules. But for now trust us when we say that not having to load a bunch of modules every time you log into TACC is a good thing.

    • In previous years the module system was used more extensively. We now rely more on miniconda installations for increased portability. If you find yourself working within TACC (or equivalent resources), the module system (or similar systems) can be very advantageous. 
  • Section 2 has multiple lines starting with "export"

    • The export lines define shell variables, for example BI and PATH. You've already seen how using $BI can come in handy for accessing our shared course directory. As for PATH, that is a well-known environment variable that defines a set of directories where the shell will look when you type in a program's name. Our shared profile adds the common course directories that we copied at the start of this tutorial and your local ~/local/bin directory (which does not exist yet) to the location list. You can see the entire list of locations by doing this:

      How to see where the bash shell looks for programs
      echo $PATH

      As you can see, there are a lot of locations on the path. That's because when you load modules at TACC (see above), that mechanism makes the programs available to you by putting their installation directories on your $PATH.

  • umask 002

    • The umask command sets the default permissions of newly created files and directories, limiting the need to use the chmod command. umask functions as the inverse of chmod, meaning that it subtracts its values from the default permissions (777 for directories, 666 for files). In this case the command umask 002 is the equivalent of the command chmod 775 for directories, and chmod 664 for files. In summary, having this command in your .bashrc gives you and your group read and write access to all new files you create, while giving read-only access to everyone else.
  • PS1='tacc:\w$ '

    • The PS1='tacc:\w$ ' line is a special setting that tells the shell to display the current directory as part of its prompt. It saves you typing pwd all the time to see where you are in the directory hierarchy. Try using the mkdir command to make a new directory called tmp and change into that directory to see what it does to your prompt. This page may be useful if you want to further customize your prompt after the course.

      See how your prompt reflects your current directory
      mkdir tmp
      cd tmp
    • Your prompt should have changed from "tacc:~$" to "tacc:~/tmp$". Your prompt now tells you that you are in the tmp subdirectory of your home directory (~). See if you can figure out how to return to your home directory without expanding the code block. Expand the following code block to see the different ways of returning to your home directory.

      How to return to your home directory
      cd
      cdh
      cd $HOME
      cd ~
      cd -

      The last example in the above code block will return you to your previous directory. In this case, that means the home directory, but it can be very useful in other situations when you change directories to do something in 1 place then need to hop back to where you were, or if you mistakenly leave a directory.
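Returning to the umask 002 line discussed in the list above, the subtraction can be checked directly. The sketch below is a hedged illustration rather than part of the course profile: it sets the umask inside a subshell so your own session is unchanged, creates one file and one directory in a temporary location, and reads back their permissions with GNU stat (assumed available, as on Linux).

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1

(
    umask 002                     # inside ( ) so the setting ends with this subshell
    touch new_file                # files start from 666, minus 002
    mkdir new_dir                 # directories start from 777, minus 002
)

file_perms="$(stat -c '%a' new_file)"   # GNU stat, as found on Linux
dir_perms="$(stat -c '%a' new_dir)"
echo "file: $file_perms  directory: $dir_perms"
```

Files begin from 666 rather than 777 because new files are not made executable by default, which is why the same umask gives 664 for files and 775 for directories.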


  • Editing files

There are a number of options for editing files at TACC. These fall into several categories:

  • Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first local text editor. It is also powerful enough that you can still accomplish whatever you are working on, it just might be more difficult if you try to do more complex edits. If you are already familiar with one of the other programs you are welcome to continue using it. If this is something you plan to use long term, it is worth spending the time to learn to rely on something other than nano after this class.
  • A former lab member suggested that VS Code may be the best current platform to combine much of this, and while I trust his experience and suggestion, I don't have personal familiarity with it: https://code.visualstudio.com/docs/remote/ssh .
  • Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
  • Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.

We'll go over nano together in class, but you may find these other options more useful for your day-to-day work so feel free to go over these sections in your free time to familiarize yourself with their workings to see if one is better for you.

 Komodo Edit for Mac and Windows

Komodo Edit is another free, full-featured text editor with syntax coloring for many programming languages and a remote file editing interface. It has versions for both Macintosh and Windows. Download the appropriate install image here.

Once installed, start Komodo Edit and follow these steps to configure it:

  • Configure the default line separator for Unix
    • On the Edit menu select Preferences
    • Select the New Files Category
    • For Specify the end-of-line (EOL) indicator for newly created files select UNIX (\n)
    • Select OK
  • Configure a connection to TACC
    • On the Edit menu select Preferences
    • Select the Servers Category
    • For Server type select SFTP
    • Give this profile the Name of stampede2
    • For Hostname enter stampede2.tacc.utexas.edu
    • Enter your TACC user ID for Username
    • Leave Port and Default path blank
    • Select OK

When you want to open an existing file at stampede2, do the following:

  • Select the File menu -> Open -> Remote File
    • Select your stampede2 profile from the top Server drop-down menu
    • Once you log in, it should show you all the files and directories in your stampede2 $HOME directory
  • Navigate to the file you want and open it
    • Often you will use the work or scratch directory links to help you here

To create and save a new file, do the following:

  • From the Komodo Edit Start Page, select New File
    • Select the file type (Text is good for commands files)
  • Edit the contents
  • Select the File menu -> Save As Other -> Remote File
    • Select your Stampede2 profile from the Server drop-down menu
    • Once you log in, it should show you all the files and directories in your stampede2 $HOME directory
  • Navigate to where you want to put the file and save it
    • Often you will use the work or scratch directory links to help you here
 Notepad++ for Windows

Notepad++ is an open source, full-featured text editor for Windows PCs (not Macs). It has syntax coloring for many programming languages (Python, Perl, shell), and a remote file editing interface.

If you're on a Windows PC download the installer here.

Once it has been installed, start Notepad++ and follow these steps to configure it:

  • Configure the default line separator for Unix
    • In the Settings menu, select Preferences
    • In the Preferences dialog, select the New Document/Default Directory tab.
    • Select Unix in the Format section
    • Close
  • Configure a connection to TACC
    • In the Plugins menu, select NppFTP, then select Focus NppFTP Window. The top bar of the NppFTP panel should become blue.
    • Click the Settings icon (looks like a gear), then select Profile Settings
    • In the Profile settings dialog click Add new
    • Call the new profile stampede
    • Fill in Hostname (stampede2.tacc.utexas.edu) and your TACC user ID
    • Connection type must be SFTP
    • Close

To open the connection, click the blue (Dis)connect icon then select the stampede connection. It should prompt for your password. Once you've authenticated, a directory tree ending in your home directory will be visible in the NppFTP window. You can click the (Dis)connect icon again to Disconnect when you're done.

Since much of the editing we'll do will be in your SCRATCH area at TACC, rather than having to navigate around TACC's complex file system tree, it helps to create symbolic links to your WORK and SCRATCH directories in your home directory. Then you'll be able to get there just by clicking on the scratch or work folder in the Notepad++ Remote directory tree. The "Setting up other shortcuts" section of this tutorial shows how to do this.

 MacFuse/MacFusion/TextWrangler for Mac

Want your stampede2 files to appear like any other place on your hard drive? You can do this using MacFuse/MacFusion on a Mac.

Want to edit files on TACC without having to use nano? You might want to use BBedit, a text editor that can edit files over ssh.

Editing Text Files on TACC: BBedit

BBedit is a recommended FreeWare text editor for MacOS X. You can use it to directly edit text files on stampede2 with OSXFuse/MacFusion using a nice GUI. It is a much more powerful text editor than TextEdit, and won't trip you up by wrapping lines etc., if you learn to use it.

Even if you cannot install OSXFuse/MacFusion, BBedit allows you to edit a remote file via SSH. To do this:

  1. Select File > Open from FTP/SFTP Server...
  2. Type stampede2.tacc.utexas.edu, your username, and your password into the appropriate boxes.
  3. Check the SFTP box.
  4. Click connect.
  5. You will now have a file browser window. You can create new files and edit existing files on stampede2, but won't be able to drag-and-drop copy files.

Tip: Files beginning in a dot (.) like (.bashrc) are "hidden" and won't show up when you are navigating in Finder (if using OSXFuse/MacFusion). There is a way to turn on showing these files in finder, but it can get annoying because they will show up everywhere. If you use the TextWrangler "open" command to open a file, there is a box that you can check to show these files.

Connecting to TACC Like a Hard Drive: MacFuse/MacFusion

Here are the steps for an installation:

  1. Download and install FUSE for OS X.
    • Check the option to install the "compatibility layer"
  2. Download MacFusion.
    • Move the app that gets downloaded to your Applications folder
  3. Restart your computer.
  4. Open the MacFusion application.
  5. Click the + menu in the window and select SSHFS. Enter your login information for stampede2. Choose connect. The remote file system will appear in Finder (depending on your settings it may be on the desktop or inside the computer shortcut in the side of a Finder window). You can also click on the mounted volume within MacFusion and choose "Reveal" from the gear menu.

Copying Files To and From TACC: SFTP Clients

If you can't get OSXFuse/MacFusion to work, you can still copy files back and forth between your computer and TACC using a secure FTP (SFTP) client. Some examples of free programs for Mac are:

As we will be using nano throughout the class, it is a good idea to review some of the basics. nano is a very simple editor available on most Linux systems. If you are able to use ssh, you can use nano. To invoke it, just type:

How to start the nano text editor
nano

You'll see a short menu of operations at the bottom of the terminal window. The most important are:

  • ctl-o - write out the file
  • ctl-x - exit nano

You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:

  • ctl-a - go to start of line
  • ctl-e - go to end of line

Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in our commands files, and if you copy paste code into a nano editor.

 

What can you do to see contents of a file without opening it for editing?

Command | useful for | bad if
head | seeing the first lines of a file (10 by default) | file is binary
tail | seeing the last lines of a file (10 by default) | file is binary
cat | printing all lines of a file to the screen | the file is big and/or binary
less | opening the entire file in a separate program that does not allow editing | you are going to type a new command based on the content, or you forget that the q key exits the view, or file is binary
more | printing 1 page worth of a file to the screen; you can hold the enter key down to see the next line repeatedly, and the contents will remain when you scroll back up | you forget that you hit the q key to stop looking at the file, or file is binary

Note that all of the above state that it is bad to view binary files. Binary files exist for computers to read, not humans, and are thus best ignored. We'll go over this in more detail as well as some conversion steps when we deal with .sam and .bam files later in the course.

Many expect to see something that looks like something out of The Matrix, but unfortunately, you actually just see a bunch of gibberish and it can mess with your terminal. Typically if you accidentally or unknowingly try to view such a file, it's best to just close your terminal window and start a new session.
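For plain text files, the defaults in the table above are easy to check on a small test file. The following sketch builds a 25-line file in a temporary location (the seq command is assumed to be available, as it is on most Linux systems) and shows how the -n option changes how many lines head and tail print.

```shell
demo_file="$(mktemp)"
seq 1 25 > "$demo_file"          # a 25-line text file: 1, 2, ... 25

head "$demo_file" | wc -l        # counts the lines head prints when no -n is given
first_three="$(head -n 3 "$demo_file")"   # -n picks how many lines to show
last_two="$(tail -n 2 "$demo_file")"
echo "$first_three"
echo "$last_two"
```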
  • How should we name files and folders?

In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. After that there are some tips:

  1. The most important thing to get used to is the convention of using . or _ or capitalizing the first letter of each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for Mac and Windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio Informatics Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named Informatics, and a third named Team. Now this is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is correctly telling it to do what you want it to do".
  2. Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks, months, or years.
  3. Prefixing file/folder names with international date format (YYYY-MM-DD) will ensure that listing the contents will print in an order in which they were created. This can be useful when doing the same or similar analysis on new samples as new data is generated.
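Two of the tips above can be tried out safely in a temporary directory. This is a sketch with hypothetical names: the first part shows unquoted spaces turning one intended folder into three, and the second shows that YYYY-MM-DD prefixes list in calendar order no matter what order the items were created in.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1

mkdir Bio Informatics Team        # unquoted spaces: three separate folders
mkdir Bio_Informatics_Team        # underscores: one folder
ls

# Date prefixes in YYYY-MM-DD form sort in calendar order (hypothetical names).
mkdir dated
touch dated/2023-06-05_run dated/2023-05-30_run dated/2022-12-01_run
sorted="$(ls dated)"
echo "$sorted"

today_name="$(date +%Y-%m-%d)_alignment"   # build a name from today's date
echo "$today_name"
```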


 People always ask "but can you name new directories or files to have spaces in them?"

To answer the question, Yes, files/folders can have spaces. This is hidden away to keep you from accidentally thinking that this is a good idea. LET ME STRESS AGAIN this is a horrible habit to get into and will lead to unforced errors.

Instead let's think about this from the prospective of encountering files or directories that you are working with but didn't create that have spaces in them. Assumably because a colleague who didn't take this course sent you some data, and not because you thought it was a good idea personally. Spaces can be "escaped" like many other special characters. Imagine someone sent you directory name "This is really annoying to use but I don't know it yet" to change into that directory you would have to type:

cd This\ is\ really\ annoying\ to\ use\ but\ I\ don\'t\ know\ it\ yet

Notice that the apostrophe also had to be escaped, which should help show you not to use other punctuation.

The tab key will automatically add the escape characters for you, as long as what you have typed so far does not match anything else in the same path.
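
As a small illustration of the escaping described above, the sketch below creates that awkwardly named directory inside a throwaway folder and enters it two ways: with backslash escapes, and with double quotes around the whole name. Quoting is an alternative not covered in the text above; the mktemp folder and variable names are assumptions for the demo.

```shell
# Throwaway directory for the demo.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create the awkward directory; double quotes keep the spaces and apostrophe together.
mkdir "This is really annoying to use but I don't know it yet"

# Option 1: escape each space (and the apostrophe) with a backslash.
cd This\ is\ really\ annoying\ to\ use\ but\ I\ don\'t\ know\ it\ yet
escaped_path=$(pwd)
cd "$demo_dir"

# Option 2: wrap the whole name in double quotes instead.
cd "This is really annoying to use but I don't know it yet"
quoted_path=$(pwd)
cd "$demo_dir"

# Both approaches land in the same place.
echo "$escaped_path"
echo "$quoted_path"
```

Note that names are case sensitive on Linux: typing a lowercase "this" would not match a directory whose name starts with "This".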


  • Understanding TACC

Now that we've been using stampede2 for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.

Diagram of Stampede2 directories: What connects to what, how fast, and for how long.

Stampede2 is a computer cluster connected to three file servers (each with unique characteristics) and other computer infrastructure. For the purpose of this class, and your own work, you only need to understand the basics of the 3 file servers to know how to use them effectively. The 3 servers are named "HOME", "WORK", and "SCRATCH", and we will work with all of them over the next 5 days.


                     $HOME                           $WORK                           $SCRATCH
Purged?              No                              No                              Files can be purged if not accessed for 10 days.
Backed Up?           Yes                             No                              No
Capacity             10GB                            1TB                             Basically infinite.
Commands to Access   cdh  or  cd $HOME/              cdw  or  cd $WORK/              cds  or  cd $SCRATCH/
Purpose              Store Executables               Store Files and Programs        Run Jobs
Time spent           When modifying basic settings   When installing new programs;   When analyzing data
                                                     Storing raw or final data

Executables that aren't available on TACC through the "module" command should be stored in $HOME.

If you plan to use a set of files frequently or would like to save the results of a job, they should be stored in $WORK. While 1TB may seem like a lot of space, you can easily fill it up with just a few sequencing projects, particularly if you store files in a non-compressed manner or wish to store analyzed intermediates. Best practice would be to store the most important files (raw > scripts > final > analysis files) on a system such as Corral, or backed up to something else such as UT Box.
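
To give a feel for how much space compression can save, here is a minimal sketch that builds a small fake fastq file in a throwaway directory, compresses it with gzip, and compares the sizes. The file contents are made up and far more repetitive than real reads, so real data will compress less dramatically; the file and variable names are assumptions for the demo.

```shell
# Throwaway directory for the demo.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Build a fake, highly repetitive fastq-style file (made-up content, demo only).
for i in $(seq 1 2000); do
  printf '@read%s\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' "$i"
done > example.fastq

# Size in bytes before compression; $(( )) strips any padding wc adds.
raw_bytes=$(( $(wc -c < example.fastq) ))

# gzip replaces example.fastq with example.fastq.gz.
gzip example.fastq
gz_bytes=$(( $(wc -c < example.fastq.gz) ))

echo "uncompressed: $raw_bytes bytes"
echo "compressed:   $gz_bytes bytes"

# du -sh summarizes how much space a directory is using.
du -sh "$demo_dir"
```

On stampede2 the same du -sh command pointed at "$WORK" is one way to see how much of your 1TB you are using.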

If you're going to run a job, it's a good idea to keep your input files in a directory in $WORK and copy them to a directory in $SCRATCH where you plan to run your job.

Example command for copying data from a $WORK directory to $SCRATCH. This command is only an example of something you may use in the future. You do not have any fastq files on $WORK yet (or at least not in a folder titled 'my_fastq_data'), so if you tried this command now you would expect a "No such file or directory" message.
 cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/
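
The same copy pattern can be practiced without touching your real directories. In the sketch below, two temporary folders stand in for $WORK and $SCRATCH, and the sample file names are invented for the demo; all of these are assumptions, not course files. It also shows one detail the single command leaves out: the destination folder has to exist before cp can copy several files into it.

```shell
# Stand-ins for $WORK and $SCRATCH so the demo never touches your real directories.
fake_work=$(mktemp -d)
fake_scratch=$(mktemp -d)

# Pretend input data: two fastq files plus one file we do not want copied.
mkdir -p "$fake_work/my_fastq_data"
touch "$fake_work/my_fastq_data/sample1.fastq" "$fake_work/my_fastq_data/sample2.fastq"
touch "$fake_work/my_fastq_data/notes.txt"

# cp will not create the destination folder for you, so make it first.
mkdir -p "$fake_scratch/my_project"

# The wildcard matches only names ending in "fastq", so notes.txt stays behind.
cp "$fake_work"/my_fastq_data/*fastq "$fake_scratch/my_project/"

ls "$fake_scratch/my_project"
```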

Understanding "jobs" and compute nodes.


When you log into stampede2 using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone who is logged into stampede2 (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, or the more commands you send at once, the slower the head node will work for you and everybody else. Get enough people running large jobs on the head node all at once (say a class of summer school students) and stampede2 can actually crash, leaving nobody able to execute commands or even log in for minutes, hours, or perhaps even days if something goes really wrong. To avoid crashes, TACC monitors the head nodes and proactively stops things before they get too out of hand. If you guess wrong about whether something is safe to run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages will lead to revoked TACC access and emails in which you have to explain to TACC and your PI what you were doing, how you are going to fix it, and how you will avoid it in the future.

Example of how you learn you shouldn't have been on the head node
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of
CPU time on the login nodes.  Your current running process below has been
killed and must be submitted to the queues, for usage policy see
http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.

So you may be asking yourself what the point of using stampede2 is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are nearly 6,000 compute nodes with different configurations, each of which can only be accessed by a single person for a specified amount of time. For the duration of the class, each student will interact with a single compute node using an interactive DEVelopment (iDEV) session so that you get immediate feedback as commands run and know when to enter the next command. This is not the typical way you will analyze your own data. Friday's tutorial will deal with the queue system.
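
Before starting anything that might take more than a few minutes, it can help to check which kind of node you are on. The sketch below is a rough heuristic, not an official TACC tool: it assumes login node hostnames begin with "login" (as in the login1.ls4.tacc.utexas.edu message above) and treats everything else as "other". The function name node_kind is made up for this example.

```shell
# Classify a hostname as "login" or "other".
# Assumption: login node hostnames start with "login"; anything else
# (a compute node, your laptop) is reported as "other".
node_kind() {
  case "$1" in
    login*) echo "login" ;;
    *)      echo "other" ;;
  esac
}

# Check the machine you are currently on.
current_host=$(hostname)
echo "$current_host node type: $(node_kind "$current_host")"
```

If the answer is "login", keep what you run short and light.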

While stampede2 is tremendously powerful and will greatly speed up your analysis, it doesn't have much in the way of a GUI (graphical user interface). The lack of a GUI means it can't visualize graphs or other meaningful representations of our data that we are used to seeing. In order to do these types of things, we have to get our data off of stampede2 and onto our own computers. This course uses the scp ("secure copy") command exclusively to move files back to your local computer. As mentioned, there are other programs that can be configured to transfer files back and forth more easily as you progress in your analysis.
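
The general shape of an scp download is sketched below. To keep it safe to paste anywhere, the helper only prints the command instead of running it. The helper name show_scp_download, the username, and the remote path are all placeholders invented for this example; you would substitute your own TACC username and a real file path, and run the printed command from a terminal on your local computer, not on stampede2.

```shell
# Print (not run) an scp command that would pull one file from stampede2.
# Arguments: TACC username, path to the file on stampede2, local destination.
show_scp_download() {
  echo "scp ${1}@stampede2.tacc.utexas.edu:${2} ${3}"
}

# Placeholder username and path; "." means "the folder I am currently in".
show_scp_download my_tacc_username /path/on/stampede2/results.txt .
```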

Transferring files to and from stampede2 with scp

When this class was taught in person, it was helpful to have a small set of steps on transferring files between stampede2 and your local computer, which tended to give people problems. The idea was that some problems on the first day would eventually work themselves out through the week as the scp command was used repeatedly. Given how zoom makes it more difficult for problems to be identified, this material has been moved to its own tutorial page so that it can be referenced more easily when files are to be transferred in future tutorials. For now, focus on transferring the