Overview:
This portion of the class is devoted to making sure we are all starting from the same point on lonestar. This tutorial is adapted from a previous version which allowed for set up on the now decommissioned lonestar4. Portions of this tutorial were adapted from previous versions which can be found here, here, here, here, here, here, and here. Collective thanks to all those that contributed to those works which now appear in a single version. Anyone wishing to use this tutorial is welcome.
...
Text inside of code blocks represents "right" answers, and should either be typed EXACTLY into the terminal window as it is, or copy-pasted, with one notable exception: things that exist within <> symbols represent something that you need to replace before sending it to the terminal. We try to put informative text within the brackets so you know what to replace it with. If you are ever unsure of what to replace the <> text with, just ask.
Before logging onto TACC servers, multi-factor authentication must be set up. Click here for an overview of this process, and click here to begin setting it up.
Using what we have just taught you about code blocks, log into lonestar. Since this is your first code box, it is probably worth expanding even if you know how to log into lonestar already.
...
When prompted, enter your password and the digital security code from the app, and answer "yes" to the security question.
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
mv .profile .profile_pre_bdib_backup
mv .bashrc .bashrc_pre_bdib_backup |
The BioITeam has several useful programs, libraries, and scripts globally available on the head node, but these useful things are not available from any of the compute or interactive nodes. We will explain more about this soon, but for the time being just know that there are things that you only sometimes have access to currently, and we want you to have access to them all the time so we have to copy some things into specific locations to make sure everyone is working with the same set up throughout the course. After you have finished taking the course you may find additional useful things in the BioITeam locations, and the things that you copy may get updated from time to time. On the last day of the course we'll go through how to sync the things you have copied and how to access additional community tools that we won't use in this course, so if you like foreshadowing, you are welcome.
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cd $WORK
mkdir src
mkdir src/BioITeam |
You may have noticed that we executed the mkdir commands sequentially. This is done to make sure that the directory exists before trying to put a new directory inside of it. This leads us to an interesting and important thing to consider. How should we name files and folders? In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. The most important thing to get used to is the convention of using . or _ in names rather than spaces, and limiting your use of any other punctuation. Spaces are great for Mac and Windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine instead of a BioITeam folder you wanted to make it a little easier to read and wanted to call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named I, and a third named Team. Now this is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is telling it to do what you want it to do".
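The "Bio I Team" scenario described above can be tried safely in a throwaway directory. This is only a minimal sketch: the temporary directory from mktemp and the Bio_I_Team name are assumptions made for illustration, not part of the course set up.

```shell
# Work in a throwaway directory so nothing in your real folders is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Unquoted: bash splits on the spaces, so mkdir receives three separate names.
mkdir Bio I Team
ls    # lists three separate directories named Bio, I, and Team

# Using underscores keeps it a single name that is easy to type later.
mkdir Bio_I_Team
ls
```

Nothing here needs to be repeated on lonestar; it only shows why the course directories avoid spaces.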
Expand | |||||
---|---|---|---|---|---|
| |||||
This is hidden away to keep you from accidentally thinking that this is a good idea. If for some reason you encounter spaces in the file names or directories that you are working with (presumably because a colleague sent you some data, and not because you thought it was a good idea personally), spaces can be "escaped" like many other special characters. Imagine someone sent you a directory named "This is really annoying to use, but I don't know it yet". To change into that directory you would have to type:
Notice that the apostrophe also had to be escaped, which should help show you not to use other punctuation. |
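The escaping described above can be sketched as follows. This is a hedged illustration rather than the original command from the tutorial: it recreates the awkward directory name inside a temporary directory (an assumption, so that it runs anywhere), then changes into it once with backslash escapes and once with double quotes, which bash also accepts.

```shell
# Work in a throwaway directory so nothing in your real folders is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Quoting the name keeps mkdir from splitting it into separate words.
mkdir "This is really annoying to use, but I don't know it yet"

# Option 1: escape every space (and the apostrophe) with a backslash.
cd This\ is\ really\ annoying\ to\ use,\ but\ I\ don\'t\ know\ it\ yet
escaped_path=$(pwd)
cd "$demo_dir"

# Option 2: wrap the whole name in double quotes instead.
cd "This is really annoying to use, but I don't know it yet"
quoted_path=$(pwd)
cd "$demo_dir"

# Both approaches land in the same directory.
echo "$escaped_path"
echo "$quoted_path"
```

Either way it is far more typing than a name built with underscores, which is the point of the advice above.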
Now that we have created the directories to copy BioITeam materials into, let's copy the bin, python2.7, lib, local, and perl5 directories from the /corral-repl/utexas/BioITeam directory to your $WORK/src/BioITeam directory. Remember that you want to copy them recursively so you get all the contents of those folders as well.
Now that we have backed up your profiles so you won't lose any previous settings, you can copy our predefined GVA2017.bashrc and GVA2017.profile files from the /corral-repl/utexas/BioITeam/scripts/ folder to your $HOME folder as .bashrc and .profile, then use the chmod command to restrict their permissions to the user only.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
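# move back to your home directory so the relative .bashrc and .profile targets land in $HOME
cd $HOME
cp /corral-repl/utexas/BioITeam/scripts/GVA2017.bashrc .bashrc
cp /corral-repl/utexas/BioITeam/scripts/GVA2017.profile .profile
chmod 700 .bashrc
chmod 700 .profile |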
The chmod 700 <FILE> command marks the file as readable, writable, and executable only by you. bash must at least be able to read .bashrc in order to run it when you log in, and these settings guarantee that while keeping the file private.
Notice that when you do a normal ls to list the contents of your home directory, these files don't appear. That's because they are hidden "dot files": files whose names begin with a dot. To see these hidden files use the -a (all) switch for ls:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ls -a
|
To see even more detail, including file permissions, add the -l (long listing) switch:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ls -la |
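To connect the chmod 700 step with what ls -la reports, here is a small sketch that can be run anywhere. The temporary directory and the file name .example_rc are hypothetical stand-ins for your home directory and .bashrc, so your real files are not touched.

```shell
# Work in a throwaway directory; .example_rc is a hypothetical stand-in for .bashrc.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch .example_rc

# 7 = read+write+execute for you, 0 = nothing for your group, 0 = nothing for everyone else.
chmod 700 .example_rc

ls                    # prints nothing: dot files are hidden by default
ls -la .example_rc    # the first column should begin with -rwx------

# Keep just the ten permission characters for a closer look.
perms=$(ls -la .example_rc | cut -c1-10)
echo "$perms"
```

The first character is the file type (- for a regular file), followed by three rwx triplets for user, group, and others.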
Since .bashrc is executed when you log in, to ensure it is set up properly you should first log out:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
exit
|
then log back in:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ssh <username>@ls5.tacc.utexas.edu
|
If everything is working correctly you should now see a prompt like this:
No Format |
---|
tacc:~$ |
In order to make navigating to the different file systems on lonestar a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into Lonestar with a terminal, from your home directory.
Code Block | ||
---|---|---|
| ||
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam
|
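If symbolic links are new to you, the following sketch shows what ln -s does using two temporary directories. The names demo_home, demo_scratch, and results.txt are hypothetical stand-ins for $HOME, $SCRATCH, and a file you might keep there.

```shell
# Hypothetical stand-ins for $HOME and $SCRATCH so this runs anywhere.
demo_home=$(mktemp -d)
demo_scratch=$(mktemp -d)
touch "$demo_scratch/results.txt"

cd "$demo_home"
# Same form as the course command: ln -s $SCRATCH scratch
ln -s "$demo_scratch" scratch

ls -l scratch       # the listing shows the link and the location it points to
readlink scratch    # prints the path the link points to
ls scratch/         # lists the contents of the target directory: results.txt
```

The link takes up almost no space and the files stay where they are; removing the link with rm scratch does not delete the target directory.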
Understanding what your .bashrc file actually does.
Expand | ||||||
---|---|---|---|---|---|---|
| ||||||
Let's look at what your .bashrc file actually does. Use the cat command to print the contents of the .bashrc file to the screen. This will print several lines of text to the terminal window.
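The commands for copying the BioITeam directories into your $WORK/src/BioITeam directory are:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||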
cd $WORK/src/BioITeam
cp -r /corral-repl/utexas/BioITeam/bin .
cp -r /corral-repl/utexas/BioITeam/python2.7 .
cp -r /corral-repl/utexas/BioITeam/lib .
cp -r /corral-repl/utexas/BioITeam/local .
cp -r /corral-repl/utexas/BioITeam/perl5 .
cp -r /corral-repl/utexas/BioITeam/breseq . |
Some of these copy commands may take a few minutes to complete (the bin directory specifically) and you may see some permissions errors such as the following. This is expected and not concerning.
cp: cannot open `/corral-repl/utexas/BioITeam/bin/smrtanalysis-2.0.1/analysis/lib/python2.7/networkx-1.1-py2.7.egg/networkx/drawing/nx_pydot.pyc' for reading: Permission denied
...
As we will be using nano throughout the class, it is a good idea to review some of the basics. nano is a very simple editor available on most Linux systems. If you are able to use ssh, you can use nano. To invoke it, just type:
Code Block | ||
---|---|---|
| ||
nano
|
You'll see a short menu of operations at the bottom of the terminal window. The most important are:
- ctl-o - write out the file
- ctl-x - exit nano
You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:
- ctl-a - go to start of line
- ctl-e - go to end of line
Warning |
---|
Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in our commands files, as you will see. |
Editing files
There are a number of options for editing files at TACC. These fall into three categories:
- Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano may be the best choice as a first local text editor.
- Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
- Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.
We'll go over nano together in class, but you may find these other options more useful for your day-to-day work, so feel free to go over these sections in your free time to see if one works better for you.
...
Komodo Edit is another free, full-featured text editor with syntax coloring for many programming languages and a remote file editing interface. It has versions for both Macintosh and Windows. Download the appropriate install image here.
Once installed, start Komodo Edit and follow these steps to configure it:
- Configure the default line separator for Unix
- On the Edit menu select Preferences
- Select the New Files Category
- For Specify the end-of-line (EOL) indicator for newly created files select UNIX (\n)
- Select OK
- Configure a connection to TACC
- On the Edit menu select Preferences
- Select the Servers Category
- For Server type select SFTP
- Give this profile the Name of Lonestar
- For Hostname enter ls5.tacc.utexas.edu
- Enter your TACC user ID for Username
- Leave Port and Default path blank
- Select OK
When you want to open an existing file at Lonestar, do the following:
- Select the File menu -> Open -> Remote File
- Select your Lonestar profile from the top Server drop-down menu
- Once you log in, it should show you all the files and directories in your lonestar $HOME directory
- Navigate to the file you want and open it
- Often you will use the work or scratch directory links to help you here
To create and save a new file, do the following:
- From the Komodo Edit Start Page, select New File
- Select the file type (Text is good for commands files)
- Edit the contents
- Select the File menu -> Save As Other -> Remote File
- Select your Lonestar profile from the Server drop-down menu
- Once you log in, it should show you all the files and directories in your lonestar $HOME directory
- Navigate to where you want to put the file and save it
- Often you will use the work or scratch directory links to help you here
...
Notepad++ is an open source, full-featured text editor for Windows PCs (not Macs). It has syntax coloring for many programming languages (Python, Perl, shell), and a remote file editing interface.
If you're on a Windows PC download the installer here.
Once it has been installed, start Notepad++ and follow these steps to configure it:
- Configure the default line separator for Unix
- In the Settings menu, select Preferences
- In the Preferences dialog, select the New Document/Default Directory tab.
- Select Unix in the Format section
- Close
- Configure a connection to TACC
- In the Plugins menu, select NppFTP, then select Focus NppFTP Window. The top bar of the NppFTP panel should become blue.
- Click the Settings icon (looks like a gear), then select Profile Settings
- In the Profile settings dialog click Add new
- Call the new profile lonestar
- Fill in Hostname (ls5.tacc.utexas.edu) and your TACC user ID
- Connection type must be SFTP
- Close
To open the connection, click the blue (Dis)connect icon then select lonestar connection. It should prompt for your password. Once you've authenticated, a directory tree ending in your home directory will be visible in the NppFTP window. You can click the (Dis)connect icon again to Disconnect when you're done.
Since much of the editing we'll do will be in your SCRATCH area at TACC, rather than having to navigate around TACC's complex file system tree, it helps to create symbolic links to your WORK and SCRATCH directories in your home directory. Then you'll be able to get there just by clicking on the scratch or work folder in the Notepad++ Remote directory tree. The ln -s commands for creating these links are covered earlier in this tutorial.
...
Want your Lonestar files to appear like any other place on your hard drive? You can do this using MacFuse/MacFusion on a Mac.
Want to edit files on TACC without having to use nano? You might want to use TextWrangler, a text editor that can edit files over ssh.
Editing Text Files on TACC: TextWrangler
TextWrangler is a recommended FreeWare text editor for MacOS X. (It even keeps with the theme TACC has going with naming its clusters!) You can use it to directly edit text files on Lonestar with OSXFuse/MacFusion using a nice GUI. It is a much more powerful text editor than TextEdit, and won't trip you up by wrapping lines etc., if you learn to use it.
Even if you cannot install OSXFuse/MacFusion, TextWrangler allows you to edit a remote file via SSH. To do this:
- Select File > Open from FTP/SFTP Server...
- Type ls5.tacc.utexas.edu, your username, and your password into the appropriate boxes.
- Check the SFTP box.
- Click connect.
- You will now have a file browser window. You can create new files and edit existing files on lonestar, but won't be able to drag-and-drop copy files.
Tip: Files beginning in a dot (.) like (.profile_user) are "hidden" and won't show up when you are navigating in Finder (if using OSXFuse/MacFusion). There is a way to turn on showing these files in finder, but it can get annoying because they will show up everywhere. If you use the TextWrangler "open" command to open a file, there is a box that you can check to show these files.
Connecting to TACC Like a Hard Drive: MacFuse/MacFusion
Here are the steps for an installation:
- Download and install FUSE for OS X.
- Check the option to install the "compatibility layer"
- Download MacFusion.
- Move the app that gets downloaded to your Applications folder
- Restart your computer.
- Open the MacFusion application.
- Click the + menu in the window and select SSHFS. Enter your login information for lonestar. Choose connect. The remote file system will appear in Finder (depending on your settings it may be on the desktop or inside the computer shortcut in the side of a Finder window). You can also click on the mounted volume within MacFusion and choose "Reveal" from the gear menu.
Copying Files To and From TACC: SFTP Clients
If you can't get OSXFuse/MacFusion to work, you can still copy files back and forth between your computer and TACC using a secure FTP (SFTP) client. Some examples of free programs for Mac are:
Stringing commands together and controlling their output
...
Again, you should see your answer only showing up after the cat command. Note that by using a single > you are overwriting the existing contents, and that there is no warning that this is happening. Beware of this in the future, as linux doesn't have an "undo" feature. We will make use of the redirect output (stdout) character (>) and the "pass output along as input" character (|) throughout the course. Not all shells are equal: the bash shell lets you redirect stdout with either > or 1>; stderr can be redirected with 2>; and you can redirect both stdout and stderr using &>. If these don't work, use google to try to figure it out. The web site Stack Overflow is a usually trustworthy and well annotated site for OS and shell help.
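The redirection and pipe characters discussed above can be tried out with the sketch below. The temporary directory and the file names (answer.txt, found.txt, errors.txt, both.txt) are assumptions made only for this illustration. The last redirect uses > file 2>&1, a more portable spelling of what &> does in bash.

```shell
# Work in a throwaway directory so nothing in your real folders is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo "first"  > answer.txt     # > creates the file, or silently overwrites it
echo "second" > answer.txt     # "first" is now gone, with no warning
echo "third" >> answer.txt     # >> appends instead of overwriting
cat answer.txt                 # prints second, then third

# stdout and stderr are separate streams and can be sent to separate files.
# ls exits with an error because one file is missing, so || true keeps the script going.
ls answer.txt no_such_file.txt 1> found.txt 2> errors.txt || true

# Send both streams to the same file.
ls answer.txt no_such_file.txt > both.txt 2>&1 || true

# A pipe hands one command's stdout to the next command as its input.
line_count=$(cat answer.txt | wc -l | tr -d ' ')
echo "$line_count"
```

Looking at found.txt and errors.txt with cat afterwards shows which stream each line of ls output travelled on.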
Understanding TACC
Now that we've been using lonestar for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.
...
When you log into lonestar using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone that is logged into lonestar (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, the more it will slow down you and everybody else. Get enough people running large jobs on the head node all at once (say a classroom full of Big Data in Biology summer school students) and lonestar can actually crash, leaving nobody able to execute commands or even log in for minutes, hours, or perhaps even days if something goes really wrong. To try to avoid crashes, TACC monitors things and proactively stops them before they get too out of hand. If you guess wrong about whether something should be run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages can lead to revoked TACC access and emails where you have to explain to TACC and your PI what you were doing, how you are going to fix it, and how you will avoid it in the future.
...
Using launcher_creator.py
The BioITeam has created a Python script called launcher_creator.py that makes creating a .slurm file a breeze. Before learning to work with interactive compute nodes during the class, we will show you how you will most often do your analysis. Run the launcher_creator.py script with the -h option to show the help message so we can see what other options the script takes:
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
# remember that things after the # sign are ignored by bash
cat *.out > first_job_submission.final.output # Remember that the * wildcard will take things in alpha order, if you want you can list each file separately to control what order they go into the new file.
mkdir $WORK/BDIB_GVA_2017
mkdir $WORK/BDIB_GVA_2017/Day1
mkdir $WORK/BDIB_GVA_2017/Day1/first_tacc_job # each directory must be made in order to avoid getting a no such file or directory error
cp first_job_submission.final.output $WORK/BDIB_GVA_2017/Day1/first_tacc_job
cp *.slurm $WORK/BDIB_GVA_2017/Day1/first_tacc_job
cp *<job-ID> $WORK/BDIB_GVA_2017/Day1/first_tacc_job #your job-id is the string of numbers following the .o and .e filenames |
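As an aside on the nested mkdir commands above: mkdir has a standard -p option that creates any missing parent directories in one step. The sketch below is not part of the course commands. It uses a temporary directory as a hypothetical stand-in for $WORK and shortened directory names so that it can be run anywhere.

```shell
# Hypothetical stand-in for $WORK so this runs anywhere.
demo_work=$(mktemp -d)

# Without -p this fails, because the parent directory Day1 does not exist yet.
mkdir "$demo_work/Day1/first_tacc_job" 2> /dev/null || echo "mkdir failed: parent directory is missing"

# With -p, mkdir creates any missing parent directories along the way.
mkdir -p "$demo_work/Day1/first_tacc_job"

# Show the nested directories that now exist.
ls -R "$demo_work"
```

Making each level by hand, as the course commands do, has the advantage that a typo in a parent directory name produces an error instead of quietly creating a misspelled directory tree.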
...
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
The cd command is used to change directories, the ls command lists what is in each directory, and the TAB key can be pressed in either to autocomplete paths, or double pressed to display all possible paths. The pwd command displays the full path.
|
...
This concludes the linux and lonestar refresher tutorial.
Big Data In Biology Genome Variant Analysis Course 2017 home.