Linux and stampede2 Setup -- GVA2023
Overview:
This portion of the class is devoted to making sure we are all starting from the same point on stampede2. This tutorial was developed as a combined version of multiple other tutorials which were previously credited here. Anyone wishing to use this tutorial is welcome to.
This is probably the longest tutorial in the entire class. It is designed to take between 1/2 and 3/4 of the first class. Do not stress if you feel people are moving through it faster than you are, or if you do not get it done before the next presentation. There will be links back to this tutorial from other tutorials as needed, and in the 2nd half of Wednesday's class, when we start the specialized tutorials, you can circle back to this tutorial as well.
Class Structure
As mentioned in the email that went out last week, the course is being offered in a hybrid format, with some participants in person and some attending only on zoom. Everyone is welcome to attend in either format on any day. If you have any questions please don't hesitate to reach out. Each day during class, I'll walk around the room while you work on the tutorials and look at various screens to see if I notice any issues individuals are running into, but please get my attention if you know you are running into a problem. For those on zoom, I'll have an ear bud in, so the easiest way to get my attention will be to just unmute and say something, but I will also circle by my computer and check if anyone has sent a chat message about an issue.
Objectives:
- Familiarize yourself with the way course material will be presented.
- Log into stampede2.
- Change your stampede2 profile to the course specific format.
- Refresh understanding of basic Linux commands with some course organization.
- Review use of the nano text editor program, and become familiar with several other text editor programs.
Example things you will encounter in the course:
As this is the first real tutorial you are encountering in this course, some housekeeping matters to familiarize you with how information will be presented.
Code blocks
There will be 4 types of code blocks used throughout this class. Text inside a code block represents at least 1 possible correct answer, and should either be typed EXACTLY into the terminal window as it is, or copy-pasted. The notable exception is that text between <> symbols represents something that you need to replace before sending it to the terminal. Yes, the <> marks themselves also need to be replaced. We try to put informative text within the brackets so you know what to replace them with. If you are ever unsure of what to replace the <> text with, just ask.
- Visible
- These are code blocks that you would likely have no idea what to type without help. This is common when a new command is being introduced, or when things you might be able to guess at are wrong in some way.
- These will typically be associated with longer/more detailed text above the text box explaining things.
An example code block showing you the command you need to type into the prompt to list what directory you are currently in:
pwd
- Hinted
- These are code blocks that you can probably figure out what to type with a hint that goes beyond what the tutorial is requesting. Access the hint by clicking the triangle or hint hyperlink text.
- These exist to force you to think about what command you need, and hopefully make some connections to help you remember what you will need to type in the future.
- These should all come with additional explanation as to what is going on.
- Rather than just expanding these by reflex, I strongly suggest seeing if you can figure out what the command will be, then checking your work.
Example:
- Hidden:
- These code blocks represent things that you should have seen or used several times already, or things that can be succinctly explained.
Example:
Speed bump:
- This combines the previous 2 types to deliberately slow you down; being cumbersome is the point.
- If you find yourself consistently wrong about what eventually shows up in the text box, slow down, step back, think about what's going on, and consider asking a question.
- These should only come after you have seen the same (or very similar) commands in the other formats previously
Example:
Warnings
Why do the tutorials have warnings?
Warnings exist for 2 reasons:
- Something you are about to do can have negative impact on you
- You saw an example of this talking about paying attention to warnings when using ssh to access new remote computers
- Something you are about to do can have negative impacts on others
- This will relate mostly to the use of "idev" sessions beginning tomorrow.
Info boxes
These are used to give more general background about things
These were introduced in the last few years, but despite requests in post-class surveys, not much feedback has been provided about them. If you find them useful (or have ideas for how they might be more useful) please remember to mention them in the post-class survey. At the very least, the hope is that they help organize information. The information in these boxes is not needed to complete the tutorials.
Tip boxes
Things I wish I knew sooner
Two examples that will help you throughout the course:
- On the command line, you can use the tab key to try to autofill the "rest" of whatever you are typing, whether it is the name of a directory, a long file name, or even a command. Hitting tab twice will list all possible matches to whatever you have already typed when there are multiple possibilities. The more you use this, the fewer typos you will have, as a typo can't autofill.
- You can use the up and down arrows to scroll through your previously typed commands. This is especially helpful when you have typed a long command and get an error because of a typo: rather than retyping the entire thing and risking a new typo, you can just hit the up arrow and correct the error.
Tutorial:
Logging into stampede2
I think everyone was able to log into stampede2 last week as part of the pre-class assignment. If not, make sure the instructor is aware, as there are additional elements that still need to be addressed (potentially adding you to the project allocation, and definitely adding you to the reservation that we will use starting tomorrow).
When prompted, enter your password and digital security code from the app, and answer "yes" to the security question if you see one. If you have previously logged in, you will not see such a prompt.
Logging into remote computers
You are blindly told to enter yes here, only because you are given a command above to copy which will take you to a remote computer system that I know to be safe, and as this is an introductory class, it is likely you have not logged into it before. If you have previously logged into this remote computer from the local computer you are sitting at, you will not be issued a security warning prompt.
The same will be true the first time you log into any of the other TACC resources, or any other remote computer. This means that it should be rare that you encounter such a prompt, and rarer still that you are surprised to find one. If you ever see a security warning logging into somewhere that you use commonly, you should answer no and try to figure out why you were warned. If you are not surprised to encounter it, have figured out why you encountered it, or understand the risks, type "yes" to bypass the security check.
As a reminder, the ssh command, and launching programs to give you the prompt to type it into, was covered as part of the pre-class assignment. Convenient links in case you need them or want to refresh your memory:
Setting up your stampede2 profile
There are many flavors of Linux/Unix shells. The default for TACC's Linux (and most other Linuxes) is bash (the Bourne Again SHell), which we will use throughout. I am not aware of any others being used by biologists, so this is likely just something you will always default to.
Whenever you login via an interactive shell as you did above, a well-known script is executed by the shell to establish your favorite environment settings. I've set up a common profile for you to start with that will help you know where you are in the file system and make it easier to access some of our shared resources. If you already have a profile set up on stampede2 that you like, we want to make sure that we don't destroy it but it is critical to make sure that we change it temporarily so everyone is working from the same place through the class. Use the ls command to check if you have a profile already set up in your home directory.
If you already have a .profile or .bashrc file, use the mv command to change the name to something descriptive (for example ".profile_pre_GVA_backup"). Otherwise, continue on to creating the new files.
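If the mv backup step is unfamiliar, here is a sketch of the same rename-instead-of-delete pattern run in a throwaway directory under /tmp, so nothing in your real home directory is touched (the file contents and paths here are stand-ins for the demo):

```shell
# Practice the rename-instead-of-delete pattern somewhere safe first.
rm -rf /tmp/gva_backup_demo && mkdir -p /tmp/gva_backup_demo
cd /tmp/gva_backup_demo
echo "export PS1='old prompt'" > .profile    # stand-in for an existing profile
mv .profile .profile_pre_GVA_backup          # rename rather than delete: recoverable later
ls -a                                        # original name is gone, the backup remains
```

The same two ideas (ls -a to check, mv to rename) are what you would use in your $HOME directory on stampede2.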
A warning about deleting files
Most of us are used to having an 'undo' button, trash/recycling collection of deleted files, or warnings when we tell a computer to do something that can't be undone. The command line offers none of these options. In extreme situations on TACC, you can use the help desk ticket system to recover a deleted file, but there is no guarantee files can be recovered under normal circumstances (we will cover exceptions to this later).
The specific warning right now is that if you have an existing profile and have not done the above commands correctly, you will not be able to recover your existing profile. Thus this is a great opportunity to interact with your instructor and make 100% sure the above steps have been performed correctly. Type ls -al on the command line.
Now that we have backed up your profiles so you won't lose any previous settings, copy our predefined GVA.bashrc file from the /corral-repl/utexas/BioITeam/gva_course/ folder to your $HOME folder as .bashrc, and the predefined GVA.profile from the same location as .profile, before using the chmod command to change the permissions to read and write for the user only.
Future reference regarding bashrc and profile files
If these files are updated for future classes, these existing versions that you are working with now will be copied to the same location but listed as "GVA2023" instead of just GVA. This is unlikely to be relevant but if you are working with this 12+ months from now be aware.
The chmod 700 <FILE> command marks the file as readable/writable/executable only by you. The .bashrc script file will not be executed unless it has these permissions settings.
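If you want to see exactly what chmod 700 sets, here is a quick demonstration on a throwaway file (the stat -c flags shown are the GNU/Linux form, which is what TACC runs; they may differ on a Mac):

```shell
# Create a scratch file and restrict it to the owner only.
rm -f /tmp/gva_chmod_demo
touch /tmp/gva_chmod_demo
chmod 700 /tmp/gva_chmod_demo
stat -c '%a %A' /tmp/gva_chmod_demo   # prints: 700 -rwx------
```

The rwx block appears only in the owner position; group and world get no access at all.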
Understanding why some files start with a "."
In the above code box, you see that the names start with a ".". When a filename starts with a "." it conveys a special meaning to the operating system/command line. Specifically, it prevents that file from being displayed when you use the ls command, unless you specifically ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.
Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:
ls  # ignore everything that comes after the # mark; things after a # won't affect commands
ls -a
ls -a -1
ls -a -l
Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually when you have multiple commands supplied in this manner you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.
ls -a -l
ls -al
While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.
For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.
flag | association |
---|---|
-a | "all" files |
-l | "long" listing of file information |
-1 | 1 column |
-h | human readable |
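To see what the -h flag actually changes, compare the long listing of a small file with and without it. This is a throwaway demo under /tmp; the exact size formatting can vary slightly between ls versions:

```shell
# Make a file of exactly 5120 bytes and list it both ways.
rm -rf /tmp/gva_ls_demo && mkdir /tmp/gva_ls_demo
head -c 5120 /dev/zero > /tmp/gva_ls_demo/five_kb.bin
ls -l  /tmp/gva_ls_demo/five_kb.bin    # size column shows 5120 (bytes)
ls -lh /tmp/gva_ls_demo/five_kb.bin    # size column shows 5.0K (human readable)
```

The difference becomes much more noticeable with multi-gigabyte sequencing files, where counting digits in a raw byte count gets old quickly.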
Getting back to your profile... Since .bashrc is executed when you login, to ensure it is set up properly you should first logout:
then log back in:
If everything is working correctly you should now see this as your prompt:
tacc:~$
It is also likely or expected that upon logging in you see the following:
The following have been reloaded with a version change:
  1) impi/18.0.2 => impi/17.0.3
  2) intel/18.0.2 => intel/17.0.4
  3) python2/2.7.15 => python2/2.7.14
These messages have to do with some of the core compilers and associated tools on TACC. You could use the module spider commands detailed below to find out more information about any of these modules and track down why such changes are made, but they are not concerning.
If you see anything besides "tacc:~$" as your prompt, get my attention rather than continuing forward, as something has gone wrong.
Setting up other shortcuts:
In order to make navigating to the different file systems on stampede2 a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into stampede2 with a terminal, from your home directory.
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam
In previous years, several people have reported seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied."
This seems to be related to different project allocations. I do not think it will be an issue for anyone this year.
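If it isn't clear what those ln -s commands actually create, the same idea can be demonstrated locally in a throwaway directory (all names here are made up for the demo):

```shell
# A symbolic link is just a named pointer to another location.
rm -rf /tmp/gva_ln_demo && mkdir -p /tmp/gva_ln_demo/real_target
cd /tmp/gva_ln_demo
ln -s /tmp/gva_ln_demo/real_target shortcut
ls -l shortcut         # long listing shows: shortcut -> /tmp/gva_ln_demo/real_target
touch shortcut/hello   # writing "through" the link lands in real_target
ls real_target         # hello
```

On stampede2 the same mechanism is what lets you type cd scratch from your home directory and land on the $SCRATCH file system.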
Understanding what your .bashrc file actually does.
Editing files
There are a number of options for editing files at TACC. These fall into three categories:
- Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first local text editor. It is also powerful enough that you can still accomplish whatever you are working on; more complex edits may just be more difficult. If you are already familiar with one of the other programs you are welcome to continue using it. If this is something you plan to use long term, it is worth spending the time after this class to learn to rely on something other than nano.
- A former lab member suggested that vs code may be the best current platform to combine much of this, and while I trust his experience and suggestion I don't have personal familiarity with it https://code.visualstudio.com/docs/remote/ssh .
- Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
- Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.
We'll go over nano together in class, but you may find the other options more useful for your day-to-day work, so feel free to go over those sections in your free time and see if one is better for you.
As we will be using nano throughout the class, it is a good idea to review some of the basics. nano is a very simple editor available on most Linux systems. If you are able to use ssh, you can use nano. To invoke it, just type:
nano
You'll see a short menu of operations at the bottom of the terminal window. The most important are:
- ctl-o - write out the file
- ctl-x - exit nano
You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:
- ctl-a - go to start of line
- ctl-e - go to end of line
Be careful with long lines: sometimes nano will split long lines into more than one line, which can cause problems in our commands files and when you copy-paste code into a nano editor.
What can you do to see contents of a file without opening it for editing?
Command | useful for | bad if |
---|---|---|
head | seeing the first lines of a file (10 by default) | file is binary |
tail | seeing the last lines of a file (10 by default) | file is binary |
cat | print all lines of a file to the screen | the file is big and/or binary |
less | opens the entire file in a separate program but does not allow editing | if you are going to type a new command based on the content, or forget the q key exits the view, or file is binary |
more | prints 1 page worth of a file to the screen; you can hold the enter key down to see the next line repeatedly. Contents will remain when you scroll back up. | you forget that you can hit the q key to stop looking at the file, or file is binary |
Note that all of the above state that it is bad to view binary files. Binary files exist for computers to read, not humans, and are thus best ignored. We'll go over this in more detail as well as some conversion steps when we deal with .sam and .bam files later in the course.
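A quick way to convince yourself of the head/tail defaults and behaviors is to run them on a small generated file (the path here is a throwaway for the demo):

```shell
# seq prints the numbers 1..100, one per line, into a scratch file.
seq 1 100 > /tmp/gva_view_demo.txt
head -n 3 /tmp/gva_view_demo.txt    # first 3 lines: 1 2 3
tail -n 2 /tmp/gva_view_demo.txt    # last 2 lines: 99 100
wc -l /tmp/gva_view_demo.txt        # cat would dump all 100 lines to the screen
```

With no -n flag, both head and tail show 10 lines, matching the defaults in the table above.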
How should we name files and folders?
In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. After that there are some tips:
- The most important thing to get used to is the convention of using ".", "_", or capitalizing the first letter of each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for mac and windows folder names when you are using visual interfaces, but on the command line a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio Informatics Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named Informatics, and a third named Team. This is behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is correctly telling it to do what you want it to do".
- Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks, months, or years.
- Prefixing file/folder names with the international date format (YYYY-MM-DD) will ensure that listing the contents prints them in the order in which they were created. This can be useful when doing the same or similar analysis on new samples as new data is generated.
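Both the spaces problem and the date-prefix tip are easy to demonstrate in a throwaway directory (all names here are invented for the demo):

```shell
# Spaces split one intended name into three separate names.
rm -rf /tmp/gva_names_demo && mkdir /tmp/gva_names_demo
cd /tmp/gva_names_demo
mkdir Bio Informatics Team      # bash sees THREE names, so three folders appear
ls -1 | wc -l                   # prints 3
# Date prefixes keep listings in chronological order.
mkdir 2022-12-01_sample1 2023-06-05_sample2
ls -1                           # the date-prefixed names list in creation order
```

Quoting ("Bio Informatics Team") would keep the spaces as one name, but the underscore and date-prefix conventions above avoid the problem entirely.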
Understanding TACC
Now that we've been using stampede2 for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.
Diagram of Stampede2 directories: What connects to what, how fast, and for how long.
Stampede2 is a computer cluster connected to three file servers (each with unique characteristics), and other computer infrastructure. For the purpose of this class, and your own work, you only need to understand the basics of the 3 file servers to know how to use them effectively. The 3 servers are named "HOME", "WORK", and "SCRATCH", and we will work with all of them over the next 5 days.
$HOME | $WORK | $SCRATCH | |
---|---|---|---|
Purged? | No | No | Files can be purged if not accessed for 10 days. |
Backed Up? | Yes | No | No |
Capacity | 10GB | 1TB | Basically infinite. |
Commands to Access | cdh cd $HOME/ | cdw cd $WORK/ | cds cd $SCRATCH/ |
Purpose | Store Executables | Store Files and Programs | Run Jobs |
Time spent | When modifying basic settings | When installing new programs; Storing raw or final data | When analyzing data |
Executables that aren't available on TACC through the "module" command should be stored in $HOME.
If you plan to be using a set of files frequently or would like to save the results of a job, they should be stored in $WORK. While 1TB may seem like a lot of space you can easily fill it up with just a few sequencing projects, particularly if you store files in a non-compressed manner, or wish to store analyzed intermediates. Best practice would be to store the most important files (raw > scripts > final > analysis files) on a system such as corral or backed up to something else such as UTbox.
If you're going to run a job, it's a good idea to keep your input files in a directory in $WORK and copy them to a directory in $SCRATCH where you plan to run your job.
cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/
Understanding "jobs" and compute nodes.
When you log into stampede2 using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone that is logged into stampede2 (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, or the more commands you send at once, the slower the head node will work for you and everybody else. Get enough people running large jobs on the head node all at once (say, a class of summer school students) and stampede2 can actually crash, leaving nobody able to execute commands or even log in for minutes, hours, or perhaps even days if something goes really wrong. To try to avoid crashes, TACC monitors things and proactively stops processes before they get too out of hand. If you guess wrong about whether something is safe to run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages will lead to revoked TACC access and emails where you have to explain to TACC and your PI what you are doing, how you are going to fix it, and how you will avoid it in the future.
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of CPU time on the login nodes. Your current running process below has been killed and must be submitted to the queues, for usage policy see http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.
So you may be asking yourself what the point of using stampede2 is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are nearly 6,000 compute nodes with different configurations that can only be accessed by a single person for a specified amount of time. For the duration of the class, each student will interact with a single compute node using an interactive DEVelopment (iDEV) session so that you get the immediate feedback of seeing commands being run and know when to use the next command. This is not the typical way you will analyze your own data; Friday's tutorial will deal with the queue system.
While stampede2 is tremendously powerful and will greatly speed up your analysis, it doesn't have much in the way of a GUI (graphical user interface). The lack of a GUI means it can't visualize graphs or other meaningful representations of our data that we are used to seeing. In order to do these types of things, we have to get our data off of stampede2 and onto our own computers. This course uses the scp ("secure copy") command exclusively to move files back to your local computer; as mentioned, there are other programs that can be configured to transfer files back and forth more easily as you progress in your analysis.
Transferring files to and from stampede2 with scp
When this class was taught in person, it was helpful to have a small set of steps on transferring files between stampede2 and your local computer, which tended to give people problems. The idea was that some problems on the first day would eventually work themselves out through the week as the scp command was repeatedly used. Given how zoom makes it more difficult for problems to be identified, this has been moved to its own tutorial page so that it can be referenced more easily when files are to be transferred in future tutorials. For now, focus on transferring the README file from the BioITeam. Once done with the transfer tutorial, come back to this page to install a few extra programs and learn about the module system.
scp tutorial page.
Moving beyond the preinstalled commands on TACC
If (or when) you looked at what our edits to the .bashrc file did, you would have seen that section 1 has a series of "module load XXXX" commands, and a promise to talk more about them later. I'm sure you will be thrilled to learn that now is that time... As a "classically trained wet-lab biologist", one of the most difficult things I have experienced in computational analysis has been installing new programs to improve my analysis. Programs and their installation instructions tend (or appear) to be written by computational biologists in what at times feels like a foreign language, particularly when things start going wrong. Here we will discuss 3 ways of accessing new commands/programs/scripts and explain their benefits. This is an incomplete list of ways to install new programs, but is meant to be a good working example that you can adapt to install other programs in your future work.
1. TACC modules
Modules are programs or sets of programs that have been set up to run on TACC. They make managing your computational environment very easy. All you have to do is load the modules that you need and a lot of the advanced wizardry needed to set up the linux environment has already been done for you. New commands just appear.
To see all modules available in the current context, type:
module avail
Remember you can hit the "q" key to exit out of the "more" system, or just keep hitting return to see all of the modules available. The "module avail" command is not the most useful of commands if you have some idea of what you are looking for. For example, imagine you want to align a few million next generation sequencing reads to a genome, but you don't know what your options are. You can use the following command to get a list of programs that may be useful:
module keyword alignment
Note that this may not be a complete list, as it requires the name of the program or its description to contain the word "alignment". Looking through the results you may notice some of the programs you already know and use for aligning 2 sequences to each other, such as blast. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results you will see that one of the new results is:
bsmap: bsmap/2.92 BSMAP for Methylation
If you are sure you know the name of the program you need this list may be sufficient, but if you don't know exactly what you need the limited information available is probably not enough to make a good decision. To learn more about a particular program, try the following 2 commands:
module spider bowtie
module spider bowtie/2.3.2
In the first case, we see information about what versions of bowtie stampede2 has available for us, but really that is just the same information as we had from our previous search. This can be particularly useful when you know what program you want to use but don't know what versions are available. In the second case, we now have more detailed information about the particular version of interest, including websites we can visit to learn more about the program itself.
Once you have identified the module that you want to use, you load it using the following command:
module load bowtie/2.3.2
Using the version numbers for module commands
While not always strictly necessary, using the version number (in this case "/2.3.2") is a very good habit to get into, as it controls which version is loaded. In this case, because there are 2 very different versions available (2.3.2 and 1.2.1.1), module load bowtie will actually throw an error which tells you to use the module spider command to figure out how to correctly load the module.
While it is tempting to only use "module load name" without the version numbers, using the version numbers helps keep track of what versions were used for referencing in your future publications, and makes it easier to identify what went wrong when scripts that have been working for months or years suddenly stop working (i.e. TACC changed the default version of a program you are using).
This is one of the big advantages of using the conda system we will describe shortly, it easily keeps track of all versions of all programs you use.
Since the module load command doesn't give any output, it is often useful to check what modules you have loaded with either of the following commands:
module list
module list bowtie
The first example will list all currently loaded modules, while the second will only list loaded modules containing bowtie in the name. If you see that you have loaded the wrong version of something, that a module is conflicting with another, or you just don't feel like having it turned on anymore, use the following command:
module unload bowtie
You will notice when you type module list that you have several different modules loaded already. These come from both TACC defaults (TACC, linux, etc), and several that are used so commonly, both in this class and by biologists generally, that it becomes cumbersome to type "module load python3" all the time, so we just have them turned on by default by putting them in our profile to load on startup. As you advance in your own data analysis you may start to find yourself constantly loading modules as well. When you become tired of doing this (or see jobs fail to run because the modules that load on the compute nodes are based on your .bashrc file plus commands given to each node), you may want to add additional modules to your .bashrc file. This can be done using the "nano .bashrc" command from your home directory.
2. Downloading from the web directly to TACC
When files are hosted online as direct downloads, you can use the wget ("web get") command to skip your local computer and download the file directly to TACC. Typically this makes use of the "Copy Link Address" option when you right click on a link in a web browser that would otherwise start a download to your computer.
Here we will download the installation file for miniconda (which we will use in the next section and throughout the course) using both scp and wget to compare and contrast their functionality.
Using wget.
In a new browser or tab navigate to https://docs.conda.io/en/latest/miniconda.html and right click on the "Miniconda3 Linux 64-bit" in the linux installers section and choose copy link address.
You should see a download bar showing that the file has begun downloading; when complete, the ls command will show you a new compressed file named 'Miniconda3-latest-Linux-x86_64.sh'.
Using scp.
This is not necessary if you followed the wget commands above. Again, in a new browser or tab you would navigate to https://docs.conda.io/en/latest/miniconda.html, but instead of right clicking on "Miniconda3 Linux 64-bit" in the linux installers section and choosing copy link address, you would simply left click and allow the file to download to your browser's Downloads folder. Using information from the SCP tutorial you would then transfer the local 'Miniconda3-latest-Linux-x86_64.sh' file to the stampede2 remote location '$WORK/src'. Note that you are downloading a file that will run on TACC, not on your own computer; don't get confused thinking you need the windows or mac versions.
Given that the wget command doesn't require MFA or the somewhat cumbersome use of 2 different windows, and is subject to far fewer typos, hopefully you can see why wget is preferable whenever left clicking on a link directly downloads a file.
Finishing the conda installation
Regardless of which method you chose, the following set of commands will work to install conda. For later reference, if you are planning to install miniconda on other systems or your local laptop, the 'regular installation' links on that page may be useful.
bash Miniconda3-latest-Linux-x86_64.sh
Following the installation prompts you will need to:
- hit enter to page through the license agreement
- enter 'yes' to agree to said license agreement
- enter to confirm the default installation location
- enter 'yes' to initialize Miniconda3 by running conda init
logout
# log back in using the ssh command
conda config --set auto_activate_base false
conda config --set channel_priority strict
For help with the ssh command please refer back to Windows10 or MacOS tutorials. If you log out and back in 1 more time, what do you notice is different?
The first time you logged back in, your prompt should have looked like this:
(base) tacc:~$
The second time you logged back in, your prompt should go back to looking like it did before you installed conda:
tacc:~$
If your prompt is different, please get the instructor's attention.
Setting up your first environment
Now that you have installed conda, we want to get started with our first environment. More information about environments and their purpose can be found here, but for now we will just think about them as different sets of programs and relevant dependencies being installed together.
conda create --name GVA-fastqc  # enter 'y' to proceed
conda activate GVA-fastqc
This will once again change your prompt. This time the expected prompt is:
(GVA-fastqc) tacc:~$
Again, if you see something different, you need to get the instructor's attention. For the rest of the course it is assumed that your prompt will start with (GVA-program_name); if it does not, remember that you need to use the conda activate GVA-program_name
command to enter the environment.
3. Using miniconda on TACC
The anaconda and miniconda interfaces to the conda system are becoming increasingly popular for controlling one's environment, streamlining new program installation, and tracking what versions of programs are being used. A comparison of the two interfaces can be found here. The deciding factor on which interface we will use is hinted at, but not explicitly stated, in the referenced comparison: TACC does not have a GUI, so anaconda will not work, which is why we installed miniconda above.
Similar to the module system that TACC uses, the "conda" system allows for simple commands to download required programs/packages and modify environment variables (like $PATH discussed above). Two huge advantages of conda over the module system are: #1, instead of relying on the employees at TACC to take a program and package it for use in the module system, anyone (including the authors publishing a new tool they want the community to use) can create a conda package for a program; #2, rather than being restricted to the TACC clusters, conda works on all platforms (including Windows and macOS) and deals with all the required dependency programs in the background for you.
Conda environments in the instructor's work
In my own work, I recently remarked to my PI that "I wish I had started using this years ago", and was reminded that "it didn't exist years ago, at least in its current super usable and popular format". It is entirely possible that future classes will be taught with only minimal references to the TACC module system, and this year's course will feature far fewer than any previous year.
You may be thinking that since the conda system can work on your personal computer, you could just work on your personal computer for the duration of this class and ignore all the ssh commands and remote work. This is strongly not advised. While you would (in most cases) be able to use the same programs in both settings, the tutorials are developed with the speed of the stampede2 system in mind and attempt to minimize "waiting for something to finish" to roughly how long it takes to read through the next block of text in the tutorial, with some exceptions. If you were to do these tutorials on your personal computer, the timing would increase significantly and it would be difficult to keep up with the rest of the class.
In Friday's lecture I will explain why installing and using conda on your local computer is still a good idea and how I am currently using it in conjunction with TACC.
In the next tutorial we will start assessing the quality of some NGS reads using the fastqc program. Before we can use it, we must install it. Similar to the module system described above, to install a program via conda, we need 3 things:
- Tell bash we want to use the conda program.
- Tell conda we want to install a new program.
- Name the program we want to install.
conda activate GVA-fastqc
conda install fastqc
If you have already activated your GVA-fastqc environment, the first line will not do anything, but if you have not, you will see your prompt change to show (GVA-fastqc) at the far left of the line. As for the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:
PackagesNotFoundError: The following packages are not available from current channels:

  - fastqc

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
More information about "channels" can be found here.
Conda Channels
By the end of this course you may find that the 'bioconda' channel is full of lots of programs you want to use, and may choose to permanently add it to your list of channels so the above command conda install fastqc
and others used in this course would work without having to go through the intermediate of searching for the specific installation commands, or finding what channel the program you want is in. Information about how to do this, as well as more detailed information of why it is bad practice to go around adding large numbers of channels can be found here. Similarly, when we get to the read mapping tutorial, we will go over the conda-forge channel which is also very helpful to have. There will be a post class tutorial of how to add channels permanently if you feel this is something you would benefit from after the class.
For now, use the error message you saw above to try to install the fastqc program yourself.
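If you get stuck: searching anaconda.org for fastqc shows it hosted on the bioconda channel, so one form of the command that should work (the channel name here comes from that search, and matches the bioconda entry in the output below) is:

```shell
# specify the bioconda channel explicitly for this one install
conda install -c bioconda fastqc
```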
If all goes well, the installation command should give you output similar to the following with you answering "y" when prompted if you actually want to install the packages:
The following packages will be downloaded:

    package                        |            build
    -------------------------------|-----------------
    dbus-1.13.18                   |       hb2f20db_0         504 KB
    fastqc-0.11.9                  |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37 |       hd3eb1b0_0         335 KB
    glib-2.69.1                    |       h4ff587b_1         1.7 MB
    libxml2-2.9.14                 |       h74e7548_0         718 KB
    openjdk-11.0.13                |       h87a67e3_0        341.0 MB
    ------------------------------------------------------------
                                                Total:       354.0 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
  dbus               pkgs/main/linux-64::dbus-1.13.18-hb2f20db_0
  expat              pkgs/main/linux-64::expat-2.4.4-h295c915_0
  fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
  font-ttf-dejavu-s~ pkgs/main/noarch::font-ttf-dejavu-sans-mono-2.37-hd3eb1b0_0
  fontconfig         pkgs/main/linux-64::fontconfig-2.13.1-h6c09931_0
  freetype           pkgs/main/linux-64::freetype-2.11.0-h70c0345_0
  glib               pkgs/main/linux-64::glib-2.69.1-h4ff587b_1
  icu                pkgs/main/linux-64::icu-58.2-he6710b0_3
  libffi             pkgs/main/linux-64::libffi-3.3-he6710b0_2
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
  libpng             pkgs/main/linux-64::libpng-1.6.37-hbc83047_0
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
  libuuid            pkgs/main/linux-64::libuuid-1.0.3-h7f8727e_2
  libxcb             pkgs/main/linux-64::libxcb-1.15-h7f8727e_0
  libxml2            pkgs/main/linux-64::libxml2-2.9.14-h74e7548_0
  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0
  pcre               pkgs/main/linux-64::pcre-8.45-h295c915_0
  perl               pkgs/main/linux-64::perl-5.26.2-h14c3975_0
  xz                 pkgs/main/linux-64::xz-5.2.5-h7f8727e_1
  zlib               pkgs/main/linux-64::zlib-1.2.12-h7f8727e_2

Proceed ([y]/n)?
y

Downloading and Extracting Packages
fastqc-0.11.9        | 9.7 MB | ################################ | 100%
font-ttf-dejavu-sans | 335 KB | ################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
There are three commonly used methods to verify you have a given program installed. You should try all three in order for the fastqc program:
- The 'which' command can be used to search your $PATH variable for a command with a specific name, and return the location the command is stored in
which fastqc
- Many commands accept an option of '--version' to simply access the program and return what version of the program is installed
fastqc --version
- Nearly all commands/programs accept "-h" or "--help" options to give you basic information about how the command or program works
fastqc --help
Throughout the course, you will routinely use the above 3 commands to make sure that you have access to a given program, that it is the correct version, and to get an idea of how to construct commands to perform a given analysis step. For now, be satisfied that if you do not get output like the following, you have correctly installed fastqc. In the next tutorial we will actually use fastqc. Examples of output you do not want to see from the above commands:
/usr/bin/which: no fastqc in (<large list of directories specific to your TACC account>)
-bash: fastqc: command not found
-bash: fastqc: command not found
Github – an additional common method of getting files onto TACC
This section covers the git clone
command. Git is a tool often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper in a github repository and publish the link in their paper or on their lab website. Github repositories are a great thing to collect in a single location in your $WORK directory.
Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data, as the variants are well documented and emerge in a controlled manner over the course of the evolution experiment. Cloning a github repository is exceptionally similar to using the wget
command to download the repository: it involves typing 'git clone
' followed by a web address where the repository is stored. As we did when installing miniconda with wget, we'll clone the repository into a 'src' directory inside of $WORK.
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
In a web browser navigate to github and search for 'LTEE-Ecoli' in the top right corner of the page. The only result will be for barricklab/LTEE-Ecoli; click the green box for 'clone or download' and either control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.
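With the address copied, the clone step could look like the following; the .git address shown is inferred from the barricklab/LTEE-Ecoli repository named above, so paste whatever you actually copied:

```shell
cd $WORK
mkdir -p src   # -p avoids the benign "already exists" error
cd src
# paste the repository address you copied from the green box
git clone https://github.com/barricklab/LTEE-Ecoli.git
```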
You will see several download indicators increase to 100%, and when you get your command prompt back the ls
command will show a new folder named 'LTEE-Ecoli' containing a set of files. If you don't see said directory, or can't cd into that directory let the instructor know.
pip
In previous years, the pip installation program was used to install a few programs. While those programs will be installed through conda this year, the link here is provided to give a detailed walk through of how to use pip on TACC resources. This is particularly helpful for making use of the '--user' flag during the installation process as you do not have the expected permissions to install things in the default directories.
This concludes the linux and stampede2 refresher/introduction tutorial.
Genome Variant Analysis Course 2023 home.