...
This portion of the class is devoted to making sure we all begin from the same starting point on stampede2. This tutorial was developed as a combined version of multiple other tutorials which were previously given credit here. Anyone wishing to use this tutorial is welcome.
...
- Familiarize yourself with the way course material will be presented.
- Log into stampede2.
- Change your stampede2 profile to the course specific format.
- Refresh understanding of basic linux commands with some course organization.
- Review use of the nano text editor program, and become familiar with several other text editor programs.
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cp /corral-repl/utexas/BioITeam/scriptsgva_course/GVA2021.bashrc .bashrc
cp /corral-repl/utexas/BioITeam/scriptsgva_course/GVA2021.profile .profile
chmod 700 .bashrc
chmod 700 .profile |
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ssh <username>@stampede2.tacc.utexas.edu |
If everything is working correctly you should now see this as your prompt:
No Format |
---|
tacc:~$ |
Warning |
---|
If you see anything besides " |
Setting up other shortcuts:
In order to make navigating to the different file systems on stampede2 a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into stampede2 with a terminal, from your home directory.
Code Block | ||
---|---|---|
| ||
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam
|
Several people report seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied."
This is being investigated, but is not expected to impact today's tutorial.
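One plausible cause of the message above (an assumption, not a confirmed diagnosis) is running the ln commands a second time: when the link name already exists and points to a directory, ln tries to create the new link inside that directory instead. The sketch below only creates a shortcut when nothing with that name exists yet. The helper name make_link and the throwaway demo directory are hypothetical and are not part of the course scripts.

```shell
# Create a symbolic link only if nothing with that name exists yet.
# make_link is a hypothetical helper, not part of the course scripts.
make_link() {
    target="$1"
    name="$2"
    if [ -e "$name" ] || [ -L "$name" ]; then
        echo "skipping $name: already exists"
    else
        ln -s "$target" "$name" && echo "created $name -> $target"
    fi
}

# Demonstration in a throwaway directory so nothing in $HOME is touched.
demo=$(mktemp -d)
mkdir "$demo/target"
make_link "$demo/target" "$demo/scratch"   # creates the link
make_link "$demo/target" "$demo/scratch"   # skipped on the second run
```

On stampede2 the equivalent call from your home directory would be something like make_link "$SCRATCH" scratch.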
Understanding what your .bashrc file actually does.
...
title | While interesting and useful information to have, understanding it is not critical to variant analysis. I suggest you to look through this information after you complete the rest of the tutorial, in your free time, or when you need to modify your profile or bashrc files in the future. |
---|
...
Let's look at what your .bashrc profile actually does. Use the cat command to print contents of the .bashrc file to the screen.
...
It is also likely or expected that upon logging in you see the following:
No Format |
---|
The following have been reloaded with a version change:
1) impi/18.0.2 => impi/17.0.3 2) intel/18.0.2 => intel/17.0.4 3) python2/2.7.15 => python2/2.7.14 |
These messages have to do with some of the core compilers and associated tools on TACC. You could use the module spider commands detailed below to find out more information of any of these modules and track down why such changes might be being made, but they are not concerning.
...
Expand | ||||
---|---|---|---|---|
| ||||
Komodo Edit is another free, full-featured text editor with syntax coloring for many programming languages and a remote file editing interface. It has versions for both Macintosh and Windows. Download the appropriate install image here. Once installed, start Komodo Edit and follow these steps to configure it:
When you want to open an existing file on stampede2, do the following:
To create and save a new file, do the following:
|
...
So you may be asking yourself what the point of using stampede2 is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are nearly 6,000 compute nodes with different configurations, each of which is accessed by only a single person for a specified amount of time. For the duration of the class, each student will interact with a single compute node using an interactive DEVelopment (iDEV) session so that you get the immediate feedback of seeing commands being run and know when to use the next command. This is not the typical way you will analyze your own data. Friday's tutorial will deal with the queue system.
While stampede2 is tremendously powerful and will greatly speed up your analysis, it doesn't have much in the way of a GUI (graphical user interface). The lack of a GUI means it can't display graphs or other meaningful representations of our data that we are used to seeing. In order to do these types of things, we have to get our data off of stampede2 and onto our own computers. This course uses the scp ("secure copy") command exclusively to move files back to your local computer; as mentioned, there are other programs that can be configured to more easily transfer files back and forth as you progress in your analysis.
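As a rough sketch of what such a transfer looks like, the snippet below assembles and prints an scp command without running it. The username, remote file path, and local destination are all placeholder assumptions; substitute your own values. The command is run from a terminal on your local computer, not from one logged into stampede2.

```shell
# All values below are hypothetical placeholders; replace them with your own.
tacc_user="username"
tacc_host="stampede2.tacc.utexas.edu"
remote_file="/path/on/stampede2/results.html"
local_dir="."

# General form: scp <user>@<host>:<remote file> <local destination>
scp_cmd="scp ${tacc_user}@${tacc_host}:${remote_file} ${local_dir}"

# Print the command for inspection; copy it into a local terminal to run it.
echo "$scp_cmd"
```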
...
If (or when) you looked at what our edits to the .bashrc file did, you would have seen that section 1 has a series of "module load XXXX" commands, and a promise to talk more about them later. I'm sure you will be thrilled to learn that now is that time... As a "classically trained wet-lab biologist" one of the most difficult things I have experienced in computational analysis has been installing new programs to improve my analysis. Programs and their installation instructions tend (or appear) to be written by computational biologists in what at times feels like a foreign language, particularly when things start going wrong. Here we will discuss 3 ways of accessing new commands/programs/scripts and explain their benefits. This is an incomplete list of ways to install new programs, but is meant to be a good working example that you can adapt to install other programs in your future work.
...
Note that this may not be an inclusive list as it requires the name of the program, or its description to contain the word "alignment". Looking through the results you may notice some of the programs you already know and use for aligning 2 sequences to each other such as blast and clustalw. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results you will see that one of the new results is:
...
Here we will download the installation file for miniconda (which we will use in the next section and throughout the course) using both scp and wget to compare and contrast their functionality.
Using wget.
In a new browser window or tab, navigate to https://docs.conda.io/en/latest/miniconda.html, right click on "Miniconda3 Linux 64-bit" in the Linux installers section, and choose copy link address.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cd $WORK2
mkdir src
cd src |
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh |
You should see a download bar showing you the file has begun downloading; when complete, the ls command will show you a new compressed file named 'Miniconda3-py39_4.9.2-Linux-x86_64.sh'.
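If you want more reassurance than a file name appearing in ls, one option is to compare the file's SHA-256 checksum against a published value. This is a sketch under assumptions: it presumes the download page lists a SHA256 hash for the installer, and verify_sha256 is a hypothetical helper name. The demonstration uses an empty temporary file, whose SHA-256 is a well-known constant, so that it runs anywhere.

```shell
# verify_sha256 is a hypothetical helper: compare a file's SHA-256 to an expected value.
verify_sha256() {
    file="$1"
    expected="$2"
    # sha256sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "checksum ok: $file"
    else
        echo "checksum MISMATCH: $file"
        return 1
    fi
}

# Demonstration on an empty temporary file (mktemp creates it with no content).
demo_file=$(mktemp)
empty_hash="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
verify_sha256 "$demo_file" "$empty_hash"
```

For the real installer you would pass 'Miniconda3-py39_4.9.2-Linux-x86_64.sh' as the first argument and the hash copied from the download page as the second.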
Using scp.
This is not necessary if you followed the wget commands above. Again, in a new browser window or tab you would navigate to https://docs.conda.io/en/latest/miniconda.html, but instead of right clicking on "Miniconda3 Linux 64-bit" in the Linux installers section and choosing copy link address, you would simply left click and allow the file to download directly to your browser's Downloads folder. Using information from the SCP tutorial you would then transfer the local 'Miniconda3-py39_4.9.2-Linux-x86_64.sh' file to the stampede2 remote location '$WORK2/src'.
Given that the wget command doesn't involve having to use MFA or the somewhat cumbersome use of 2 different windows, and is subject to many fewer typos, hopefully you see how wget is preferable, provided left clicking on a link directly downloads a file.
Finishing conda installation, and
Regardless of which method you chose, the following set of commands will install conda. For later reference, if you are planning to install miniconda on other systems or your local laptop, the 'regular installation' links on this page may be useful.
Code Block | ||||
---|---|---|---|---|
| ||||
bash Miniconda3-py39_4.9.2-Linux-x86_64.sh
logout
#log back in using the ssh command.
conda config --set auto_activate_base false |
Following the installation prompts you will need to:
- hit enter to page through the license agreement
- enter 'yes' to agree to said license agreement
- hit enter to confirm the default installation location
- enter 'yes' when asked whether to initialize Miniconda3 by running conda init
For help with the ssh command please refer back to Windows10 or MacOS tutorials. If you log out and back in 1 more time, what do you notice is different?
The first time you logged back in, your prompt should have looked like this:
No Format |
---|
(base) tacc:~$ |
The second time you logged back in, your prompt should go back to looking like it did before you installed conda:
No Format |
---|
tacc:~$ |
If your prompt is different, please get the instructor's attention.
Setting up your first environment
Now that you have installed conda, we want to get started with our first environment. More information about environments and their purpose can be found here, but for now we will just think about them as different sets of programs and relevant dependencies being installed together.
Code Block | ||||
---|---|---|---|---|
| ||||
conda create --name GVA2021
# enter 'y' to proceed
conda activate GVA2021 |
This will once again change your prompt. This time the expected prompt is:
No Format |
---|
(GVA2021) tacc:~$ |
Again, if you see something different, you need to get the instructor's attention. For the rest of the course, it is assumed that your prompt will start with (GVA2021); if not, remember that you need to use the conda activate GVA2021 command to enter the environment.
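Besides looking at the prompt, a script can check which environment is active by reading the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated. The following is a minimal sketch; the helper name in_course_env is hypothetical.

```shell
# Name of the environment created above.
expected_env="GVA2021"

# in_course_env is a hypothetical helper: succeeds only when the
# currently active conda environment matches expected_env.
in_course_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$expected_env" ]
}

if in_course_env; then
    echo "environment $expected_env is active"
else
    echo "not in $expected_env; run: conda activate $expected_env"
fi
```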
3. Using miniconda on TACC
The anaconda and miniconda interfaces to the conda system are becoming increasingly popular for controlling one's environment, streamlining new program installation, and tracking what versions of programs are being used. A comparison of the two different interfaces can be found here. The deciding factor on which interface we will use is hinted at, but not explicitly stated, in the referenced comparison: TACC does not have a GUI and therefore anaconda will not work, which is why we installed miniconda above.
Similar to the module system that TACC uses, the "conda" system allows for simple commands to download required programs/packages and modify environmental variables (like $PATH discussed above). Two huge advantages of conda over the module system are: #1 instead of relying on the employees at TACC to take a program and package it for use in the module system, anyone (including the same authors publishing a new tool they want the community to use) can create a conda package for a program; #2 rather than being restricted to use on the TACC clusters, conda works on all platforms (including Windows and macOS), and deals with all the required dependency programs in the background for you.
Info | ||
---|---|---|
| ||
In my own work, I recently remarked to my PI that "I wish I had started using this 5 years ago", and was reminded that "it didn't exist 5 years ago, at least in its current super usable and popular format". It is entirely possible that future classes will be taught with only minimal references to the TACC module system, and this year's course will feature far fewer than any previous year. |
In the next tutorial we will start assessing the quality of some NGS reads using the fastqc program. Before we can use it, we must install it. Similar to the module system described above, to install a program via conda, we need 3 things:
- Tell bash we want to use the conda program.
- Tell conda we want to install a new program.
- Name the program we want to install.
Code Block | ||||
---|---|---|---|---|
| ||||
conda activate GVA2021
conda install fastqc
|
If you have already activated your GVA2021 environment, the first line will not do anything, but if you have not, you will see your prompt has changed to now say (GVA2021) on the far left of the line. As to the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:
No Format |
---|
PackagesNotFoundError: The following packages are not available from current channels:
- fastqc
Current channels:
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page. |
More information about "channels" can be found here. By the end of this course you may find that the 'bioconda' channel is full of lots of programs you want to use, and may choose to permanently add it to your list of channels so the above command conda install fastqc and others used in this course would work without having to go through the intermediate step of searching for the specific installation commands, or finding what channel the program you want is in. Information about how to do this, as well as more detailed information on why it is bad practice to go around adding large numbers of channels, can be found here.
For now, use the error message you saw above to try to install the fastqc program yourself.
Expand | ||
---|---|---|
| ||
https://anaconda.org/bioconda/fastqc If you were unable to find this page, the most likely error is that you entered fastqc into the search box, recognized that the result with 360,000+ downloads was likely the program you wanted, and then clicked the first bit of hyperlink you found, which took you to the bioconda page instead of to the fastqc program. Personally, I think the entire box should be clickable to send you to the program page, but nobody has asked me. |
Expand | |||||
---|---|---|---|---|---|
| |||||
While there are two other possible commands listed, I tend to always start with the simplest command and work my way from there. The other two commands deal with accessing specific labels/versions of the program. |
If all goes well, the installation command should give you the following output with you answering "y" when prompted if you actually want to install the packages:
No Format |
---|
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    fastqc-0.11.9              |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37|       h6964260_0         335 KB
    ------------------------------------------------------------
                                           Total:        10.0 MB
The following NEW packages will be INSTALLED:
  fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
  font-ttf-dejavu-s~ pkgs/main/noarch::font-ttf-dejavu-sans-mono-2.37-h6964260_0
  openjdk            pkgs/main/linux-64::openjdk-8.0.152-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
fastqc-0.11.9        | 9.7 MB    | ######################################## | 100%
font-ttf-dejavu-sans | 335 KB    | ######################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done |
Github
This is about using the git clone command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website.
Here we will clone the github repository for breseq, which is developed by the Barrick lab here at UT and is used to comprehensively analyze haploid microbial genomes to identify all variants present. In some of the initial tutorials everyone will use a version of breseq that is available through the BioITeam; in the optional tutorials you may compile your own copy of breseq from this github project to underscore why binary files are typically preferred, or as a way of easily staying up to date on new developments with the program itself.
...
There are three commonly used methods to verify you have a given program installed. You should try all three in order for the fastqc program:
Code Block language bash title
...
cd $WORK
mkdir src
cd src
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
...
The 'which' command can be used to search your $PATH variable for a command with a specific name, and return the location the command is stored in:
which fastqc
Code Block language bash title
...
git clone https://github.com/barricklab/breseq.git
You will see several download indicators increase to 100%, and when you get your command prompt back the ls command will show a new folder named 'breseq' containing a set of files. If you don't see said directory, or can't cd into that directory, let the instructor know.
As with Trimmomatic, these files will require additional work that is specific to the program and therefore beyond the scope of this tutorial. A link to the advanced tutorials for getting your own copy of breseq up and running will be added later in the week.
pip
This is about using the pip3 install command. pip is the standard package manager for the common programming language python. When labs put together new analysis programs/packages, increasingly they try to make these programs available for others to use via pip. pip3 rather than just pip references the specific version of python.
Here we will install the multiqc analysis program, which compiles reports from a program called fastqc about the quality of fastq files from multiple different samples at one time. In the later portion of the class you may choose to work with this program to get a better overall view of multiple fastq files all at once rather than clicking through individual reports.
Code Block | ||||
---|---|---|---|---|
| ||||
pip3 install --user multiqc |
*note that the "--user" option in the above code is required while working on LS5 because individual users do not have access to core systems. If you have python3 on your personal computer and wanted to install multiqc (or any other package available through pip) you would typically omit the "--user" flag.
...
Many commands accept an option of '--version' to simply access the program and return what version of the program is installed:
fastqc --version
Nearly all commands/programs accept "-h" or "--help" options to give you basic information about how the command or program works:
fastqc --help
Throughout the course, you will routinely use the above 3 commands to make sure that you have access to a given program, that it is the correct version, and to get an idea of how to construct commands to perform a given analysis step. For now, be satisfied that you have correctly installed fastqc as long as your output is not one of the following. In the next tutorial we will actually use fastqc. Examples of output you do not want to see from the above commands:
/usr/bin/which: no fastqc in (<large list of directories specific to your TACC account>)
-bash: fastqc: command not found
-bash: fastqc: command not found
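The first of these checks can be wrapped in a small function so that a missing program produces a clear message instead of the errors above. This is only a sketch: check_program is a hypothetical helper name, and it uses the shell builtin command -v in place of which because the builtin behaves the same way across systems.

```shell
# check_program is a hypothetical helper: report where a program lives on $PATH,
# or return a non-zero status if it cannot be found.
check_program() {
    prog="$1"
    if location=$(command -v "$prog"); then
        echo "$prog found at: $location"
    else
        echo "$prog not found on \$PATH"
        return 1
    fi
}

# ls exists on essentially every system, so this call should succeed.
check_program ls

# For a course program, print a reminder when the check fails.
check_program fastqc || echo "fastqc is missing; check that your conda environment is active"
```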
Github – an additional common method of getting files onto TACC
This is about using the git clone command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website. Github repositories are a great thing to add to a single location in your $WORK2 directory.
Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data as the variants are well documented and emerge in a controlled manner over the course of the evolution experiment. Initially cloning a github repository is exceptionally similar to using the wget command to download the repository: it involves typing 'git clone' followed by a web address where the repository is stored. As we did with wget when installing miniconda, we'll clone the repository into a 'src' directory inside of $WORK2.
Code Block | ||||
---|---|---|---|---|
| ||||
which multiqc
multiqc |
...
| ||||
cd $WORK2
mkdir src
cd src |
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
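If you would rather avoid that message altogether, mkdir accepts a -p option that creates the directory only when it is missing and stays silent otherwise. A minimal sketch follows; the fallback to a temporary directory when $WORK2 is not set is an assumption added so the example also runs outside of TACC.

```shell
# Use $WORK2 when it is defined (as on stampede2); otherwise fall back to a
# throwaway directory so the example can be tried anywhere (an assumption).
base="${WORK2:-$(mktemp -d)}"

# -p creates the directory if needed and does not complain if it already exists.
mkdir -p "$base/src"
mkdir -p "$base/src"   # running it again produces no error

cd "$base/src"
pwd
```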
In a web browser navigate to github and search for 'LTEE-Ecoli' in the top right corner of the page. The only result will be for barricklab/LTEE-Ecoli; click the green box for 'clone or download' and either control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
git clone https://github.com/barricklab/LTEE-Ecoli.git |
You will see several download indicators increase to 100%, and when you get your command prompt back the ls command will show a new folder named 'LTEE-Ecoli' containing a set of files. If you don't see said directory, or can't cd into that directory, let the instructor know. The multiqc tutorial can be found here.
pip
In previous years, the pip installation program was used to install a few programs. While those programs will be installed through conda this year, the link here is provided to give a detailed walk through of how to use pip on TACC resources. This is particularly helpful for making use of the '--user' flag during the installation process as you do not have the expected permissions to install things in the default directories.
This concludes the Linux and stampede2 refresher/introduction tutorial.
Genome Variant Analysis Course 2021 home.