...
Here we will download the installation file for miniconda (which we will use in the next section and throughout the course) using both scp
and wget
to compare and contrast their functionality.
3. Using miniconda on TACC
...
title | Conda environments in the instructor's work |
---|
...
...
Using wget.
In a new browser window or tab, navigate to https://docs.conda.io/en/latest/miniconda.html, right-click "Miniconda3 Linux 64-bit" in the Linux installers section, and choose "copy link address".
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cd $WORK2
mkdir src
cd src |
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh |
You should see a progress bar once the file begins downloading. When the download is complete, the ls
command will show you a new file named 'Miniconda3-py39_4.9.2-Linux-x86_64.sh'
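As a quick sanity check after any download, something like the following could be used. This is only a sketch: check_download is a hypothetical helper name invented for this example (it is not part of wget or conda), and it only confirms that the file exists and is not empty, not that the download is uncorrupted.

```shell
# Hypothetical helper (not part of wget or conda): confirm that a
# downloaded file exists and is not empty.
check_download() {
    file="$1"
    if [ -s "$file" ]; then
        # -s is true only when the file exists and has a size greater than zero
        echo "OK: $file ($(wc -c < "$file") bytes)"
    else
        echo "MISSING or empty: $file" >&2
        return 1
    fi
}

# Example use from inside $WORK2/src after the wget command above
check_download Miniconda3-py39_4.9.2-Linux-x86_64.sh || echo "Download did not complete; try wget again."
```

If the file is missing or empty, the helper returns a non-zero status, so it can be combined with || as shown to print a reminder.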
Using scp.
This is not necessary if you followed the wget commands above. Again, in a new browser window or tab you would navigate to https://docs.conda.io/en/latest/miniconda.html, but instead of right-clicking "Miniconda3 Linux 64-bit" in the Linux installers section and choosing "copy link address", you would simply left-click and allow the file to download directly to your browser's Downloads folder. Using information from the SCP tutorial, you would then transfer the local 'Miniconda3-py39_4.9.2-Linux-x86_64.sh' file to the stampede2 remote location '$WORK2/src'.
Given that the wget command doesn't require MFA or the somewhat cumbersome use of 2 different windows, and is subject to far fewer typos, hopefully you see why wget is preferable, provided that left-clicking the link downloads the file directly.
Finishing conda installation, and
Regardless of which method you chose, the following set of commands will install conda. For later reference, if you are planning to install miniconda on other systems or your local laptop, the 'regular installation' links on this page may be useful.
Code Block | ||||
---|---|---|---|---|
| ||||
bash Miniconda3-py39_4.9.2-Linux-x86_64.sh
logout
#log back in using the ssh command.
conda config --set auto_activate_base false |
For help with the ssh command please refer back to Windows10 or MacOS tutorials. If you log out and back in 1 more time, what do you notice is different?
The first time you logged back in, your prompt should have looked like this:
No Format |
---|
The second time you logged back in, your prompt should now look like this:
No Format |
---|
If your prompt is different, please get the instructor's attention.
Setting up your first environment
Now that you have installed conda, we want to get started with our first environment. More information about environments and their purpose can be found here, but for now we will just think about them as different sets of programs and relevant dependencies being installed together.
Code Block | ||||
---|---|---|---|---|
| ||||
conda create --name GVA2021 |
3. Using miniconda on TACC
The anaconda and miniconda interfaces to the conda system are becoming increasingly popular for controlling one's environment, streamlining new program installation, and tracking what versions of programs are being used. A comparison of the two interfaces can be found here. The deciding factor in which interface we will use is hinted at, but not explicitly stated, in the referenced comparison: TACC does not have a GUI and therefore anaconda will not work, which is why we installed miniconda above.
Similar to the module system that TACC uses, the "conda" system allows simple commands to download required programs/packages and modify environment variables (like $PATH discussed above). Conda has two huge advantages over the module system: #1 instead of relying on the employees at TACC to take a program and package it for use in the module system, anyone (including the authors publishing a new tool they want the community to use) can create a conda package for a program; #2 rather than being restricted to use on the TACC clusters, conda works on all platforms (including Windows and macOS), and deals with all the required dependency programs in the background for you.
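To make the $PATH idea concrete, here is a small sketch of what "modifying $PATH" actually does. It does not use conda or the module system at all; demo_tool and the temporary directory are made-up names used only for this illustration.

```shell
# Create a throwaway directory holding a tiny executable script.
# "demo_tool" is a made-up name used only for this illustration.
demo_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from demo_tool"\n' > "$demo_dir/demo_tool"
chmod +x "$demo_dir/demo_tool"

# Before changing PATH, the shell has no idea where demo_tool lives.
command -v demo_tool || echo "demo_tool not found yet"

# Prepend the directory to PATH; this is the kind of change that
# 'module load' or 'conda activate' makes for you behind the scenes.
PATH="$demo_dir:$PATH"
export PATH

# Now the shell can find and run the script by name alone.
command -v demo_tool
demo_tool
```

Directories earlier in $PATH are searched first, which is why activating an environment can change which copy of a program you get.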
Info | ||
---|---|---|
| ||
In my own work, I recently remarked to my PI that "I wish I had started using this 5 years ago", and was reminded that "it didn't exist 5 years ago, at least in its current super usable and popular format". It is entirely possible that future classes will be taught with only minimal references to the TACC module system, and this year's course will feature far fewer than any previous year. |
In the next tutorial we will start assessing the quality of some NGS reads using the fastqc program. Before we can use it, we must install it. Similar to the module system described above, to install a program via conda, we need 3 things:
- Tell bash we want to use the conda program.
- Tell conda we want to install a new program.
- Name the program we want to install.
Code Block | ||||
---|---|---|---|---|
| ||||
conda activate GVA2021
conda install fastqc
|
If you have already activated your GVA2021 environment, the first line will not do anything, but if you have not, you will see that your prompt has changed to say (GVA2021) on the far left of the line. As for the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:
No Format |
---|
PackagesNotFoundError: The following packages are not available from current channels:
- fastqc
Current channels:
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page. |
More information about "channels" can be found here. By the end of this course you may find that the 'bioconda' channel is full of programs you want to use, and may choose to permanently add it to your list of channels so the above command conda install fastqc
and others used in this course would work without the intermediate step of searching for the specific installation commands, or finding which channel the program you want is in. Information about how to do this, as well as a more detailed explanation of why it is bad practice to go around adding large numbers of channels, can be found here.
For now, use the error message you saw above to try to install the fastqc program yourself.
Expand | ||
---|---|---|
| ||
https://anaconda.org/bioconda/fastqc If you were unable to find this page, the most likely error is that you entered fastqc into the search box, recognized that the result with 360,000+ downloads was likely the program you wanted, and then clicked the first bit of hyperlink you found, which took you to the bioconda page instead of to the fastqc program. Personally, I think the entire box should be clickable to send you to the program page, but nobody has asked me. |
Expand | |||||
---|---|---|---|---|---|
| |||||
While there are two other possible commands listed, I tend to always start with the simplest command and work my way from there. The other two commands deal with accessing specific labels/versions of the program. |
If all goes well, the installation command should give you the following output with you answering "y" when prompted if you actually want to install the packages:
No Format |
---|
The following packages will be downloaded:
    package                        |            build
    -------------------------------|-----------------
    fastqc-0.11.9                  |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37 |       h6964260_0         335 KB
    ------------------------------------------------------------
                                           Total:        10.0 MB
The following NEW packages will be INSTALLED:
  fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
  font-ttf-dejavu-s~ pkgs/main/noarch::font-ttf-dejavu-sans-mono-2.37-h6964260_0
  openjdk            pkgs/main/linux-64::openjdk-8.0.152-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
fastqc-0.11.9        | 9.7 MB    | ######################################## | 100%
font-ttf-dejavu-sans | 335 KB    | ######################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done |
There are three commonly used methods to verify you have a given program installed. You should try all three in order for the fastqc program:
The 'which' command can be used to search your $PATH variable for a command with a specific name, and return the location the command is stored in:
Code Block | ||||
---|---|---|---|---|
| ||||
which fastqc |
Many commands accept an option of '--version' to simply access the program and return what version of the program is installed:
Code Block | ||||
---|---|---|---|---|
| ||||
fastqc --version |
Nearly all commands/programs accept "-h" or "--help" options to give you basic information about how the command or program works:
Code Block | ||||
---|---|---|---|---|
| ||||
fastqc --help |
Throughout the course, you will routinely use the above 3 commands to make sure that you have access to a given program, that it is the correct version, and to get an idea of how to construct commands to perform a given analysis step. For now, be satisfied that if your output does not look like the following, you have correctly installed fastqc. In the next tutorial we will actually use fastqc. Examples of output you do not want to see from the above commands:
/usr/bin/which: no fastqc in (<large list of directories specific to your TACC account>)
-bash: fastqc: command not found
-bash: fastqc: command not found
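If you find yourself repeating the first of these checks often, it could be wrapped in a small shell function. This is only a sketch: check_program is a hypothetical name invented here, and it uses the shell built-in 'command -v', which answers the same question as 'which'.

```shell
# Hypothetical helper: report whether a program can be found on $PATH.
check_program() {
    prog="$1"
    if ! command -v "$prog" >/dev/null 2>&1; then
        echo "$prog: not found on PATH"
        return 1
    fi
    echo "$prog: found at $(command -v "$prog")"
}

# 'ls' should always be found; fastqc will only be found once it is
# installed and the environment containing it is active.
check_program ls
check_program fastqc || echo "Activate your environment (or install fastqc) and try again."
```

Once a program is found, running it with --version and --help as described above is still the best way to confirm you have the version you expect.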
Additional common methods of getting files onto TACC
Github
This is about using the git clone
command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website. Github repositories are a great thing to add to a single location in your $WORK2 directory.
Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data as the variants are well documented, and emerge in a controlled manner over the course of the evolution experiment. Initially cloning a github repository is exceptionally similar to using the wget
command to download the repository: it involves typing 'git clone
' followed by a web address where the repository is stored. As we did when downloading miniconda with wget, we'll clone the repository into a 'src' directory inside of $WORK2.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cd $WORK2
mkdir src
cd src |
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
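If you would rather not see that message at all, the -p option to mkdir only creates the directory when it is missing. The sketch below assumes $WORK2 is set as it is on TACC; when it is not set (for example on your laptop), it falls back to a temporary directory purely so the example can be run anywhere.

```shell
# Use $WORK2 when it is defined (as on TACC); otherwise fall back to a
# temporary directory. The fallback is only an assumption made so this
# sketch also runs outside of TACC.
base="${WORK2:-$(mktemp -d)}"

# -p creates the directory only if needed, so running the command a
# second time produces no error message.
mkdir -p "$base/src"
mkdir -p "$base/src"

cd "$base/src"
pwd
```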
In a web browser, navigate to github and search for 'LTEE-Ecoli' in the top right corner of the page. The only result will be for barricklab/LTEE-Ecoli; click the green box for 'clone or download' and either press control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
git clone https://github.com/barricklab/breseq.git |
You will see several download indicators increase to 100%, and when you get your command prompt back the ls
command will show a new folder named 'breseq' containing a set of files. If you don't see said directory, or can't cd into that directory let the instructor know.
As with Trimmomatic, these files will require additional work that is specific to the program and therefore beyond the scope of this tutorial. A link to the advanced tutorials for getting your own copy of breseq up and running will be added later in the week.
pip
This is about using the pip3 install
command. pip is the standard package manager for the common programming language python. When labs put together new analysis programs/packages, increasingly they try to make these programs available for others to use via pip. Using pip3 rather than just pip specifies that you want the pip associated with python version 3.
Here we will install the multiqc
analysis program which compiles reports from a program called fastqc
about the quality of fastq files from multiple different samples at one time. In the later portion of the class you may choose to work with this program to get a better overall view of multiple fastq files all at once rather than clicking through individual reports.
Code Block | ||||
---|---|---|---|---|
| ||||
pip3 install --user multiqc |
*note that the "--user" option in the above code is required while working on LS5 because individual users do not have access to core systems. If you have python3 on your personal computer and wanted to install multiqc (or any other package available through pip) you would typically omit the "--user" flag.
Installation may take a minute or two depending on your internet connection and you will see several progress bars. Eventually you should see a line that starts with "Successfully installed
" and then a long list of packages including multiqc-1.9. The additional packages listed are packages that multiqc will use to generate its figures.
Code Block | ||||
---|---|---|---|---|
| ||||
which multiqc
multiqc |
The first line should return something that starts with /home1/
then has a number and your user id followed by /.local/bin/multiqc
. The second line should tell you that there is an error as you didn't provide an argument for the analysis directory as well as that you are using multiqc version 1.9. If you see other results, try this more complicated installation and recheck the installation:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
python3 -m pip install --user multiqc |
If you still see something else, let the instructor know.
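One more thing worth checking is whether the directory that "--user" installs into is actually on your $PATH. The sketch below assumes python3 is available and that you are on a Linux system, where scripts installed with "--user" are placed in the bin directory under python's "user base".

```shell
# Ask python where "--user" installations go (its "user base").
# Assumes python3 is available on this system.
user_base="$(python3 -m site --user-base)"
echo "pip --user places scripts in: $user_base/bin"

# Check whether that bin directory is already part of $PATH.
case ":$PATH:" in
    *":$user_base/bin:"*)
        echo "That directory is on your PATH."
        ;;
    *)
        echo "That directory is NOT on your PATH; programs installed with --user will not be found by name."
        ;;
esac
```

On TACC the first line printed should match the location described above, ending in /.local/bin.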
The multiqc tutorial can be found here.
...
Github – an additional common method of getting files onto TACC
This is about using the git clone
command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website. Github repositories are a great thing to add to a single location in your $WORK2 directory.
Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data as the variants are well documented, and emerge in a controlled manner over the course of the evolution experiment. Initially cloning a github repository is exceptionally similar to using the wget
command to download the repository: it involves typing 'git clone
' followed by a web address where the repository is stored. As we did when downloading miniconda with wget, we'll clone the repository into a 'src' directory inside of $WORK2.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cd $WORK2
mkdir src
cd src |
If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created.
In a web browser, navigate to github and search for 'LTEE-Ecoli' in the top right corner of the page. The only result will be for barricklab/LTEE-Ecoli; click the green box for 'clone or download' and either press control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
git clone https://github.com/barricklab/LTEE-Ecoli.git |
You will see several download indicators increase to 100%, and when you get your command prompt back the ls
command will show a new folder named 'LTEE-Ecoli' containing a set of files. If you don't see said directory, or can't cd into that directory let the instructor know.
pip
In previous years, the pip installation program was used to install a few programs. While those programs will be installed through conda this year, the link here is provided to give a detailed walk through of how to use pip on TACC resources. This is particularly helpful for making use of the '--user' flag during the installation process as you do not have the expected permissions to install things in the default directories.
This concludes the linux and stampede2 refresher/introduction tutorial.
Genome Variant Analysis Course 2020 2021 home.