Here we will download the installation file for miniconda (which we will use in the next section and throughout the course) using both scp and wget to compare and contrast their functionality. 

Using wget.

In a new browser window or tab, navigate to https://docs.conda.io/en/latest/miniconda.html, right-click "Miniconda3 Linux 64-bit" in the Linux installers section, and choose "Copy link address".

Code Block
languagebash
titleUsing the mkdir command to create a folder named 'src' inside of your $WORK2 directory
collapsetrue
cd $WORK2
mkdir src
cd src
Code Block
languagebash
titleUse the wget command to download the linux installer directly to your current directory
collapsetrue
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh

You should see a progress bar showing that the file is downloading. When it completes, the ls command will show a new installer script named 'Miniconda3-py39_4.9.2-Linux-x86_64.sh'.
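
Before running the installer, it can be worth confirming that the download actually produced a non-empty file. The sketch below assumes you are still in the directory you ran wget from; check_download is a hypothetical helper name made up for illustration, not part of wget.

```bash
# Hypothetical helper: succeed only if the named file exists and is not empty.
check_download() {
    local file="$1"
    if [ -s "$file" ]; then
        echo "OK: $file ($(wc -c < "$file") bytes)"
    else
        echo "MISSING or empty: $file" >&2
        return 1
    fi
}

# Usage after the wget command above.
installer="Miniconda3-py39_4.9.2-Linux-x86_64.sh"
check_download "$installer" || echo "Re-run the wget command and check the address for typos."
```

If the file is reported as missing or empty, the most likely cause is a typo in the address passed to wget.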

Using scp.

This is not necessary if you followed the wget commands above. Again, in a new browser window or tab you would navigate to https://docs.conda.io/en/latest/miniconda.html, but instead of right-clicking "Miniconda3 Linux 64-bit" in the Linux installers section and choosing "Copy link address", you would simply left-click and let the file download to your browser's Downloads folder. Using information from the SCP tutorial, you would then transfer the local 'Miniconda3-py39_4.9.2-Linux-x86_64.sh' file to the stampede2 remote location '$WORK2/src'.

Given that wget does not require MFA or the somewhat cumbersome juggling of 2 different windows, and leaves far less room for typos, hopefully you see why wget is preferable whenever left-clicking a link downloads a file directly.

Finishing conda installation

Regardless of which method you chose, the following set of commands will install conda. For later reference, if you are planning to install miniconda on other systems or your local laptop, the 'regular installation' links on this page may be useful.


Code Block
languagebash
titleThe following commands are then used to install miniconda, and only activate when you explicitly tell it to.
bash Miniconda3-py39_4.9.2-Linux-x86_64.sh
logout
#log back in using the ssh command. 
conda config --set auto_activate_base false


For help with the ssh command, please refer back to the Windows10 or MacOS tutorials. If you log out and back in one more time, what do you notice is different?

The first time you logged back in, your prompt should have looked like this:

No Format


The second time you logged back in, your prompt should now look like this:

No Format


If your prompt is different, please get the instructor's attention.

Setting up your first environment

Now that you have installed conda, we want to get started with our first environment. More information about environments and their purpose can be found here, but for now we will just think about them as different sets of programs and relevant dependencies being installed together. 

Code Block
languagebash
titleUsing the conda create command, make a new environment named "GVA2021"
conda create --name GVA2021
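
Creating an environment does not switch you into it; that happens with conda activate GVA2021. Besides changing the prompt, conda records the active environment in the CONDA_DEFAULT_ENV variable, which gives a simple way to check where you are. A small sketch; active_env is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: report which conda environment (if any) is active.
# conda sets CONDA_DEFAULT_ENV on activation; here an unset or empty
# value is treated as "nothing active".
active_env() {
    if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        echo "active environment: $CONDA_DEFAULT_ENV"
    else
        echo "no conda environment is active"
    fi
}

active_env
```

This is mostly useful inside scripts, where there is no prompt to look at.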

3. Using miniconda on TACC

The anaconda and miniconda interfaces to the conda system are becoming increasingly popular for controlling one's environment, streamlining new program installation, and tracking what versions of programs are being used. A comparison of the two interfaces can be found here. The deciding factor on which interface we will use is hinted at, but not explicitly stated, in the referenced comparison: TACC does not have a GUI and therefore anaconda will not work, which is why we installed miniconda above.

Similar to the module system that TACC uses, the "conda" system provides simple commands to download required programs/packages and to modify environment variables (like $PATH, discussed above). Conda has two huge advantages over the module system: #1 instead of relying on the employees at TACC to take a program and package it for the module system, anyone (including the authors publishing a new tool they want the community to use) can create a conda package for a program; #2 rather than being restricted to the TACC clusters, conda works on all platforms (including Windows and macOS) and deals with all the required dependency programs in the background for you.
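
Because so much of what conda and the module system do comes down to editing $PATH, looking at $PATH directly is a quick way to see what changed. A minimal sketch; list_path is a hypothetical helper name, not a conda or TACC command.

```bash
# Hypothetical helper: print each directory of a PATH-style string on its own line.
# The shell searches these directories top to bottom, so earlier entries win.
list_path() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

list_path                 # show your current search order
list_path | head -n 1     # the directory that is searched first
```

Running this before and after activating a conda environment should show the environment's bin directory appearing at the top of the list.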

Info
titleConda environments in the instructor's work

In my own work, I recently remarked to my PI that "I wish I had started using this 5 years ago", and was reminded that "it didn't exist 5 years ago, at least in its current super usable and popular format". It is entirely possible that future classes will be taught with only minimal references to the TACC module system, and this year's course will feature far fewer than any previous year.

Since the conda system can work on your personal computer, you may be thinking that you could just work on your personal computer for the duration of this class and ignore all the ssh commands and remote work. This is strongly discouraged. While you would be able to use the same programs in both settings (in most cases), the tutorials are developed with the speed of the stampede2 system in mind and, with some exceptions, try to keep "waiting for something to finish" no longer than the time it takes to read the next block of text. If you were to do these tutorials on your personal computer, the run times would increase significantly and it would be difficult to keep up with the rest of the class.

In Friday's lecture I will explain why installing and using conda on your local computer is still a good idea and how I am currently using it in conjunction with TACC.

In the next tutorial we will start assessing the quality of some NGS reads using the fastqc program. Before we can use it, we must install it. Similar to the module system described above, to install a program via conda, we need 3 things:

  1. Tell bash we want to use the conda program.
  2. Tell conda we want to install a new program.
  3. Name the program we want to install.


Code Block
languagebash
titleattempt to install the fastqc program using conda
conda activate GVA2021 
conda install fastqc

If you have already activated your GVA2021 environment, the first line will not do anything; if you have not, you will see that your prompt has changed to say (GVA2021) on the far left of the line. As for the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:

No Format
PackagesNotFoundError: The following packages are not available from current channels:

  - fastqc

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

More information about "channels" can be found here. By the end of this course you may find that the 'bioconda' channel is full of programs you want to use, and you may choose to permanently add it to your list of channels. The above command conda install fastqc, and others used in this course, would then work without the intermediate step of searching for the specific installation command or finding which channel the program is in. Information about how to do this, as well as a more detailed explanation of why it is bad practice to add large numbers of channels, can be found here.

For now, use the error message you saw above to try to install the fastqc program yourself.

Expand
titleIf you are having trouble finding the fastqc page on anaconda, the answer is here, as well as a description of the most likely problem you encountered.

https://anaconda.org/bioconda/fastqc

If you were unable to find this page, the most likely problem is that you entered fastqc into the search box, recognized that the result with 360,000+ downloads was likely the program you wanted, and then clicked the first hyperlink you found, which took you to the bioconda channel page instead of the fastqc program page. Personally, I think the entire box should be clickable to send you to the program page, but nobody has asked me.

Expand
titleClick here if you are unsure what command to use to install fastqc, or want to check your understanding
Code Block
languagebash
conda install -c bioconda fastqc

While there are two other possible commands listed, I tend to always start with the simplest command and work my way from there. The other two commands deal with accessing specific labels/versions of the program.

If all goes well, the installation command should give you the following output; answer "y" when prompted to confirm that you want to install the packages:

No Format
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fastqc-0.11.9              |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37|       h6964260_0         335 KB
    ------------------------------------------------------------
                                           Total:        10.0 MB

The following NEW packages will be INSTALLED:

  fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
  font-ttf-dejavu-s~ pkgs/main/noarch::font-ttf-dejavu-sans-mono-2.37-h6964260_0
  openjdk            pkgs/main/linux-64::openjdk-8.0.152-h7b6447c_3


Proceed ([y]/n)? y


Downloading and Extracting Packages
fastqc-0.11.9        | 9.7 MB    | #################################### | 100%
font-ttf-dejavu-sans | 335 KB    | #################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

There are three commonly used methods to verify you have a given program installed. You should try all three in order for the fastqc program:

  1. Code Block
    languagebash
    titleThe 'which' command can be used to search your $PATH variable for a command with a specific name, and return the location the command is stored in
    which fastqc
  2. Code Block
    languagebash
    titleMany commands accept an option of '--version' to simply access the program and return what version of the program is installed
    fastqc --version
  3. Code Block
    languagebash
    titleNearly all commands/programs accept "-h" or "--help" options to give you basic information about how the command or program works
    fastqc --help

Throughout the course, you will routinely use the above 3 commands to make sure that you have access to a given program, that it is the correct version, and to get an idea of how to construct commands to perform a given analysis step. For now, be satisfied that you have correctly installed fastqc as long as your output is not one of the following. In the next tutorial we will actually use fastqc. Examples of output you do not want to see from the above commands:

  1. /usr/bin/which: no fastqc in (<large list of directories specific to your TACC account>)

  2. -bash: fastqc: command not found

  3. -bash: fastqc: command not found
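
Since you will repeat this check for many programs, it can be wrapped in a small function. This is only a sketch: check_program is a hypothetical helper name, and it uses the shell built-in 'command -v', which searches $PATH much like 'which' does.

```bash
# Hypothetical helper: report whether a program can be found on $PATH.
check_program() {
    local prog="$1" location
    if location=$(command -v "$prog"); then
        echo "$prog found at $location"
    else
        echo "$prog not found on \$PATH" >&2
        return 1
    fi
}

check_program fastqc || echo "fastqc is missing: is the GVA2021 environment active?"
```

A "not found" result right after a successful install usually means the environment the program was installed into is not the one currently active.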

Additional common methods of getting files onto TACC

pip

This is about using the pip3 install command. pip is the standard package manager for the common programming language python. When labs put together new analysis programs/packages, increasingly they try to make these programs available for others to use via pip. Using pip3 rather than just pip specifies python version 3.

Here we will install the multiqc analysis program which compiles reports from a program called fastqc about the quality of fastq files from multiple different samples at one time. In the later portion of the class you may choose to work with this program to get a better overall view of multiple fastq files all at once rather than clicking through individual reports.

Code Block
languagebash
titlePreferred simple installation
pip3 install --user multiqc

*Note that the "--user" option in the above code is required while working on LS5 because individual users do not have permission to install into the system-wide directories. If you have python3 on your personal computer and want to install multiqc (or any other package available through pip), you would typically omit the "--user" flag.

Installation may take a minute or two depending on your internet connection and you will see several progress bars. Eventually you should see a line that starts with "Successfully installed" and then a long list of packages including multiqc-1.9. The additional packages listed are packages that multiqc will use to generate its figures.

Code Block
languagebash
titleVerify that multiqc was successfully installed
which multiqc
multiqc

The first line should return something that starts with /home1/ then has a number and your user id followed by /.local/bin/multiqc. The second line should tell you that there is an error as you didn't provide an argument for the analysis directory as well as that you are using multiqc version 1.9. If you see other results, try this more complicated installation and recheck the installation:

Code Block
languagebash
titleMore complicated invocation that may work in some instances when simple invocation fails
collapsetrue
python3 -m pip install --user multiqc

If you still see something else, let the instructor know.

The multiqc tutorial can be found here.


Github – an additional common method of getting files onto TACC

This is about using the git clone command. Git is a tool often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper in a github repository and publish the link in their paper or on their lab website. It is a good idea to keep cloned github repositories together in a single location in your $WORK2 directory.

Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data, as the variants are well documented and emerge in a controlled manner over the course of the evolution experiment. Cloning a github repository for the first time is very similar to using the wget command to download a file: it involves typing 'git clone' followed by the web address where the repository is stored. As we did when downloading the miniconda installer with wget, we'll clone the repository into a 'src' directory inside of $WORK2.

Code Block
languagebash
titleUsing the mkdir command to create a folder named 'src' inside of your $WORK2 directory
collapsetrue
cd $WORK2
mkdir src
cd src

If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created. 
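
If you would rather avoid the error message entirely, mkdir accepts a -p option that quietly succeeds when the directory already exists. The sketch below uses a temporary directory as a stand-in for $WORK2 so that it is safe to run anywhere; that stand-in is purely an assumption for illustration.

```bash
# Stand-in for $WORK2 so the example can be run anywhere (illustration only).
demo_work=$(mktemp -d)

mkdir -p "$demo_work/src"   # creates src
mkdir -p "$demo_work/src"   # second call: no error, nothing changes
ls -d "$demo_work/src"      # confirm the directory is there
```

On TACC the equivalent would be mkdir -p $WORK2/src followed by cd $WORK2/src.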

In a web browser, navigate to github and search for 'LTEE-Ecoli' in the top right corner of the page. The only result will be for barricklab/LTEE-Ecoli; click the green box for 'clone or download' and either press control/command + C on the address listed, or click the clipboard icon to copy the repository address. This image may be helpful if you are having trouble locating the green box.

Code Block
languagebash
titleOnce you have copied the address and are in the $WORK2/src directory clone the repository with 'git clone'
collapsetrue
git clone https://github.com/barricklab/LTEE-Ecoli.git

You will see several download indicators increase to 100%, and when you get your command prompt back the ls command will show a new folder named 'LTEE-Ecoli' containing a set of files. If you don't see said directory, or can't cd into that directory let the instructor know.
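
The folder name comes from the last part of the repository address with the .git suffix removed, which is why this clone produces 'LTEE-Ecoli'. Here is a small sketch of that naming rule; repo_dir is a hypothetical helper name, and git clone works this out for you without it.

```bash
# Hypothetical helper: the folder name git clone uses by default for an address.
repo_dir() {
    basename "$1" .git
}

url="https://github.com/barricklab/LTEE-Ecoli.git"
echo "git clone $url  ->  ./$(repo_dir "$url")/"

# After cloning, a quick check that the folder exists in the current directory:
if [ -d "$(repo_dir "$url")" ]; then
    echo "repository is present"
else
    echo "repository not found in $(pwd)"
fi
```

If the check reports that the repository is not found, make sure you are in the same directory you ran git clone from.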

pip

In previous years, the pip installation program was used to install a few programs. While those programs will be installed through conda this year, the link here is provided to give a detailed walkthrough of how to use pip on TACC resources. This is particularly helpful for making use of the '--user' flag during the installation process, as you do not have the permissions needed to install things in the default directories.
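
To see where the '--user' flag sends files, python can report its per-user installation directory. This is a minimal sketch that assumes python3 is on your $PATH; it only prints locations and installs nothing.

```bash
# Show where 'pip install --user' places files (typically ~/.local on Linux).
if command -v python3 >/dev/null 2>&1; then
    # The exit status of this command reflects whether user installs are
    # enabled, so it is ignored here; only the printed path is used.
    user_base=$(python3 -m site --user-base) || true
    echo "user installs go under: $user_base"
    echo "programs land in:       $user_base/bin"
else
    echo "python3 not found on \$PATH"
fi
```

An installation with the flag would then look like python3 -m pip install --user followed by the package name, and any programs it provides can only be run directly if the bin directory shown above is in your $PATH.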

This concludes the linux and stampede2 refresher/introduction tutorial.

Genome Variant Analysis Course 2021 home.