The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. The reads come from Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al., 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to differ from this reference at positions that correspond to mutations that arose during the evolution experiment.

Transferring Data

Rather than having you download these files from the SRA or ENA yourself, we have already downloaded the data files for this example. They are available in the following directory:

In this case the perl script is included as part of the bioperl module package. Rather than attempting to find it as part of a conda package or in some other repository, we will use the module version. If needing this script in the future outside of TACC,

Recall that we have used the which command to determine where executable files are located, and only take 2 pieces of information. Load the bioperl module and run the script without any options to display the help contents:
module load bioperl/1.007002
which -a


The information in this box is related to the path variable, perl programming libraries, having multiple copies of a script/file available in your path, and computer architecture. If you are not interested in this, you can skip this box.

On the head node, after you have loaded the bioperl module, there are actually 2 instances of the script available to you.
module load bioperl/1.007002
which -a

If you run this on an idev node you get 1 result, related to the bioperl module, but if you run it on the head node (outside idev) you get 2 results. On the head node, 1 result points to the BioITeam directory near where you keep finding your data (/corral-repl/utexas/BioITeam/), specifically its "bin" folder, which is full of binaries and (typically small) bash/python/perl/R scripts that someone has written to help the TACC community. The other is in a folder specifically associated with the bioperl module. You can load and unload the bioperl module to see the difference.

Why do you get 2 different results depending on whether you are inside or outside of an idev node?

This has to do with how compute nodes are configured. On stampede2, /corral-repl/ and all of its subdirectories are not accessible from compute nodes, so even though the BioITeam directory is in your $PATH, the command line on a compute node can't access it. This is why in later tutorials you have to log out of the idev session to copy new raw data files to work with.

If you try to run the BioITeam version of the script from the head node without the bioperl module loaded, you get an error message similar to the following:

module unload bioperl

Can't locate Bio/ in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/ line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/ line 8.

Deciphering error messages

The above error message is pretty helpful, but much less so if you are not familiar with perl. As I doubt anyone in the class is more familiar with perl than I am, and I am hardly familiar with perl at all, this is a great opportunity to explain how I would tackle the error message to figure out what is going on.

  1. "compilation aborted at /corral-repl/utexas/BioITeam/bin/ line 8." 
    1. The last line here actually tells us that the script did not get very far, only to line 8.
    2. My experience with other programming languages tells me that the beginning of a script is all about checking that the script has access to everything it needs to do what it is intended to do, so this has me thinking some kind of package might be missing.
  2. "(@INC contains: ..."
    1. This reads like the PATH variable, but it lists locations I don't recognize as being in my path, suggesting that what is missing is not some external binary or other program.
    2. Many of the individual paths include "lib" in one form or another. This reinforces the idea from above that some kind of package is missing.
  3. "Can't locate Bio/ in @INC"
    1. "Can't locate" reads like a plain text version of something being missing. It also reads like something generic that is not related to my system/environment (like all the listed directories) and not related to the details of the script I am trying to run (like the last line, which details the name of the script we tried to invoke).
    2. This is what should be googled for help solving the problem.
      1. The google results list similar error messages associated with different repositories/programs (github issues), suggesting some kind of common underlying problem.
      2. The 3rd result reads like a generic problem, and sure enough the answers detail needing to have the Bio library installed from cpan (perl's package management system).
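The comparison between @INC and the PATH variable in step 2 can be checked directly. The following is a minimal sketch rather than part of the original walkthrough: it prints the shell's search path one directory per line and then, if perl is available, prints perl's library search path the same way. The variable name `path_dirs` is just an illustrative choice.

```shell
# Show the shell's search path, one directory per line.
path_dirs=$(printf '%s\n' "$PATH" | tr ':' '\n')
printf '%s\n' "$path_dirs"

# Show perl's library search path (@INC) the same way, if perl is available.
# These are the directories perl checks when a script asks for a library.
if command -v perl >/dev/null 2>&1; then
    perl -e 'print "$_\n" for @INC'
fi
```

Running this before and after `module load bioperl` should make it easier to see whether loading the module changes either list on your system.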

We get this error message because, while perl is installed on stampede2, the required ... library is not available by default, but it is easily made available with the bioperl module. As it is likely rare that you will need to convert sequence files between different formats, bioperl is not listed as one of the modules in the .bashrc file in your $HOME directory that you set up yesterday, but if you find yourself using the command `module load bioperl` often, you may want to add it.

After loading the bioperl module to get past the error message, run the script from the BioITeam without any arguments to get the help message:

On the head node, after loading the bioperl module, you have access to the program in 2 different locations.
module load bioperl


How does the computer know which location to use?

  • It will use whatever location it finds earliest in the $PATH,
  • which is the same as the top line in the which -a command output,
  • which is the same as the line printed if you run the `which` command without the "-a".

Using just the script name by itself will run whichever copy is found first, but you can always force the computer to use a given copy by specifying the full path to the copy you want. Thus, the following 2 commands are not equal:

While the commands are different, both copies can use the same bioperl library when the bioperl module is loaded and thus work. 
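The lookup rule described above can be tried safely away from the real scripts. The following is a minimal sketch under stated assumptions: `demo_tool` and the `first`/`second` directories are hypothetical names created in a temporary directory purely for illustration, and `command -v` is used in place of `which` because it is built into the shell.

```shell
# Build two throwaway directories, each holding a script with the same (hypothetical) name.
demo=$(mktemp -d)
mkdir -p "$demo/first" "$demo/second"
printf '#!/bin/sh\necho "copy in first"\n' > "$demo/first/demo_tool"
printf '#!/bin/sh\necho "copy in second"\n' > "$demo/second/demo_tool"
chmod +x "$demo/first/demo_tool" "$demo/second/demo_tool"

# Put both directories at the front of PATH; "first" comes earlier than "second".
PATH="$demo/first:$demo/second:$PATH"
hash -r   # forget any previously remembered command locations

# The bare name resolves to the earliest match in PATH.
bare_result=$(demo_tool)
echo "$bare_result"

# A full path forces a specific copy, regardless of PATH order.
full_result=$("$demo/second/demo_tool")
echo "$full_result"

# command -v reports which copy the bare name resolves to.
resolved=$(command -v demo_tool)
echo "$resolved"
```

Swapping the order of the two directories in the PATH assignment and rerunning should flip which copy the bare name runs, while the full-path call keeps running the same copy.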

Convert a gbk reference to an embl reference


Try reading through the program help when you run the script without any options to see the syntax required
module load bioperl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
head -n 100 NC_012967.1.embl


While you could consult the previous year's tutorial for installing bowtie2 via the module system, this year's course will use the conda system to install it. The bowtie2 home page can be found here, and if you needed to download the program itself, version 2.5.1 could be downloaded here. Instead, we want to make sure that bowtie2 version 2.5.1 is installed via conda, like we did for fastqc and cutadapt. See if you can figure out how to install bowtie2 into a new conda environment named "GVA-bowtie2-mapping". Note that the "2" is actually part of the program name: it is neither a typo nor a comment on the program version.


We have actually massively under-utilized stampede2 in this example: we ran the command using only 8 processors rather than the 48 we have available in our idev session. If we increase to 48 total processors and rerun the analysis, how long do you expect the command to take?

Modify the previous mapping command to re-run this analysis using all 48 cores.

You need to increase the -p ("processors") option from 8 to 48.

Click here to check your answer
bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the times listed at the end of each command

How much faster was it using all available processors?

8 processors took a little over 5 minutes; 68 processors took ~1 minute. Can you think of any reasons why it was ~5x faster rather than ~8x faster? Note that the times here are incorrect, but the principle is the same.


Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?
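One common back-of-the-envelope way to think about this overhead is Amdahl's law, which is a standard model and not something measured in this tutorial. It assumes a fraction f of the work can be split across p processors while the rest stays serial. Plugging in the rough figures quoted above (8 versus 68 processors, ~5x faster), which are themselves flagged as inaccurate, gives an illustrative value for f only:

```latex
% Amdahl's law: speedup on p processors when a fraction f of the work parallelizes
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}}

% Relative speedup when going from 8 to 68 processors
\frac{T(8)}{T(68)} = \frac{(1 - f) + f/8}{(1 - f) + f/68}

% Setting this ratio to the ~5x quoted above and solving for f
f = \frac{4}{4 + \tfrac{1}{8} - \tfrac{5}{68}} \approx 0.987
```

Under this model, even if roughly 99% of the work parallelizes, the remaining serial ~1% is enough to turn an ideal 8.5x gain into about 5x.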

An additional note from the stampede2 user manual is that while there are 68 cores available, and each core is capable of hyperthreading 4 processors per core, using all 272 processors is rarely the go-to solution. While I am sure that this is more rigorously and appropriately tested in some other manner, I ran a test using different numbers of processors with the following results:

-p option | time (min:sec)

Again while there are almost certainly better ways to benchmark this, there are 2 things of note that are illustrated here:

  1. Roughly doubling the number of processors does not cut the time in half, and while some applications may use hyperthreading on the individual cores appropriately, assuming that a program can or will do so can actually make things take longer.
  2. Working on your laptop (which likely has at most 4-8 processors available) would significantly increase the amount of time these tutorials take.


  • In the bowtie2 example, we mapped in --local mode. Try mapping in --end-to-end mode (aka global mode).

  • Do the BWA tutorial so you can compare their outputs (note BWA has a conda package making it even easier to try).
    • Did bowtie2 or BWA map more reads?
    • In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
    • Which aligner took less time to run? Are there any options you can change that:
      • Lead to a larger percentage of the reads being mapped? (increase sensitivity)
      • Speed up run time without causing many fewer reads to be mapped? (increase performance)
