Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

So why is it when the module is loaded it finds the 1.3 version rather than the 0.1.18 version? The answer is thats what the $PATH variable is set to find. Consider the following information :

 

 

 

You may have noticed that module list samtools told you that you had version 1.3 loaded already. This raises the question of why samtools points at version 0.1.18 in the BioITeam resources rather than the module version. To understand this, consider the following information about the module system and the $PATH variable:

...

  1. When you type a command, only locations that are in your PATH variable are searched for an executable command matching that name.
  2. When the command is found in your PATH, the computer immediately substitutes the location it was found for the short command name you entered, and stops searching.
    1. This means that things that are early in your path are always searched first. In some extreme cricumstances if you add a bunch of locations with lots of files to the front of your PATH, you can actually slow down your entire computer, so try to limit the path variable to only look in directories containing executables.
  3. The module system always assumes that when you load a module, you intend to use it, and thus puts the executables for that module at the front of your PATH.
  4. In your .bashrc file, modules are loaded first (including samtools). 
  5. After modules are loaded, we further manipulate your PATH variable several times. The last section involving breseq has 2 alternative manipulations:
    1. The first which you can see we have commented out:

      No Format
      # export PATH=$BI/breseq/bin:$PATH
      1. This command says make the variable PATH equal to the variable BI plus /breseq/bin and then add on the existing value of $PATH
    2. The second we actually use.

      No Format
      export PATH=$PATH:$BI/breseq/bin
      1. This command says make the variable PATH equal to the existing value of $PATH variable and then add on BI plus /breseq/bin 
  6. Warning
    titleOne of the most important lessons you can ever learn
    anytime

    Anytime you manipulate your PATH variable you always want to make sure that you include $PATH on the right side of the equation somewhere separated by : either before it, after it, or on both sides of it if you want it in the middle of 2 different locations. As we are explaining right now, there are reasons and times to put it in different relative places, but if you fail to include it (or include a typo in it by calling it say $PTAH) you can actually remove access to all existing commands

    (

    including the most basic things like "ls" "mkdir" "cd".

  7. Together all this means that the last modification to your PATH was the addition of a different program (breseq) to your PATH, unexpectedly breseq also includes a samtools distribution making it the first thing that is found when you use the command samtools. This is not an uncommon type of thing for programs to do when they rely on a specific version of a program to run themselves.

...

titleclick here for instructions on how to test/verify that this is true
Expand
titleReload the module version of samtools which is the correct one for the tutorial.

Simply reload samtools using the module system, check the version, and where that which version is now being called fromused.

Code Block
languagebash
titlestill stuck?
collapsetrue
module load samtools
samtools  # check output for version information
which samtools

...

Warning
titleDo not run on head node

Use showq -u to verify you are still on the idev node.

Code Block
titleUse this command to restart an idev session if you are not on one
collapsetrue
idev  -m 120 -r CCBB_Bio_Summer_School_2016_Day25.23.17PM -A UT-2015-05-18 -N 2 -n 8
Code Block
languagebash
titleCommands to be executed in order...
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam -o SRR030257.sorted.bam
samtools index SRR030257.sorted.bam
Tip
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet.

...

  • Expand
    titleWhat new files were created by these commands?
    Code Block
    languagebash
    titleList the contents of the output directory
    ls 
    Code Block
    titleExpected output
    NC_012967.1.fasta-1  # This is  SRR030257.sorted.bam.bai
    NC_012967.1.fasta.fai  SRR030257.sam
    SRR030257.bam          SRR030257.sorted.bam
    
    Expand
    titleWhy didn't we name the output SRR030257.sorted.bam in the samtools sort command?
    Samtools appends an extra .bam to whatever we put here, so it would have created SRR030257.sorted.bam.bam, and then we would have had to make a joke about the Flintstones.
    the number 1 not the letter L. Several people were confused about this yesterday, it is not necessary for any reason other than to give a single column of output regardless of window size.
    
    Code Block
    titleExpected output
    NC_012967.1.fasta
    NC_012967.1.fasta.fai
    SRR030257.bam
    SRR030257.sam
    SRR030257.sorted.bam
    SRR030257.sorted.bam.bai
  • Expand
    titleCan you guess what a *.bai file is?

    Sure enough, it's the index file for the BAM file.

...

Expand
Optionpurpose
-ugenerates uncompressed BCF output
-f NC_012967.1.fasta.fai

faidx indexed reference sequence file that has a corresponding faidx index .fai file

SRR030257.sorted.bamBAM input file to calcluate calculate pileups from
> SRR030257.bcfDirect output to SRR030257.bcf

...

Code Block
titleThis is *one* command. Put it all on one line.
bcftools viewcall -v -c -g SRR030257.bcf > SRR030257.vcf

...

Expand
titleSee if you can start with the base command "bcftools" and figure out what each option is doing. Click here when you think you know.
Optionpurpose
viewcallspecific command to be executed by bcftools for calling SNP and indels
-v

output potential variant sites only

-cSNP consensus calling
-gcall genotypes at variant sites
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file

...

Expand
titleOptional: For the data we are dealing with, predictions with an allele frequency not equal to 1 are not really applicable. (The reference genome is haploid. There aren't any heterozygotes.) How can we remove these lines from the file?

Try looking at grep --help to see what you can come up with.

Code Block
languagebash
titleHere for answer
collapsetrue
grep -v *something*  # The -v flag inverts the match effecitvely showing you everything that does not match your input
Expand
titleGoing farther
Code Block
cat inputSRR030257.vcf | grep AF1=1 > outputSRR030257.filtered.vcf

Is not practical, since we will lose vital VCF formatting and may not be able to use this file in the future for formats which require that formatting.

Code Block
cat inputSRR030257.vcf | grep -v AF1=0 > outputSRR030257.filtered.vcf

Will preserve all lines that don't have a AF=0 value and is one way of doing this.

Code Block
sed -i '/AF1=0/ d' inputSRR030257.vcf

Is a way of doing it in-line and not requiring you to make another file. (But it writes over your existing file!)

...

Follow the same directions to call variants in the BWA or Bowtie2 mapped reads with the improved quality. Just be sure you don't write over your old files. Maybe create new directories like BDIB_samtools_bwa and BDIB_samtools_bowtie_improved for the output in each case. As another fun version control issue, using the module system you can see that bowtie actually has 2 versions available: 1.1.2 and 2.2.6 ... these version could not be more different to the point where they should really be named different things. bowtie 1.1.2 is included only to keep old scripts that rely on it running, and is surpassed in everyway by bowtie2. If so desired, you could figure out how to run it yourself as a test of learning an unfamiliar program, or you could include any the output from said run from the precanned location below:

Warning

If you do not have the output from the Mapping tutorial, run the first 4 commands to copy over the output that would have been produced. Then, you can immediately start this tutorial!

Code Block
cds
mkdir BDIB_samtools_tutorial  # if directory already exists, don't worry it won't delete the current contents
cd BDIB_samtools_tutorial
cp -r $BI/gva_course/mapping/bowtie2 .
 
# optional commands for optional and more in-depth tutorials
cp -r $BI/gva_course/mapping/bowtie .  #note these files were created with bowtie NOT bowtie2 and is a drastically different read mapper despite the similar name
cp -r $BI/gva_course/mapping/bwa . 

These precanned results will be used in the optional upcoming bedtools tutorial as well, or you can simply compare the output .vcf files for more simple answers

...

  • Which mapper finds more variants?
  • Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
  • How many high quality mutations are there in these E. coli samples relative to the reference genome?
  • Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). Look into more sophisticated variant calling with GATK. We recommend starting from the GATK best practice pageThis will be a separate tutorial for tomorrow.

Module  version of samtools

...