Overview

Throughout the course we have focused on aligning a single sample of reads against a single reference genome with a few exceptions, but sometimes we know that is not the biological case. Sometimes we know that there are multiple different things present in a single sample. Most common situation would be a sample with a chromosome as well as a plasmid. Here we will examine the same sample used in the novel DNA identification tutorial to see how inclusion of the 2nd reference file changes the mapping results

Learning objectives

Understand how mapping against multiple reference genomes simultaneously decreases noise and increases confidence in variant calls.

Mapping against multiple references

Why is mapping against multiple references at the same time preferred to mapping against multiple different references 1 at a time? The answer relates to identifying real mutations from errors. As we discussed in our initial mapping tutorial/presentation, mapping scores and alignment scores are both related to how confident the mapping program is that an individual read is mapped to the correct location in the genome, and how well that that read aligns at that location.Imagine a hypothetical situation where in you have a 200bp region of a low copy plasmid that differs from the chromosome by a single base.

If you map against both references at the same time, the mapper will associate each read uniquely to either the plasmid region or the chromosome without having any mismatches.
If you map against the references separately, both will result in 50% of the reads aligned to this region as having a high quality mapping score, and a slightly diminished alignment score for both runs. In the case of clonal haploids, the 50% frequency would be concerning as 50% shouldn't occur under normal circumstances, but the duplication of a region followed by a single base change in one of the copies would produce an identical result.

Get some data

Here we will use the same data as was used in the novel DNA identification tutorial plus an additional reference file associated with the plasmid known to be present.

mkdir $SCRATCH/GVA_advanced_mapping
cp $BI/gva_course/novel_DNA/* $SCRATCH/GVA_advanced_mapping
cp $BI/gva_course/advanced_mapping/* $SCRATCH/GVA_advanced_mapping
cd $SCRATCH/GVA_advanced_mapping
ls

^ expected 2 fastq files and 2 gbk reference files

Set up the run

Hopefully by now, it will not surprise you that we will have a 3 step process: converting references to fasta, indexing the references, and mapping the reads. If it does recall the read mapping tutorial, and the novel DNA identification tutorial for additional information and examples. Here, less description of individual steps will be given.

Convert reference to fasta
```
module load bioperl
bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta
bp_seqconvert.pl --from genbank --to fasta < GFP_Plasmid_SKO4.gbk > GFP_Plasmid_SKO4.fasta
```
Recall that fasta files can have multiple sequence entries

And thus we could combine our new fasta files using the cat command and piping to create a single file containing both reference sequences. In that case, the same proceedure as was used in the original read mapping tutorial could be followed and will produce the same results as obtained here. Besides making it less obvious how bowtie2 is handling multiple reference sequences and semi defeating the purpose of this tutorial. I think it better practice to keep the reference files separate as from personal experience I can tell you it is easier to supply a list of different fasta files to be indexed than it is to edit fasta files for adding/removing multiple individual fasta entries as I recently did for a project involving 7 different references.
Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
Index the reference
```
mkdir bowtie2
module load bowtie/2.3.4
bowtie2-build --threads 48 CP009273.1_Eco_BW25113.fasta,GFP_Plasmid_SKO4.fasta bowtie2/BW25113_pSKO4
```
^ this is the first time we have used the --threads command while making our index, this is just giving the index command access to all the threads/processors available. While increasing the amount of text printed to our terminal, it also speeds up the process.
^ additionally, you may notice that it uses a much shorter name of the index inside the bowtie2 directory. Recall advice about naming and the risk of making multiple 'my-bowtie2-reference' files and being able to easily know what they arefor.

The following command will take ~13 seconds to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial.

Map reads

bowtie2 --very-sensitive-local -t -p 48 -x bowtie2/BW25113_pSKO4 -1 SRR4341249_1.fastq -2 SRR4341249_2.fastq -S bowtie2/SRR4341249-vsl.sam

Here we map in with the same conditions as was used in the novel DNA identification tutorial. A description of the options which may be new to you is as follows:

Click here to explain the new options.

option	effect
--very-sensitive-local	map in very sensitive local alignment mode
-t	print time info
-p 48	use 48 processors
-x bowtie2/BW25113_pSKO4	index
-1 SRR4341249_1.fastq -2 SRR4341249_2.fastq	reads used for mapping
-S bowtie2/SRR4341249-vsl.sam	Sam file detailing mapping

Why did this mapping take ~13 seconds despite having 2 references to map against while in the novel DNA identification tutorial it took ~5 minutes?

In the novel DNA identification tutorial, we additionally requested unmapped reads to be printed directly to a fastq file for downstream use.

Evaluating the results

Results from mapping output

What percent of reads mapped?

The stdoutput of the program listed:

99.31% overall alignment rate

For reference incase you did not do the novel DNA tutorial, only 65.74% of reads mapped to the chromosome alone. This suggests that more than 1/3 of all the reads came from the plasmid.

1 liner for for determining frequencies

Handy 1 liner to determine frequencies of reads mapping to each reference. Command broken down below.

echo -e "File\tReference\tReads";for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done

Expected output:

File	Reference	Reads
SRR4341249-vsl	*	13730
SRR4341249-vsl	CP009273	1556647
SRR4341249-vsl	GFP_Plasmid_Sequ	882107
SRR4341249-vsl	total	2452484

Note that the alignment of columns is not awesome but can easily be copied to excel for better alignment.

Additionally note that this would work for any number of sam files (aka samples).

While the data presented above may be referred to as 'tidy format' where each line contains 1 piece of information, making it easy for computers to parse, personally, my human eyes prefer tables where rows are related to samples (in this case sam files), and columns are related to values (in this case chromosomes mapped to). I'm sure that someone better versed in bash/awk/etc could put together a script to do so directly from the raw file. As I'm better versed in python, instead I use the for loop from the above command, and pipe the tidy results to a .tsv file, then use a python script to convert it to a table. How can you get a hold of this awesome tool you ask? BioITeam of course.

Alternative command to create a mapping frequency table.

for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv;bt2-mapping-ref-count-table.py -i ref_counts.tsv -o ref_counts_table.tsv

While the above doesn't print anything, 'cat ref_counts_table.tsv' prints the following:

Sample	CP009273	GFP_Plasmid_Sequ	Unmapped	Total
SRR4341249-vsl	1556647	882107	13730	2452484

Again the format is not great, but can be copy pasted to Excel for column alignment

Next steps

As noted above, the human trios tutorial deals with calling SNVs on multiple samples at once. Consider how you could combine the two steps

Since we are looking at a bacteria and it's plasmid, you can run this through breseq. Tutorial can be found here.

Return to GVA2020

Advanced bowtie2 -- GVA202