...
Throughout the course we have focused on aligning a single sample of reads against a single reference genome with a few exceptions, but sometimes we know that is not the biological case. Sometimes we know that there are multiple different things present in a single sample. Most common situation would be a sample with a chromosome as well as a plasmid. Here we will examine the same sample used in the novel DNA identification tutorial to see how inclusion of the 2nd reference file changes the mapping results.
Note |
---|
The discussion of concepts in this tutorial are identical to the advanced mapping with bowtie2 tutorial and work with the same data. Discussion of results is different in both tutorials. |
Learning objectives
- Understand how mapping against multiple reference genomes simultaneously decreases noise and increases confidence in variant calls.
- Map reads against multiple references, and evaluate mapping success/frequency.
Mapping against multiple references
...
Code Block | ||||
---|---|---|---|---|
| ||||
echo -e "File\tReference\tReads"; for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done |
...
Note that the alignment of columns is not awesome but can easily be copied to excel for better alignment .Additionally note that and that this would work for any number of sam files (aka samples).
Info | |||||
---|---|---|---|---|---|
| |||||
Notice that if you divide the reads mapping to the chromosome (1556647) by the total reads (2452484) you get 63.47% not the 65.74% listed above for mapping to the single reference.
|
...
|
While the data presented above may be referred to as 'tidy format' where each line contains 1 piece of information, making it easy for computers to parse, personally, my human eyes prefer tables where rows are related to samples (in this case sam files), and columns are related to values (in this case chromosomes mapped to). I'm sure that someone better versed in bash/awk/etc could put together a script to do so directly from the raw file. As I'm better versed in python, instead I use the for loop from the above command, and pipe the tidy results to a .tsv file, then use a python script to convert it to a table. How can you get a hold of this awesome tool you ask? BioITeam of course.
...
Again the format is not great, but can be copy pasted to Excel or similar for column alignment. and in either case this will produce a single row for each sample and single column for each reference, while at the same time, changing the * category (which from reading info about samtools you would see is a read that does not map to any reference) to a more human understood 'unmapped' classification.
You were promised a description of the complicated for loop above.
No Format |
---|
example bash command to generate input file
for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv
for sam in *.sam; # for loop looking at sam files
do # standard bash for loop formating
name=$(echo $sam | sed 's/.sam//'); # strip .sam ending from files
grep -v "^@" $sam | # remove header lines from consideration
awk -F '\t' '{print $3}' | # Pull the 3rd column from the sam file. 3rd column is name of chromosome read mapped to
sort | # Sort chromosome names (required for uniq to work correctly)
uniq -c | # take unique chromosome names and return the count of each
awk # Start of long awk command
-v sam="$name" # define awk variable "sam" using shell variable "$name" (see above) for later printing
'{sum+=$1} # define variable 'sum' as running count of how many total reads were evaluated
{print sam,$2,$1} # for each line (chromosome) print the sam file name, chromosome name, and number of reads mapped to that chromosome
END{print sam,"total",sum}' # after all lines (chromosomes) dealt with, print sam file name, the word "total", and the value of variable 'sum' (total number of reads)
OFS='\t'; # OutputFieldSeparator set to tab so output will be tsv
done # standard bash for loop end
> ref_counts.tsv # redirect output to ref_counts.tsv file |
The above box is a copy paste from the header information found in the bt2-mapping-ref-count-table.py file.
Code Block | ||||
---|---|---|---|---|
| ||||
less `which bt2-mapping-ref-count-table.py` |
Tip |
---|
Note 1 last trick here where the text between the ` marks (backtick NOT single quote) is evaluated fist allowing you to open the script for viewing in the less viewer. Frustratingly, once you hit the first ` mark, you lose the ability to tab complete often making it easier to type the which command with tab completion, then go back to add the ` marks. |
Next steps
As noted above, the human trios tutorial deals with calling SNVs on multiple samples at once. Consider how you could combine the two steps
Since we are looking at a bacteria and it's plasmid, you can run this through breseq. Tutorial can be found here.