Exome Capture Metrics GVA2022
Data used in this tutorial
Recommended but not required that you first complete the trios tutorial and use the data you generated here. Alternatively, canned data provided.
Overview
As we discussed during some of our variant calling, the majority of the reads, and bases within reads, will map perfectly with the reference genome (unless you are using a reference genome that needs some improvement in which case, you should work towards that end possibly with the spades or novel DNA tutorials as command line jump offs. As so much of the reads are just telling you what you already know, it is possible to design your experiments to target reads to a given location or sets of locations in the genome via sequence capture. One of the most common things to target is the exome, or a portion of the exome as mutations in exons are more likely to have a functional consequence than mutations in introns or intergenic regions.
As we mentioned in our discussions on experimental design, it is always better to start from the best data possible by conducting a well planned out experiment. Part of that when working with exome (or other targeted sequencing approaches) is to make sure that the enrichment that you are doing is successful, and that you know how much sequencing you need to do to have good coverage of what you are looking at. There are actually several different ways to measure sequence capture depending on what you are doing. You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible. Or you might care more about maximizing on-target capture, to make sure you get data from every region of interest. These two are usually negatively correlated.
Learning objectives
- Install picard to a conda environment.
- Use picard's CollectHsMetrics function to determine what fraction of reads were associated with the exome of chromosome 20.
Picard
Picard is another tool like like gatk also put out by the broad. As of gatk version 4.0, gatk began bundling picard with all of its distributions. This means if you have already done the gatk tutorial you already have access to picard! Alternatively, as picard is just a subset of the gatk package, it may be of interest for you to only install picard tools rather than the full suite. Further, conda offers a "picard-slim" packaging that includes most of the picard tools, but not those few that would otherwise require an associated R program.
Installing Picard
Here I will present 2 different methods of installing/accessing picard and its associated tools. The most basic guidance I have is that I don't see a lot of downside to installing the full gatk package unless you know you only need a limited number of tools that are in picard, and don't want to 'clutter' up your environment.
If you wish to install picard as part of the gatk package (allows you to access commands via gatk toolname)
As was done in our gatk tutorial, the commands for putting gatk in its own environment and activating it are below. The gatk tutorial contains slightly more information about this installation if you are interested.conda create --name GVA-gatk -c bioconda gatk4 conda activate GVA-gatk gatk --version
The Genome Analysis Toolkit (GATK) v4.2.6.1 HTSJDK Version: 2.24.1 Picard Version: 2.27.1
If you wish to install picard as a stand alone package (allows you to access commands via picard toolname). Since the only reason that jumps out at me to do this is to not "clutter" things, we'll go with the "slim" package, though changing to the full picard package would not be difficult.
conda create --name GVA-picard -c bioconda picard-slim conda activate GVA-picard picard --version
The version command above will actually print the picard help file (aka list of all picard tools). When conda was downloading the package, it appears we are getting version 2.27.3, but I don't see any way to directly access picard's installed version from the command line, though individual tool's version can be access via
picard ToolName --version.
The listing of all of picard's tools tells us that we have successfully installed picard though.
Get some data
Here is a link to the full picard documentation and here is a link to the CollectHsMetrics tool
To run CollectHsMetrics, there are three prerequisites: 1) A bam file and 2) a list of the genomic intervals that were to be captured and 3) the reference (.fa). As you would guess, the BAM and interval list both have to be based on exactly the same genomic reference file.
For our tutorial, the bam files are one of these:
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
I've started with one of Illumina's target capture definitions (the vendor of your capture kit will provide this) but since the bam files only represent chr20 data I've created a target definitions file from chr20 only as well. Here they are:
/corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist
And the relevant reference is:
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai
mkdir $SCRATCH/GVA_Exome_Capture cd $SCRATCH/GVA_Exome_Capture cp $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam . cp $SCRATCH/GVA_Human_trios/raw_files/target_intervals.chr20.reduced.withhead.intervallist . cp $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa . cp $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa.fai .
Using Picard's "CollectHsMetrics" function to evaluate capture
As with other gatk based commands we have run, these commands while quick (2-3 minutes) will not finish on the head node due to memory problems
Make sure you are on an idev node with hostname or showq -u commands. Additional information for launching an idev node is found here.
The following command is rather large, note that there are a total of 4 commands in the next window: 2 corresponding to loading the correct environment on the compute node, and 2 corresponding to CollectHsMetrics commands. Make sure your copy paste does not introduce additional line returns.
conda activate GVA-gatk gatk CollectHsMetrics BAIT_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist TARGET_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist INPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam REFERENCE_SEQUENCE=hs37d5.fa OUTPUT=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats # conda activate GVA-picard picard CollectHsMetrics BAIT_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist TARGET_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist INPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam REFERENCE_SEQUENCE=hs37d5.fa OUTPUT=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats
Both commands will produce identical results, and nearly identical output to the screen when run (provided you have installed things correctly). Your output is expected to be similar to this:
INFO 2022-06-22 23:04:42 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:53s. Time for last 1,000,000: 53s. Last read position: 20:14,922,373 INFO 2022-06-22 23:05:31 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:01:43s. Time for last 1,000,000: 49s. Last read position: 20:32,404,938 INFO 2022-06-22 23:06:24 CollectHsMetrics Processed 3,000,000 records. Elapsed time: 00:02:36s. Time for last 1,000,000: 52s. Last read position: 20:43,530,401 INFO 2022-06-22 23:07:16 CollectHsMetrics Processed 4,000,000 records. Elapsed time: 00:03:27s. Time for last 1,000,000: 51s. Last read position: 20:56,062,712 INFO 2022-06-22 23:07:40 TheoreticalSensitivity Creating Roulette Wheel INFO 2022-06-22 23:07:41 TheoreticalSensitivity Calculating quality sums from quality sampler INFO 2022-06-22 23:07:41 TheoreticalSensitivity 0 sampling iterations completed INFO 2022-06-22 23:07:41 TheoreticalSensitivity 1000 sampling iterations completed INFO 2022-06-22 23:07:41 TheoreticalSensitivity 2000 sampling iterations completed INFO 2022-06-22 23:07:42 TheoreticalSensitivity 3000 sampling iterations completed INFO 2022-06-22 23:07:42 TheoreticalSensitivity 4000 sampling iterations completed INFO 2022-06-22 23:07:43 TheoreticalSensitivity 5000 sampling iterations completed INFO 2022-06-22 23:07:43 TheoreticalSensitivity 6000 sampling iterations completed INFO 2022-06-22 23:07:44 TheoreticalSensitivity 7000 sampling iterations completed INFO 2022-06-22 23:07:44 TheoreticalSensitivity 8000 sampling iterations completed INFO 2022-06-22 23:07:45 TheoreticalSensitivity 9000 sampling iterations completed INFO 2022-06-22 23:07:45 TheoreticalSensitivity Calculating theoretical het sensitivity INFO 2022-06-22 23:07:46 TargetMetricsCollector Calculating GC metrics [Wed Jun 22 23:07:47 CDT 2022] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 4.09 minutes. Runtime.totalMemory()=1581252608 Tool returned: 0
The aggregate capture data is in exome.picard.stats, but if you check it with the head command you'll see its format isn't very nice; here's a linux one-liner to reformat the two useful lines (one is the header, the other is the data) into columns, along with the result:
grep -A 1 '^BAIT' exome.picard.stats | awk 'BEGIN {FS="\t"} {for (i=1;i<=NF;i++) {a[NR"_"i]=$i}} END {for (i=1;i<=NF;i++) {print a[1"_"i]"\t"a[2"_"i]}}'
Here is the output of the above command, DO NOT! paste this into the command line.
BAIT_SET target_intervals BAIT_TERRITORY 1843371 BAIT_DESIGN_EFFICIENCY 1 ON_BAIT_BASES 94769476 NEAR_BAIT_BASES 54852614 OFF_BAIT_BASES 160815315 PCT_SELECTED_BASES 0.481972 PCT_OFF_BAIT 0.518028 ON_BAIT_VS_SELECTED 0.633392 MEAN_BAIT_COVERAGE 51.410962 PCT_USABLE_BASES_ON_BAIT 0.272266 PCT_USABLE_BASES_ON_TARGET 0.212359 FOLD_ENRICHMENT 519.58801 HS_LIBRARY_SIZE 5325189 HS_PENALTY_10X 225.120249 HS_PENALTY_20X -1 HS_PENALTY_30X -1 HS_PENALTY_40X -1 HS_PENALTY_50X -1 HS_PENALTY_100X -1 TARGET_TERRITORY 1843371 GENOME_SIZE 3137454505 TOTAL_READS 4579959 PF_READS 4579959 PF_BASES 348076884 PF_UNIQUE_READS 4208881 PF_UQ_READS_ALIGNED 4161913 PF_BASES_ALIGNED 310437405 PF_UQ_BASES_ALIGNED 286584646 ON_TARGET_BASES 73917336 PCT_PF_READS 1 PCT_PF_UQ_READS 0.918978 PCT_PF_UQ_READS_ALIGNED 0.988841 MEAN_TARGET_COVERAGE 40.099001 MEDIAN_TARGET_COVERAGE 7 MAX_TARGET_COVERAGE 1293 MIN_TARGET_COVERAGE 0 ZERO_CVG_TARGETS_PCT 0.005348 PCT_EXC_DUPE 0.076836 PCT_EXC_ADAPTER 0 PCT_EXC_MAPQ 0.018709 PCT_EXC_BASEQ 0.04137 PCT_EXC_OVERLAP 0.084969 PCT_EXC_OFF_TARGET 0.540013 FOLD_80_BASE_PENALTY 20.049501 PCT_TARGET_BASES_1X 0.924958 PCT_TARGET_BASES_2X 0.820112 PCT_TARGET_BASES_10X 0.473017 PCT_TARGET_BASES_20X 0.423769 PCT_TARGET_BASES_30X 0.385684 PCT_TARGET_BASES_40X 0.34762 PCT_TARGET_BASES_50X 0.308718 PCT_TARGET_BASES_100X 0.137769 PCT_TARGET_BASES_250X 0.011475 PCT_TARGET_BASES_500X 0.000228 PCT_TARGET_BASES_1000X 0.000022 PCT_TARGET_BASES_2500X 0 PCT_TARGET_BASES_5000X 0 PCT_TARGET_BASES_10000X 0 PCT_TARGET_BASES_25000X 0 PCT_TARGET_BASES_50000X 0 PCT_TARGET_BASES_100000X 0 AT_DROPOUT 2.023144 GC_DROPOUT 10.378868 HET_SNP_SENSITIVITY 0.720608 HET_SNP_Q 6 SAMPLE LIBRARY READ_GROUP
Taking the output even further
It is rare that you ever want to work with a single sample. While this format is nice for a single sample, comparing the same data across multiple samples would not be the easiest to do with this format. Instead, putting this information to a file, then using grep and awk you could make a small table of the specific information you want.
Since I don't actually know what capture kit was used to produce these libraries, these may or may not accurately reflect how well the library prep went, but generally speaking having >40x average coverage on your baits (the target regions) is good, as is over 500 fold enrichment. While it may be tempting to consider 52% of reads being 'off bait' as a bad thing, instead consider that ~48% of reads mapped to just ~0.06% of the genome.
Additional Exercises:
These results were based on sample NA12878. How do the other 2 samples (NA12891, and NA12892) from the trios tutorial compare for their enrichment? If you are unsure how differences in enrichment could effect analysis, this is a great discussion topic.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.