DNAseq Variant Calling Pipeline
Identification and annotation of SNPs and/or somatic mutations relative to a reference genome. 10-hour minimum ($730 internal, $930 external) per project.
1. Quality Assessment
Quality of data assessed by FastQC and SAMStat; results of quality assessment will be evaluated prior to downstream analysis.
- Deliverables:
- reports generated by FastQC and SAMStat
- metrics specific to hybrid selection analysis, calculated using Picard, also available
- Tools Used:
- FastQC: (Andrews 2010) used to generate quality summaries of data:
- Per base sequence quality report: useful for deciding if trimming necessary.
- Sequence duplication levels: evaluation of library complexity.
- Overrepresented sequences: evaluation of adapter contamination.
- SAMStat: (Lassmann et al. 2011) provides summary statistics at both fastq and SAM/BAM alignment levels.
- Picard CalculateHsMetrics: (http://broadinstitute.github.io/picard) evaluates hybrid selection protocols (target coverage and AT/GC dropout levels).
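To make the per base sequence quality report concrete, here is a minimal Python sketch of the underlying idea: average the Phred score at each read position. It is not FastQC's implementation. It assumes four-line FASTQ records and Phred+33 encoding, and the function name `per_base_mean_quality` is hypothetical.

```python
from io import StringIO


def per_base_mean_quality(handle, offset=33):
    """Return the mean Phred quality at each read position.

    Assumes four-line FASTQ records (header, sequence, '+', quality)
    and Phred+33 encoding; both are assumptions of this sketch.
    """
    totals, counts = [], []
    for i, line in enumerate(handle):
        if i % 4 != 3:  # only the fourth line of each record holds qualities
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads: one with quality 40 at every base, one with quality 0.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n!!!\n")
means = per_base_mean_quality(example)
```

A drop in the mean toward the 3' end of the reads is the usual sign that trimming is worth considering.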
2. Mapping
Mapping to genome reference using BWA-mem (alternative algorithms available on request).
- Deliverables:
- bam files from the initial alignment (BWA-mem by default, though other algorithms are available if desired)
- bam files resulting from further processing using GATK
- Tools Used:
- BWA-mem: (Li 2013) primary aligner used to generate first pass read alignments (BWA-aln and BWA-sampe also available if desired, as are bowtie/bowtie2).
- GATK: (McKenna et al. 2010, Van der Auwera et al. 2013) IndelRealigner and BaseRecalibrator applied to correct indel-based misalignments and increase accuracy/dispersion of individual base quality scores.
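As an illustration of the first-pass alignment step, the following Python sketch assembles a `bwa mem` command line for single-end or paired-end reads. The file names, sample name, thread count, and read group fields are placeholders, and the helper `bwa_mem_command` is hypothetical; the options actually used for a given project may differ.

```python
def bwa_mem_command(reference, reads1, reads2=None, sample="sample1", threads=4):
    """Build an argument list for `bwa mem` (written to SAM on stdout).

    -t sets the thread count and -R the read group header line; bwa
    expands the literal "\\t" sequences in the -R string into tabs.
    """
    read_group = "@RG\\tID:{0}\\tSM:{0}\\tPL:ILLUMINA".format(sample)
    cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group, reference, reads1]
    if reads2 is not None:  # paired-end: mates go in a second FASTQ file
        cmd.append(reads2)
    return cmd


# Placeholder file names, for illustration only.
cmd = bwa_mem_command("ref.fa", "s1_R1.fastq", "s1_R2.fastq", sample="s1")
```

The list can be passed to `subprocess.run(cmd, stdout=sam_file, check=True)` with `sam_file` opened for writing. A read group is included because downstream GATK tools expect one on every read.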
3a. Variant Calling Option 1: GATK
Genome Analysis Toolkit (GATK) used to call SNPs and indels according to best practices recommended by the Broad Institute.
- Deliverables:
- individual sample vcf files output by HaplotypeCaller
- regenotyped and recalibrated merged vcf file output by GenotypeGVCFs
- Tools Used (GATK):
- HaplotypeCaller: reassembles "active regions" and applies PairHMM algorithm to select most likely genotype
- GenotypeGVCFs: jointly re-genotypes, re-annotates and merges individual sample gVCFs from HaplotypeCaller into single aggregated vcf file
- VariantRecalibrator: recalibrates variant call probabilities based on call annotations
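The per-sample calling and joint genotyping steps can be sketched as two command lines. The Python below only builds the argument lists. It assumes GATK 3.x command-line syntax (the release series that also ships IndelRealigner); the jar path and file names are placeholders, the helper names are hypothetical, and VariantRecalibrator is omitted because its resource arguments are project-specific.

```python
def haplotypecaller_command(reference, bam, gvcf, gatk_jar="GenomeAnalysisTK.jar"):
    """Per-sample calling in gVCF mode (GATK 3.x syntax assumed)."""
    return ["java", "-jar", gatk_jar, "-T", "HaplotypeCaller",
            "-R", reference, "-I", bam,
            "--emitRefConfidence", "GVCF", "-o", gvcf]


def genotypegvcfs_command(reference, gvcfs, out_vcf, gatk_jar="GenomeAnalysisTK.jar"):
    """Joint genotyping of per-sample gVCFs into one VCF (GATK 3.x syntax assumed)."""
    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:  # one --variant argument per sample
        cmd += ["--variant", gvcf]
    return cmd + ["-o", out_vcf]


# Placeholder sample names, for illustration only.
samples = ["s1", "s2"]
call_cmds = [haplotypecaller_command("ref.fa", s + ".bam", s + ".g.vcf") for s in samples]
joint_cmd = genotypegvcfs_command("ref.fa", [s + ".g.vcf" for s in samples], "merged.vcf")
```

Calling each sample separately and genotyping jointly afterwards means a newly added sample only needs its own HaplotypeCaller run before the merged vcf is regenerated.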
3b. Variant Calling Option 2: Somatic Mutation Identification
MuTect and MutSig from the Broad Institute are available for calling somatic mutations and assessing their significance; other methods may be available upon request.
- Deliverables:
- MuTect and MutSig output files.
- Tools Used:
- MuTect: (Cibulskis et al. 2013) identifies somatic point mutations based on two Bayesian classifiers:
- tumor LOD: log odds of the observed tumor data given a mutant site versus given a reference site,
- normal LOD: log odds of the observed normal data given a reference site versus given a mutant site.
- MutSig: (Lawrence et al. 2013) assesses significance of mutation calls using null model based on background mutation processes.
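The two MuTect LOD scores can be sketched in a few lines of Python. This is a simplified reading of the published model, not MuTect's code: each base is treated as an independent observation with error probability e, the tumor allele fraction is estimated as the observed fraction of alternate reads, the normal is compared against a heterozygous site (allele fraction 0.5), and all filters and decision thresholds are left out. Function names are hypothetical.

```python
import math


def log10_likelihood(bases, errors, ref, alt, f):
    """log10 likelihood of the pileup if `alt` is present at allele fraction f.

    Each error probability must be greater than 0 so that no term becomes log10(0).
    """
    total = 0.0
    for base, e in zip(bases, errors):
        if base == ref:
            p = f * e / 3 + (1 - f) * (1 - e)
        elif base == alt:
            p = f * (1 - e) + (1 - f) * e / 3
        else:  # a third allele can only arise from a sequencing error
            p = e / 3
        total += math.log10(p)
    return total


def tumor_lod(bases, errors, ref, alt):
    """Mutant model (f = observed alt fraction) versus reference model (f = 0)."""
    f = sum(b == alt for b in bases) / len(bases)
    return (log10_likelihood(bases, errors, ref, alt, f)
            - log10_likelihood(bases, errors, ref, alt, 0.0))


def normal_lod(bases, errors, ref, alt):
    """Reference model (f = 0) versus heterozygous germline model (f = 0.5)."""
    return (log10_likelihood(bases, errors, ref, alt, 0.0)
            - log10_likelihood(bases, errors, ref, alt, 0.5))


# Toy pileups: tumor has 10 ref and 10 alt reads, normal has 20 ref reads.
t_lod = tumor_lod("A" * 10 + "T" * 10, [0.001] * 20, ref="A", alt="T")
n_lod = normal_lod("A" * 20, [0.001] * 20, ref="A", alt="T")
```

A site is a somatic candidate only when both scores are high: the tumor data must favor the mutant model and the normal data must favor the reference model.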
4. Annotation
Further annotation of variant calls may be provided using ANNOVAR.
- Deliverables:
- ANNOVAR output in tabular format (plain text, CSV, or Excel, as desired).
- Tools Used:
- ANNOVAR: (Wang et al. 2010) provides functional annotation of genetic variation encompassing multiple modalities (e.g., gene and region annotation and/or filtration based on established data sets).
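For the choice of delivery format, a tab-delimited annotation table can be converted to CSV with Python's standard `csv` module. The sketch below assumes plain tab-separated text with a header row; the column names and values in the example are made up, and the helper name `tsv_to_csv` is hypothetical.

```python
import csv
from io import StringIO


def tsv_to_csv(tsv_text):
    """Convert tab-delimited text to CSV text, quoting fields where needed."""
    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(StringIO(tsv_text), delimiter="\t"):
        writer.writerow(row)
    return out.getvalue()


# Made-up rows; the last field contains a comma and must be quoted in CSV.
table = "Chr\tStart\tGene\nchr1\t100\tGENE1,GENE2\n"
converted = tsv_to_csv(table)
```

Going through a CSV writer rather than replacing tabs with commas matters here because annotation fields often contain commas themselves.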