Identification and annotation of SNPs and/or somatic mutations relative to a reference genome. 10-hour minimum ($470 internal, $600 external) per project.
1. Quality Assessment
Quality of data assessed by FastQC and SAMStat; results of quality assessment will be evaluated prior to downstream analysis.
- Deliverables:
- reports generated by FastQC and SAMStat
- hybrid selection metrics calculated using Picard also available
- Tools Used:
- FastQC: (Andrews 2010) used to generate quality summaries of data:
- Per base sequence quality report: useful for deciding if trimming necessary.
- Sequence duplication levels: evaluation of library complexity.
- Overrepresented sequences: evaluation of adapter contamination.
- SAMStat: (Lassmann et al. 2011) provides summary statistics at both FASTQ and SAM/BAM alignment levels.
- Picard CalculateHsMetrics: (http://broadinstitute.github.io/picard) evaluates hybrid selection protocols (target coverage and AT/GC dropout levels).
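The per base sequence quality report described above boils down to averaging Phred scores at each read position. A minimal sketch of that idea follows; it is not FastQC's implementation, and it assumes four-line FASTQ records with Phred+33 encoding. The function name and the example reads are hypothetical.

```python
from io import StringIO


def mean_quality_per_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes four-line FASTQ records and Phred+33 encoding by default.
    """
    totals = []  # sum of scores seen at each position
    counts = []  # number of reads covering each position
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 != 3:  # quality string is the 4th line of a record
            continue
        for pos, char in enumerate(line.rstrip("\r\n")):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two made-up reads of different lengths.
example = StringIO("@read1\nACG\n+\nII5\n@read2\nAC\n+\nI5\n")
print(mean_quality_per_position(example))  # [40.0, 30.0, 20.0]
```

A drop in the mean toward the 3' end of reads is the usual signal that trimming may be worthwhile.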
2. Mapping
Mapping to genome reference using BWA-mem (alternative algorithms available on request).
- Deliverables:
- BAM files from the initial alignment (BWA-mem by default; other algorithms available if desired)
- BAM files resulting from further processing using GATK
- Tools Used:
- BWA-mem: (Li 2013) primary aligner used to generate first pass read alignments (BWA-aln and BWA-sampe also available if desired, as are bowtie/bowtie2).
- GATK: (McKenna et al. 2010, Van der Auwera et al. 2013) IndelRealigner and BaseRecalibrator applied to correct indel-based misalignments and increase accuracy/dispersion of individual base quality scores
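For illustration, the first-pass alignment step could be assembled as a command line along the following lines. This is a sketch only: the file names, sample name, thread count, and the `PL:ILLUMINA` read-group platform are assumptions rather than the settings actually used, and the helper function is hypothetical. The command is built but not executed.

```python
import shlex


def bwa_mem_command(reference, reads_1, reads_2, sample, threads=4):
    """Build an argument list for a paired-end `bwa mem` run.

    The read group is passed with -R; downstream GATK steps expect
    read-group information in the alignment header.
    """
    # bwa expects the literal two characters "\t" between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        "bwa", "mem",
        "-t", str(threads),  # number of threads
        "-R", read_group,    # read-group header line
        reference, reads_1, reads_2,
    ]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("ref.fa", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1")
print(shlex.join(cmd))
```

`bwa mem` writes SAM to standard output, so in practice the output would be redirected or piped to a tool that sorts it and converts it to BAM.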
3a. Variant Calling Option 1: GATK
Genome Analysis Toolkit (GATK) used to call SNPs and indels according to best practices recommended by the Broad Institute.
- Deliverables:
- individual sample vcf files output by HaplotypeCaller
- regenotyped and recalibrated merged vcf file output by GenotypeGVCFs
- Tools Used (GATK):
- HaplotypeCaller: reassembles "active regions" and applies PairHMM algorithm to select most likely genotype
- GenotypeGVCFs: jointly re-genotypes, re-annotates and merges individual sample gVCFs from HaplotypeCaller into single aggregated vcf file
- VariantRecalibrator: recalibrates variant call probabilities based on call annotations
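As a quick check on a delivered vcf file, SNP and indel calls can be tallied from the REF and ALT columns. The sketch below is not part of GATK; it assumes a plain-text, tab-delimited VCF, and the function names and example records are hypothetical.

```python
def classify_alt(ref, alt):
    """Classify one ALT allele relative to REF by allele length."""
    # Symbolic alleles, breakends, spanning deletions, and missing ALTs.
    if alt.startswith("<") or "[" in alt or "]" in alt or alt in ("*", "."):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "indel"


def count_variant_types(vcf_lines):
    """Count ALT alleles by type, skipping header and blank lines."""
    counts = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):       # multi-allelic sites list several ALTs
            kind = classify_alt(ref, alt)
            counts[kind] = counts.get(kind, 0) + 1
    return counts


# Made-up records for illustration.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tT,CAA\t50\tPASS\t.",
]
print(count_variant_types(example_vcf))  # {'SNP': 2, 'indel': 2}
```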
3b. Variant Calling Option 2: Somatic Mutation Identification
MuTect and MutSig from the Broad Institute are available for calling somatic mutations; other methods may be available upon request as well.
- Deliverables:
- MuTect and MutSig output files.
- Tools Used:
- MuTect: (Cibulskis et al. 2013) identifies somatic point mutations based on two Bayesian classifiers:
- LOD for observed tumor data given mutant site compared to observed tumor data given reference site,
- LOD for observed normal data given reference site compared to observed normal data given mutant site.
- MutSig: (Lawrence et al. 2013) assesses significance of mutation calls using null model based on background mutation processes.
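The two MuTect LOD scores listed above can be sketched with a simple per-read likelihood model. This is an illustration based on the model described in Cibulskis et al. (2013), not MuTect's code: it assumes independent reads, a per-base error probability greater than zero derived from base quality, and errors spread evenly over the three other bases. All function names and the example pileups are hypothetical.

```python
import math


def site_log_likelihood(bases, error_probs, ref, alt, f):
    """log10 likelihood of the observed bases given alt allele fraction f."""
    total = 0.0
    for base, e in zip(bases, error_probs):
        if base == ref:
            p = f * e / 3 + (1 - f) * (1 - e)
        elif base == alt:
            p = f * (1 - e) + (1 - f) * e / 3
        else:
            p = e / 3  # neither ref nor alt: must be a sequencing error
        total += math.log10(p)
    return total


def tumor_lod(bases, error_probs, ref, alt):
    """Mutant model (f = observed alt fraction) versus reference model (f = 0)."""
    f_hat = sum(b == alt for b in bases) / len(bases)
    return (site_log_likelihood(bases, error_probs, ref, alt, f_hat)
            - site_log_likelihood(bases, error_probs, ref, alt, 0.0))


def normal_lod(bases, error_probs, ref, alt):
    """Reference model (f = 0) versus heterozygous germline model (f = 0.5)."""
    return (site_log_likelihood(bases, error_probs, ref, alt, 0.0)
            - site_log_likelihood(bases, error_probs, ref, alt, 0.5))


# Made-up pileups: 20 ref + 10 alt reads in tumor, 30 ref reads in normal.
tumor_bases = "A" * 20 + "T" * 10
normal_bases = "A" * 30
errors = [0.001] * 30  # roughly Q30 at every base
print(tumor_lod(tumor_bases, errors, "A", "T"))
print(normal_lod(normal_bases, errors, "A", "T"))
```

A site is treated as a candidate somatic mutation when both scores exceed their thresholds: the tumor data favor the mutant model and the normal data favor the reference model.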
4. Annotation
Further annotation of variant calls may be provided using ANNOVAR.
- Deliverables:
- ANNOVAR output in tabular format (plain text, CSV, or Excel, as desired).
- Tools Used:
- ANNOVAR: (Wang et al. 2010) provides functional annotation of genetic variation encompassing multiple modalities (e.g., gene and region annotation and/or filtration based on established data sets).
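The tabular ANNOVAR output can be filtered and converted with a few lines of standard-library code. The sketch below assumes tab-delimited input with a `Func.refGene` column, as produced when annotating against refGene; the column name, the gene names, and the helper function are assumptions for illustration, and other annotation databases use other column names.

```python
import csv
from io import StringIO


def exonic_rows_to_csv(annovar_tsv, out_handle, func_column="Func.refGene"):
    """Copy rows annotated exactly as 'exonic' to CSV; return the row count."""
    reader = csv.DictReader(annovar_tsv, delimiter="\t")
    writer = csv.DictWriter(out_handle, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[func_column] == "exonic":
            writer.writerow(row)
            kept += 1
    return kept


# Made-up annotation table with hypothetical gene names.
table = StringIO(
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n"
    "chr1\t100\t100\tA\tG\texonic\tGENE1\n"
    "chr1\t200\t200\tC\tT\tintronic\tGENE2\n"
)
out = StringIO()
print(exonic_rows_to_csv(table, out))  # 1
print(out.getvalue())
```

Combined annotations such as `exonic;splicing` are skipped by the exact match; a substring test would be needed to keep them.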