MegaMapper is comprised of two pipelines developed by the Megason Lab at Harvard University. The pipelines are HMFseq and BSFseq.

The following instructions detail how to run MegaMapper on TACC from the command line. This is a DRAFT version. This pipeline has not been fully validated. Some of the scripts have file references hard-coded in the source. The pipeline will be refined as demand dictates.

HMFseq

Create a directory for your run. Let's call our directory Sample, so just run
```
mkdir Sample
```
Run fastExon-mpileup to map reads and generate a VCF file. fastExon splits the data into chunks whose size you determine, runs BWA against the reference, and generates a VCF using SamTools mpileup. Right now the code needs to be edited to specify the reference.
```
fastExon-mpileup Sample_R1.fastq Sample_R2.fastq 8000000 3
```
The first number after the FASTQ files specifies how many lines should be in each file chunk (for a FASTQ file, it must be a multiple of 4), the second number determines how many subdirectories the file chunks will be grouped in.

Clean up raw SNPs.

cat Test.snps.vcf | java -jar /work/01863/benni/snpEff_3_1/SnpSift.jar filter "( DP > 2 ) & ( DP < 32 ) & ( QUAL >= 10 ) ! (REF = 'N') ! (ALT = 'N')" > mapping_clean_snps.vcf

If a memory error is thrown, try loading java64 before running SnpSift and increase the memory size:

module load java64
cat Test.snps.vcf | java -Xmx6G -jar /work/01863/benni/snpEff_3_1/SnpSift.jar filter "( DP > 2 ) & ( DP < 32 ) & ( QUAL >= 10 ) ! (REF = 'N') ! (ALT = 'N')" > mapping_clean_snps.vcf

Clean up raw SNPs for mapping.

cat variants.snps.vcf | java -Xmx6G -jar /work/01863/benni/snpEff_3_1/SnpSift.jar filter "( DP > 5 ) & ( DP < 32 ) & ( QUAL >= 10 ) & (DP4[2] > 0) & (DP4[3] > 0) ! (REF = 'N') ! (ALT = 'N')" > mapping_snps.vcf

Remove VCF header.

sed '1,27d' mapping_snps.vcf > mapping_snps_headless.vcf

Extract coverage on mapping SNPs.

python covextract.py -f mapping_snps_headless.vcf > mapping_snps.tab

Fix formatting error in first column. The HMFseq R script expects the first column to be pure numbers.
```
gawk -F=$'\t' 'sub(/variants.*bcf:/, "", $1)' mapping_snps.tab > mapping_snps_nums.tab
```

Run HMFseq R script.

Rscript --vanilla /corral-repl/utexas/BioITeam/bin/MegaMapper/HMFseq_Rscript mapping_snps_nums.tab HMFseq_output.tab HMFseq_genome_plot.pdf HMFseq_chromosome_scan.pdf "<Run_Name>"

(Here <Run_Name> is whatever you want to call this run.)

Run SnpEff on clean SNPs

java -jar /work/01863/benni/snpEff_3_1/snpEff.jar eff -c /work/01863/benni/snpEff_3_1/snpEff.config -i vcf -o txt -upDownStreamLen 10000 -hom -no None,downstream,intergenic,intron,upstream,utr -stats all_SNPeff_output.tab Zv9.68 clean_snps_nums.vcf > snpEff_output.txt

Load bedtools
```
module load bedtools
```
SNP Subtractor to substract dbSNP and WT (wild-type?) SNPs from clean SNPs This step uses a dbSNP library in my own directory, which should probably be moved somewhere more central.
```
intersectBed -v -a clean_snps_nums.vcf -b ../dbSNP/vcf_chr_1-25.vcf > remaining_snps1.vcf
intersectBed -v -a remaining_snps1.vcf -b WT_SNP_set.vcf > remaining_snps2.vcf
```

Reattach a VCF header

python /corral-repl/utexas/BioITeam/bin/MegaMapper/VCFheader_mod.py remaining_snps3.vcf remaining_snps2.vcf

SnpEff again on unique SNPs

java -jar /corral-repl/utexas/BioITeam/bin/snpEff_3_1/snpEff.jar eff -c /corral-repl/utexas/BioITeam/bin/snpEff_3_1/snpEff.config -i vcf -o txt -upDownStreamLen 5000 -hom -no None,downstream,intergenic,intron,upstream,utr -stats unfiltered_stats.tab Zv9.68 remaining_snps3.vcf > unfiltered_candidate_SNPs.tab

Find SNPs in predicted critical interval

Rscript --vanilla /corral-repl/utexas/BioITeam/bin/MegaMapper/candidator.R unfiltered_candidate_SNPs.tab HMFseq_output.tab candidate_list.tab

The candidate_list.tab file is your output. Congratulations, you're done!

Bioinformatics Team (BioITeam) at the University of Texas

MegaMapper

Analytics

HMFseq