...
Principal Component Analysis
Use On personal computer, use the R script snp_pca.R to do PCA on the data. What's going on with chromosome 12?
...
Code Block | ||||
---|---|---|---|---|
| ||||
#(back on TACC)
#First choose a window size and step size (here 100Kb)
WINDOW_SIZE=100000
WINDOW_STEP=100000
#Run for chromosome 1
vcftools --vcf chrI.vcf \
--weir-fst-pop Pun_males.txt \
--weir-fst-pop Pun_females.txt \
--fst-window-size $WINDOW_SIZE \
--fst-window-step $WINDOW_STEP \
--out chrI_female_v_male
#chromosome 2
vcftools --vcf chrII.vcf \
--weir-fst-pop Pun_males.txt \
--weir-fst-pop Pun_females.txt \
--fst-window-size $WINDOW_SIZE \
--fst-window-step $WINDOW_STEP \
--out chrII_female_v_male
#chromosome12
vcftools --vcf chrXII.vcf \
--weir-fst-pop Pun_males.txt \
--weir-fst-pop Pun_females.txt \
--fst-window-size $WINDOW_SIZE \
--fst-window-step $WINDOW_STEP \
--out chrXII_female_v_male
#look at result files
ls -l *windowed.weir.fst
#send these to pungitius_exercise/ on personal computer to plot with R |
...
Code Block | ||||
---|---|---|---|---|
| ||||
#change to directory with species-separated vcfs
cd species_separated
#select window size and step (100Kb)
WINDOW_SIZE=100000
WINDOW_STEP=100000
#loop through vcfs and build commands for running --window-pi
>piCommands
for vcfFile in *.recode.vcf
do echo "vcftools --vcf $vcfFile --window-pi $WINDOW_SIZE --window-pi-step $WINDOW_STEP --out ${vcfFile/.recode.vcf} &" >> piCommands
done
#execute them
#look at results files
ls -l *windowed.pi
#send to pungitius_exercise/ on personal computer to plot with |
Plot the resulting *weir.fst and *.pi results with plot_window_stats.R
A pseudoautosomal region (PAR) is a region of a sex chromosome that still undergoes recombination.
...
Code Block | ||||
---|---|---|---|---|
| ||||
probably at about 3 Mb |
Admixture
Run admixture on chromosome 1 and also the PAR and the divergent region of chr12
(based on the Fst plot, let's use these boundaries: PAR: 0 - 2Mb, divergent: 4 - 17 Mb)
Code Block | ||||
---|---|---|---|---|
| ||||
#first switch to integers, because plink doesn't like roman numerals
sed 's/chrI/1/' chrI.vcf > chr1.vcf
sed 's/chrXII/12/' chrXII.vcf > chr12.vcf
#subset for the PAR and divergent region
vcftools --vcf chr12.vcf --chr 12 --from-bp 0 --to-bp 2000000 --recode --out PAR
vcftools --vcf chr12.vcf --chr 12 --from-bp 4000000 --to-bp 17000000 --recode --out SDR
#convert to bed files with plink
plink2 --vcf chr1.vcf --make-bed --out chr1 --allow-extra-chr
plink2 --vcf PAR.recode.vcf --make-bed --out PAR --allow-extra-chr
plink2 --vcf SDR.recode.vcf --make-bed --out SDR --allow-extra-chr
#run admixture
admixture chr1.bed 3
admixture PAR.bed 3
admixture SDR.bed 3
#send the *.Q files to personal computer to plot with plot_pungitius_admixture.R |
Phylogenetics
Code Block |
---|
(on TACC)
cd phylogenetics |
Build gene trees from the PAR and from the divergent regions of chr12
Where did the Y chromosome come from?
Code Block | ||||
---|---|---|---|---|
| ||||
#subset the vcf for the two regions
vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 0 --to-bp 2000000 --recode --out PAR
vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 4000000 --to-bp 17000000 --recode --out SDR
#use vcf-kit to build trees
vk phylo tree upgma PAR.recode.vcf > PAR.newick
vk phylo tree upgma SDR.recode.vcf > SDR.newick
#look at the trees on TACC with view_newick.py
view_newick.py PAR.newick
view_newick.py SDR.newick |
For a nicer tree viewer, Dendroscope works well.
Build gene trees in windows along chromosome
Code Block | ||||
---|---|---|---|---|
| ||||
#set up boundaries
seq 0 250000 19750000 > lefts
seq 250000 250000 20000000 > rights
paste lefts rights > bounds.txt
#now build commands to subset the vcf with VCFtools, then pipe the subset into VCF-kit for tree building
awk '{print "vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp "$1 " --to-bp "$2" --recode -c | vk phylo tree upgma - > window_"$1"-"$2".newick"}' bounds.txt > treeBuildCommands
#check the commands (should have 80 lines)
head treeBuildCommands
wc -l treeBuildCommands
#Double-check you are in an idev session
#If you are in an idev session, and had 80 lines, print to screen and execute
sh treeBuildCommands
#check that you have all trees
ls window*.newick | wc -l
#compile the resulting trees into a single file
#they need to be in order, so we can just simply cat them
#read bounds.txt again to cat them in order
echo ">allTrees.newick" >catTrees
while read bound
do left=$(echo $bound | cut -d " " -f 1)
right=$(echo $bound | cut -d " " -f 2)
echo "cat window_${left}-${right}.newick | tr -d '\n' >> allTrees.newick && echo '' >>allTrees.newick" >> catTrees
done<bounds.txt
#execute the catting commands
sh catTrees
#check that the input is correct size (should be 80)
wc -l allTrees.newick
#now run Twisst on them
~/twisst/twisst.py -t allTrees.newick\
-w topology_weights.txt\
-g Pun_X -g Pun_Y -g Sin -g Tym\
--groupsFile groups_file.txt\
--method threshold\
--thresholdTable ~/twisst/threshold_tables/binom_threshold_wilson_CI95_0.1.tsv
#This takes a bit.
#Once it's done, format the output
grep "^#" topology_weights.txt > topos.txt
grep -v "^#" topology_weights.txt > weights.txt
#send the results (all end in *.txt) to personal computer to plot with plot_twisst_results.R
|