...
This example would work if your foward restriction enzyme is NlaIII, your reverse restriction enzyme is MluCI, and your insert size is between 264 and 336 base pairs.
Does this really work?
This fake reference has 4 RAD fragments, 3 with 100-300 bp insert sizes and 1 more with a 350 bp insert size for NlaIII and MluCI
...
Code Block | ||||
---|---|---|---|---|
| ||||
#note this may not work on a Mac, try on TACC if it doesn't
cd in_silico_digest/
cat test_reference.fasta | tr -d "\n" | grep -o -E "CATG.{100,300}AATT" | wc -l
cat test_reference.fasta | tr -d "\n" | grep -o -E "CATG.{100,350}AATT" | wc -l |
...
Exercise:
Imagine you wanted to do ddRAD on the coral Acropora hyacinthus
...
Your lab has the restriction enzymes Sphl and MspI on hand and you want to use those as your forward and reverse cutter respectively.
You are sequencing three sample populations and based on power analyses you have decided you'll need sample population sizes of n=20 (N=60 total samples).
Finally, you have decided to do 150 bp paired end sequencing on the Illumina Hiseq 4000 and have exactly $3,239 dollar to dedicate to sequencing.
Assuming you will get 240 million reads from a single lane of an Illumina Hiseq 4000 run, an a minimum insert size of 300, what should you pick your maximum insert size as?
How would you use this to get an idea of how many loci you are likely to find.
- Search NCBI for a reference genome
- Download it
- Search for your specific restriction search parameters
...
Solution:
Code Block | ||||
---|---|---|---|---|
| ||||
#start an idev session if you have not idev #make a directory to do the analysis in cds mkdir rad_intro cd rad_intro mkdir in_silico_digest cd in_silico_digest #download the reference from NCBI wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/222/465/GCF_000222465.1_Adig_1.1/GCF_000222465.1_Adig_1.1_genomic.fna.gz #decompress it gunzip GCF_000222465.1_Adig_1.1_genomic.fna.gz #check the command on some different max insert sizes cat GCF_000222465.1_Adig_1.1_genomic.fna | tr -d "\n" | grep -o -E "GCATGC.{300,400}CCGG" | wc -l cat GCF_000222465.1_Adig_1.1_genomic.fna | tr -d "\n" | grep -o -E "GCATGC.{300,500}CCGG" | wc -l cat GCF_000222465.1_Adig_1.1_genomic.fna | tr -d "\n" | grep -o -E "GCATGC.{300,600}CCGG" | wc -l #use #This can be done in a loop: to run command#set forup athe seriessizes ofyou maximumwant insertto sizescheck seq 400 100 1000 > maxInserts.txt less maxInserts.txt #build a header for your output echo -e "insert_range\tNsites" > prediction_results.txt #loop through your max insert sizes to get predictions for the number of sites while read max; do nSites=$(cat GCF_000222465.1_Adig_1.1_genomic.fna | tr -d "\n" | grep -o -E "GCATGC.{300,${max}}CCGG" | wc -l); echo -e "300-${max}\t${nSites}" >> prediction_results.txt; done < maxInserts.txt #look at the results cat prediction_results.txt |
#These values are somewhat low #What about a four-nucleotide restriction enzyme? #for NlaIII cat GCF_000222465.1_Adig_1.1_genomic.fna | tr -d "\n" | grep -o -E "CATG.{300,400}CCGG" | wc -l |