process 2bRAD

Get 2bRAD scripts:

Get 2bRAD denovo from github

#download the scripts from github
cdh
git clone https://github.com/z0on/2bRAD_denovo.git

#add the path to 2bRAD to your PATH variable
cd 2bRAD_denovo
2bRAD_PATH=$(pwd)
export PATH="$PATH:~/2bRAD_denovo/"

#test you have access
cds
vcf2map.pl -h
#if you get a help menus it worked

Check the reads

We sequenced four different samples with single-end reads across two lanes on an Illumina Hiseq.

solution

#navigate to the directory
cd process_2bRAD


#look at our reads
ls *fastq.gz


#decompress them
gunzip *.gz


#We have two groups, A and B, that are labeled with
#barcode sequences CTGCAG and GAAGTT respectively.
#These groups represent pools of samples that all had
#the same barcode (attached during PCR step in library prep)
#Each group was has two files labeled with L004 and L005 indicating lanes 4 and 5 from the sequencing run
#concatenate the lane 4 and lane 5 data for each pool
cat A_CTGCAG_L004_R1_001.fastq A_CTGCAG_L005_R1_001.fastq > A_CTGCAG_R1.fastq
cat B_GAAGTT_L004_R1_001.fastq B_GAAGTT_L005_R1_001.fastq > B_GAAGTT_R1.fastq


#Look at the barcode data to get an idea of what these fastq files are
#This is the same information that would be uploaded to GSAF when the
#samples were submitted for sequencing.
less barcode_data.tsv

Compare this with expected final product from library prep:

Within each group (A or B) all samples had the same PCR barcode which appears in the file name (6 bases long).

The PCR barcode was read by the indexing primer and will not show up in the reads themselves. GSAF already used this barcode to demultiplex the reads for you into the groups A and B.

The individual samples in these groups have different ligation barcodes (4 bases long), which will appear within the sequencing reads. These ligation barcodes will be used to demultiplex the pooled fastq files for groups A and B into fastqs for the individual samples.

bcgl cut site:

Check the read content

The vast majority of our reads should have the bcgI recognition site.

How can we check this?

solution

#This library was prepared with bcgl restriction enzyme
#Check that the cut site appears in the reads
#How many reads do we have?
expr $(cat A_CTGCAG_R1.fastq | wc -l) / 4
expr $(cat B_GAAGTT_R1.fastq | wc -l) / 4
    #670378
	#649445
 
#(Note these are small for demo purposes)

#How many of the reads have the cut site?
grep "CGA......TGC" A_CTGCAG_R1.fastq | wc -l
grep "CGA......TGC" B_GAAGTT_R1.fastq | wc -l
    #342813
	#333921
  
#3429158/670378 = 0.51

Why is the cut site only in half our reads?

solution

#In the 2bRAD library prep the adapters can be ligated to the fragment in either
#fragment in either orientation (note the mirrored sticky ends left by bcgl).
#So the genomic fragment may have been read from either direction
  
#check for reverse complement of cut site
grep "GCA......TCG" A_CTGCAG_R1.fastq | wc -l
grep "GCA......TCG" B_GAAGTT_R1.fastq | wc -l
    #331339
	#319385

#(Note these are doctored for demo and the 100% is not quite realistic
#(95% would be fine too)

Demultiplexing

#The command to run trim2bRAD_2barcodes_dedup.pl is complicated, so
#another script -- 2bRAD_trim_launch_dedup.pl -- is used to generate the command for us:
  
2bRAD_trim_launch_dedup.pl R1.fastq > dedupCommands


#look at the commands file
cat   dedupCommands

#returns our next commands to execute:
trim2bRAD_2barcodes_dedup.pl input=A_CTGCAG_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100
trim2bRAD_2barcodes_dedup.pl input=B_GAAGTT_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100