Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
titlesolution
collapsetrue
#navigate to the directory
cd process_2bRAD


#look at our reads
ls *fastq.gz


#decompress them
gunzip *.gz


#We have two groups, A and B, that are labeled with
#barcode sequences CTGCAG and GAAGTT respectively.
#These groups represent pools of samples that all had
#the same barcode (attached during PCR step in library prep)
#Each group was has two files labeled with L004 and L005 indicating lanes 4 and 5 from the sequencing run
#concatenate the lane 4 and lane 5 data for each pool
cat A_CTGCAG_L004_R1_001.fastq A_CTGCAG_L005_R1_001.fastq > A_CTGCAG_R1cat.fastq
cat B_GAAGTT_L004_R1_001.fastq B_GAAGTT_L005_R1_001.fastq > B_GAAGTT_R1cat.fastq


#Look at the barcode data to get an idea of what these fastq files are
#This is the same information that would be uploaded to GSAF when the
#samples were submitted for sequencing.
less barcode_data.tsv

...

Code Block
titlesolution
collapsetrue
#This library was prepared with bcgl restriction enzyme
#Check that the cut site appears in the reads
#How many reads do we have?
expr $(cat A_CTGCAG_R1cat.fastq | wc -l) / 4
expr $(cat B_GAAGTT_R1cat.fastq | wc -l) / 4
    #670378#426295
	#649445#444053
 
#(Note these are small for demo purposes)

#How many of the reads have the cut site?
grep "CGA......TGC" A_CTGCAG_R1cat.fastq | wc -l
grep "CGA......TGC" B_GAAGTT_R1cat.fastq | wc -l
    #342813#216159
	#333921#220686
  
#3429158#216159/670378426295 = 0.51
#220686/444053 = 0.50

Why is the cut site only in half our reads?

Code Block
titlesolution
collapsetrue
#In the 2bRAD library prep the adapters can be ligated to the fragment in either
#fragment in either orientation (note the mirrored sticky ends left by bcgl).
#So the genomic fragment may have been read from either direction
  
#check for reverse complement of cut site
grep "GCA......TCG" A_CTGCAG_R1cat.fastq | wc -l
grep "GCA......TCG" B_GAAGTT_R1cat.fastq | wc -l
    #331339#201574
	#319385#202908

#(Note216159 these are doctored for demo and the 100% is not quite realistic
#(95% would be fine too)+ 201574)/426295 = 0.98
#(220686 + 202908)/444053 = 0.95

Demultiplexing

Code Block
collapsetrue
#The command to run trim2bRAD_2barcodes_dedup.pl is complicated, so
#another script -- 2bRAD_trim_launch_dedup.pl -- is used to generate the command for us:
  
2bRAD_trim_launch_dedup.pl R1cat.fastq > dedupCommands


#look at the commands file
cat   dedupCommands

#returns our next commands to execute:
trim2bRAD_2barcodes_dedup.pl input=A_CTGCAG_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100
trim2bRAD_2barcodes_dedup.pl input=B_GAAGTT_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100

Do we have all the samples we expect?

Code Block
titlesolution
collapsetrue
#The resulting files are labled with their barcode combinations and a *tr0 suffix
ls -l A*tr0
ls -l B*tr0
  
#How many did we get?
ls -l A*tr0 | wc -l
ls -l B*tr0 | wc -l
  
#Looking back at our barcodes, does this make sense?
cat barcode_data.tsv


#4 samples total
#two from group A with pcr_barcode CTGCAG
#two from group B with pcr_barcode GAAGTT

Which sample is B_cat_ACCA.tr0?

Code Block
titlesolution
collapsetrue
#search for a line in the barcode file with both the barcodes
grep TGGT barcode_data.tsv | grep GAAGTT
  
    #MCNA4
  
#Why search for TGGT instead of ACCA?
#The ligation barcode on 3Illbc was TGGT, but was then read as ACCA.
#When recording which barcode is which, easier to record the antiBC sequence
#(see 2bRAD library preparation protocol):
https://github.com/z0on/2bRAD_denovo/blob/master/2bRAD_protocol_june1_2018.pdf