ddRAD data processing

Quality check the reads

Navigate to exercise location

cd process_ddRAD/

Investigate reads. We paid the sequencing facility for 50,000 paired end reads.

How can be confirm we got this?

solution

#we have two fastq files
ls -lh *.fastq


#take a look at the top of the files
less Lib01_R1.fastq


#how many lines are in each fastq file?
wc -l Lib01*.fastq


#look at how many reads there are (note that fastq generally files have 4 lines per read)
expr $(cat Lib01_R1.fastq | wc -l) / 4
expr $(cat Lib01_R2.fastq | wc -l) / 4

Paired end read files should have matching names read for read.

Double-check this is case for your reads

solution

#this can be checked many ways, here is one option:


#search for lines that start with @ symbol (the read definition lines in our fastqs)
#and save the top 10 of them as a text file.
#Repeat for the R2 reads.
#Then line them up


grep "^@" Lib01_R1.fastq | head > first_10_R1_read_names.txt
grep "^@" Lib01_R2.fastq | head > first_10_R2_read_names.txt
paste first_10_R1_read_names.txt first_10_R2_read_names.txt

What if you wanted to repeat the above solution to check the bottom of the files as well?

solution

grep "^@" Lib01_R1.fastq | tail > last_10_R1_read_names.txt
grep "^@" Lib01_R2.fastq | tail > last_10_R2_read_names.txt
paste last_10_R1_read_names.txt last_10_R2_read_names.txt

Based on the library preparation, we expect that our forward reads were cut with the restriction enzyme nlaIII (NLA3).

We expect that the vast majority of our reads should have the recognition sequence in them.

How can we check that our reads largely fit this expectation?

solution

#Looking up nlaIII, we see it's cut site:
#     CATG'
#    'GTAC
  
#check if we see this cut site in the forward reads
grep CATG Lib01_R1.fastq | wc -l


#compare with total read count
expr $(cat Lib01_R1.fastq | wc -l) / 4

Next run the program FastQC on the reads

FastQC

module load fastqc
mkdir raw_Fastqc_Restults
fastqc -o raw_Fastqc_Restults/ -f fastq Lib01_R*.fastq


#scp the resulting .html files to your personal computer to see

The reads appear to tail off in quality toward the ends. This can be fixed with a program called cutadapt.

cutadapt and re-check quality

#run cutadapt without any adapter sequences to cut, but with a PHRED quality cutoff of 30
#(a PHRED score of 30 indicates an expected error rate of 1/1000)
cutadapt --pair-filter=any -q 30 -o Lib01_R1_qced.fastq -p Lib01_R2_qced.fastq Lib01_R1.fastq Lib01_R2.fastq


#now rerun FastQC to see how it worked
mkdir qced_Fastqc_Restults
fastqc -o qced_Fastqc_Restults/ -f fastq Lib01_R*_qced.fastq

Demultiplexing

Now demultiplex the reads. We will do this with process_radtags from the packages STACKS.

demultiplexing

#look at the set of barcodes included for our samples
less barcodes_Lib1.tsv 


#make a directory to put the resulting sample fastq files into
mkdir sample_fastqs
 
#look at the documentation for process_radtags
./process_radtags -h
  
#execute the command to process the rad data
./process_radtags -i 'fastq' -1 Lib01_R1.fastq -2 Lib01_R2.fastq -o ./sample_fastqs/ -b barcodes_Lib1.tsv --inline_index -e 'nlaIII' -r --disable_rad_check

From our barcode file, we expected to have four barcoded samples in our library.

Did we get all of our samples back?

solution

#move to the output directory we used when demultiplexing
cd sample_fastqs/


#look at size of our output files
ls -lh *.fq


#remove the empty *.rem.* files
rm *.rem.*


#check the number of paired end files we have
ls *.1.fq | wc -l
ls *.2.fq | wc -l


#Are the paired end files for each sample the same size?
wc -l *.fq

Mapping

Now we're ready to map the reads to a reference genome

Mapping

#check out our reference
less stickleback_chrom3.fasta 


#how many sequences are there?
grep "^>" stickleback_chrom3.fasta | wc -l


#index the reference
module load bwa
bwa index stickleback_chrom3.fasta


#run mapping
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_AACCA-AGCGAC.1.fq sample_fastqs/sample_AACCA-AGCGAC.2.fq > sampleA.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_CGATC-AGCGAC.1.fq sample_fastqs/sample_CGATC-AGCGAC.2.fq > sampleB.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_GCATG-AGCGAC.1.fq sample_fastqs/sample_GCATG-AGCGAC.2.fq > sampleC.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_TCGAT-AGCGAC.1.fq sample_fastqs/sample_TCGAT-AGCGAC.2.fq > sampleD.sam


#assess mapping efficiency
samtools flagstat sampleA.sam
samtools flagstat sampleB.sam
samtools flagstat sampleC.sam
samtools flagstat sampleD.sam


#these results are not spectacular.
#80% mapping efficiency is OK, but the low rate of properly paired reads is disapointing.


#convert the sam files to bam files
samtools sort sampleA.sam -O SAM -o sampleA.bam
samtools sort sampleB.sam -O SAM -o sampleB.bam
samtools sort sampleC.sam -O SAM -o sampleC.bam
samtools sort sampleD.sam -O SAM -o sampleD.bam