Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Processing ddRAD Quality check the reads

Navigate to exercise location

...

Investigate reads. We paid the sequencing facility for 50,000 paired end reads.

1. How can be confirm we got this?

...

Paired end read files should have matching names read for read.

2. Double-check this is case for your reads

...

We expect that the vast majority of our reads should have the recognition sequence in them.

3. How can we check that our reads largely fit this expectation?

...

Code Block
titlecutadapt and re-check quality
collapsetrue
#run cutadapt without any adapter sequences to cut, but with a PHRED quality cutoff of 30
#(a PHRED score of 30 indicates an expected error rate of 1/1000)
cutadapt --pair-filter=any -q 30 -o Lib01_R1_qced.fastq -p Lib01_R2_qced.fastq Lib01_R1.fastq Lib01_R2.fastq


#now rerun FastQC to see how it worked
mkdir qced_Fastqc_Restults
fastqc -o qced_Fastqc_Restults/ -f fastq Lib01_R*_qced.fastq

Demultiplexing

Now demultiplex the reads. We will do this with process_radtags from the packages STACKS.

Code Block
titledemultiplexing
collapsetrue
#look at the set of barcodes included for our samples
cat barcodes_Lib1.tsv 


#check the reads
grep CATG Lib01_R1.fastq | head


#check where these are in our reads
grep "^TCGAT" Lib01_R1.fastq | wc -l
grep "^CGATC" Lib01_R1.fastq | wc -l
grep "^AACCA" Lib01_R1.fastq | wc -l
grep "^GCATG" Lib01_R1.fastq | wc -l

#make a directory to put the resulting sample fastq files into
mkdir sample_fastqs
 
#look at the documentation for process_radtags
./process_radtags -h
  
#execute the command to process the rad data
./process_radtags -i 'fastq' -1 Lib01_R1.fastq -2 Lib01_R2.fastq -o ./sample_fastqs/ -b barcodes_Lib1.tsv --inline_index -e 'nlaIII' -r --disable_rad_check

From our barcode file, we expected to have four barcoded samples in our library.

Did we get all of our samples back?

Code Block
titlesolution
collapsetrue
#move to the output directory we used when demultiplexing
cd sample_fastqs/


#look at size of our output files
ls -lh *.fq


#remove the empty *.rem.* files
rm *.rem.*


#check the number of paired end files we have
ls *.1.fq | wc -l
ls *.2.fq | wc -l


#Are the paired end files for each sample the same size?
wc -l *.fq

Mapping

Now we're ready to map the reads to a reference genome

Code Block
titleMapping
collapsetrue
#check out our reference
less stickleback_chrom3.fasta 


#how many sequences are there?
grep "^>" stickleback_chrom3.fasta | wc -l


#index the reference
module load bwa
bwa index stickleback_chrom3.fasta


#run mapping
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_AACCA-AGCGAC.1.fq sample_fastqs/sample_AACCA-AGCGAC.2.fq > sampleA.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_CGATC-AGCGAC.1.fq sample_fastqs/sample_CGATC-AGCGAC.2.fq > sampleB.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_GCATG-AGCGAC.1.fq sample_fastqs/sample_GCATG-AGCGAC.2.fq > sampleC.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_TCGAT-AGCGAC.1.fq sample_fastqs/sample_TCGAT-AGCGAC.2.fq > sampleD.sam


#assess mapping efficiency
module load samtools
samtools flagstat sampleA.sam > flagStatRes_sampleA.txt
samtools flagstat sampleB.sam > flagStatRes_sampleB.txt
samtools flagstat sampleC.sam > flagStatRes_sampleC.txt
samtools flagstat sampleD.sam > flagStatRes_sampleD.txt


#these results are not spectacular.
#80% mapping efficiency is OK, but the low rate of properly paired reads is dissapointing.


#convert the sam files to bam files
samtools sort sampleA.sam -O BAM -o sampleA.bam
samtools sort sampleB.sam -O BAM -o sampleB.bam
samtools sort sampleC.sam -O BAM -o sampleC.bam
samtools sort sampleD.sam -O BAM -o sampleD.bam

Genotyping

Now genotype the samples with mpileup.

Code Block
titlempileup
collapsetrue
#index the reference for samtools
module load samtools
samtools faidx stickleback_chrom3.fasta
 
#make a list of the bam files
ls *.bam > my_bamfiles.txt
 
#run mpileup
samtools mpileup -f stickleback_chrom3.fasta -u -b my_bamfiles.txt > mpileup_results.bcf


#now call genotypes from the mpileup results
bcftools call -vmO v -o raw_calls.vcf mpileup_results.bcf

Quality filter the variant calls:

Code Block
titlequality filtering
collapsetrue
#how many raw genotyped sites to be have?
vcftools --vcf raw_calls.vcf


      #After filtering, kept 4 out of 4 Individuals
      #After filtering, kept 14559 out of a possible 14559 Sites


#now perform basic quality filtering on raw calls
bcftools filter --exclude 'QUAL < 30' raw_calls.vcf | bcftools view > filt0.vcf
 
#check the new quality checked vcf
vcftools --vcf filt0.vcf


#Additionally, you can filter on representation among samples, allele frequency, and other options with VCFtools


#Filter for biallelic sites variants 
#genotyped in 75% of samples,
#with at least two reads covering the site


vcftools --vcf filt0.vcf --max-alleles 2 --max-missing 0.75 --minDP 2 --recode --out filt2

Complete.