Overview
Learning Objectives
This tutorial introduces you to long read fastq files generated from Oxford Nanopore data and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:
- Be able to make high level comparisons of long read and short read data
- Be able to interrogate long read data
- Be able to trim data (as appropriate)
- Be able to filter data (as appropriate)
Conda
Depending on which optional tutorials you have worked on, you may have already created several environments that contain several of the programs used later in this tutorial. Because environments are built around programs, not data, you can use the same environments you created for short reads with long reads. This does not mean that the "best" tool for short reads is the "best" tool for long reads, or even that it is "operational" with long reads, and vice versa. This tutorial is written assuming you create the following new environment. You can of course skip this and substitute your own environments for whichever parts of the tutorial you are interested in trying for yourself.
conda create --name GVA-ReadQC-long -c conda-forge -c bioconda fastqc seqfu cutadapt porechop filtlong
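Once the environment exists, it can be useful to activate it and confirm that each program is actually on your PATH before moving on. The following is a minimal sketch: the check_tools helper is a hypothetical name introduced here for illustration, not part of conda or the course materials.

```shell
# Activate the environment created above (only attempted if conda is available).
if command -v conda >/dev/null 2>&1; then
  conda activate GVA-ReadQC-long || echo "could not activate GVA-ReadQC-long" >&2
fi

# Hypothetical helper: print a line for every named tool that is not on PATH.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

# Check the programs listed in the conda create command.
check_tools fastqc seqfu cutadapt porechop filtlong
```

No output from the final line means every program was found. Any "missing:" line suggests the environment was not activated or a package did not install.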
Get some data
Possible errors on idev nodes
As mentioned throughout the course, you cannot copy from the BioITeam (because it is on corral-repl) while on an idev node. Log out of your idev session, then copy the files.
The files downloaded are from a single sample and are the raw files provided after on-instrument basecalling on a Mk1C instrument. In the next section we will begin to interrogate them.
Basic data handling
This may be the only place in the course where file compression options are explicitly discussed. There are multiple ways that files can be compressed, but throughout the course we only work with gzip. This allows us to take advantage of zgrep (as above) and also to quickly combine gzipped files: concatenating two gzipped files produces a valid gzipped file without decompressing either one. You may be able to think of uses for this with paired end reads if you have sequenced the same sample on multiple runs. Be careful when doing so, as most programs that use paired end information only do so by comparing line by line between the paired end files, not by actually checking the header information for pairs, so both files of a pair must be combined in the same order. Not every compression format offers this functionality; zip archives, for example, cannot be combined by simple concatenation.
In the case of long read sequencing, there may be reasons to keep these partial files, but quality assessment is more logical if done on the entire data set. Additionally, long read sequencing is single ended, so the order in which the files appear in the combined file does not actually matter.
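The gzip behavior described above can be checked on toy data before trusting it with real reads. This is only a sketch: the file names and the two tiny records are made up for illustration and are not part of the course data.

```shell
# Toy demonstration; file names and reads below are hypothetical.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n'   | gzip -c > "$tmp/part1.fastq.gz"
printf '@read2\nTTGCA\n+\nIIIII\n' | gzip -c > "$tmp/part2.fastq.gz"

# Join the compressed files directly, without decompressing them first.
cat "$tmp/part1.fastq.gz" "$tmp/part2.fastq.gz" > "$tmp/combined.fastq.gz"

# Decompress the result: records from both parts appear, in the order joined.
n_lines=$(gzip -cd "$tmp/combined.fastq.gz" | awk 'END { print NR }')
echo "lines in combined file: $n_lines"
gzip -cd "$tmp/combined.fastq.gz" | awk 'NR % 4 == 1'   # print header lines only
```

For paired end data the same approach applies, provided the R1 and R2 parts are listed in the same order in both cat commands.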
A quick reminder on interrupting console
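Recall that "control + c" will stop whatever command you are currently running. This is mentioned here to highlight the importance of the ">" mark in the next code block. Without the ">", cat prints the compressed contents of every file directly to your terminal rather than writing them to a new file.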
cd $SCRATCH/GVA_nanopore/
cat raw_reads/*.gz > barcode01.combined.fastq.gz
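One way to confirm that nothing was lost when combining is to compare the number of reads in the partial files with the number in the combined file. The sketch below assumes every fastq record occupies exactly 4 lines. The count_reads function is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: count reads in one or more gzipped fastq files.
# Assumes 4 lines per record (header, sequence, "+", qualities).
count_reads() {
  gzip -cd "$@" | awk 'END { print NR / 4 }'
}

# Only attempt the comparison if the tutorial files are present here.
if [ -d raw_reads ] && [ -f barcode01.combined.fastq.gz ]; then
  echo "reads in parts:    $(count_reads raw_reads/*.gz)"
  echo "reads in combined: $(count_reads barcode01.combined.fastq.gz)"
fi
```

The two numbers should match. A mismatch would suggest the combined file was written while files were still being copied, or that the cat command was interrupted.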
Quality Assessment
Just as when you were first introduced to short read fastq files, it is very common to want a quick first impression of the data you are working with. Again, we will use FastQC, along with seqfu, which gives additional metrics for our file.
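A first pass with both programs might look like the sketch below. It assumes the combined file from the previous section and the environment created earlier. The run_qc function and the output directory and file names are hypothetical choices made here for illustration, and each program is skipped with a message if it is not installed.

```shell
# Hypothetical wrapper: run FastQC and seqfu stats on one fastq file.
#   $1 = fastq file, $2 = output directory
run_qc() {
  mkdir -p "$2"
  if command -v fastqc >/dev/null 2>&1; then
    fastqc "$1" -o "$2"                          # writes an HTML report and zip into $2
  else
    echo "fastqc not found, skipping" >&2
  fi
  if command -v seqfu >/dev/null 2>&1; then
    seqfu stats "$1" > "$2/seqfu_stats.txt"      # read count and length summary
  else
    echo "seqfu not found, skipping" >&2
  fi
}

# Only run on the tutorial data if it is present in the current directory.
reads=barcode01.combined.fastq.gz
if [ -f "$reads" ]; then
  run_qc "$reads" qc_raw
fi
```

Keeping both outputs in one directory per data set makes it easier to compare results before and after any trimming or filtering.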