
Overview


Learning Objectives

This tutorial introduces you to long read fastq files generated from Oxford Nanopore data, and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:

  1. Be able to make high level comparisons of long read and short read data
  2. Be able to interrogate long read data
  3. Trim data (as appropriate)
  4. Filter data (as appropriate)

Conda

Depending on which optional tutorials you have worked through, you may have created several different environments that contain several of the programs used later in this tutorial. Since environments are built around programs, not around data, it should make sense that you can use the same environments you created for short reads with long reads. This does not mean that the "best" tool for short reads is the "best" tool for long reads, or even operational with long reads, and vice versa. This tutorial is written assuming you create the following new environment; you can of course skip this and swap in your own environments as needed for whichever parts of the tutorial you want to try yourself.

conda create --name GVA-ReadQC-long -c conda-forge -c bioconda fastqc seqfu cutadapt porechop filtlong
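Once the environment is created, remember to activate it before running any of the tools below:

```shell
# Activate the environment created above so its programs are on your PATH
conda activate GVA-ReadQC-long
```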


Get some data

Possible errors on idev nodes

As mentioned throughout the course, you cannot copy from the BioITeam (because it is on corral-repl) while on an idev node. Log out of your idev session, then copy the files.

 There is a set of reads in the $BI/gva_course/nanopore/raw_reads/ directory; download them to a new folder located at $SCRATCH/GVA_nanopore/raw_reads. Click to expand example commands
Recursively copy entire directory
mkdir $SCRATCH/GVA_nanopore
cp -r $BI/gva_course/nanopore/raw_reads/ $SCRATCH/GVA_nanopore/raw_reads
Use wildcards to copy entire directory contents
mkdir -p $SCRATCH/GVA_nanopore/raw_reads
cp $BI/gva_course/nanopore/raw_reads/* $SCRATCH/GVA_nanopore/raw_reads
 

The choice between the two boxes is a question of style and preference; as you continue on, you may find yourself favoring one over the other.

The files downloaded are from a single sample and are the raw files produced after on-instrument basecalling on a Mk1C instrument. In the next section we will begin to interrogate them.

Basic data handling


 How many files were downloaded? (use 'ls' command)

34


You can tell this by

  1. looking at the largest numbered file name (FAL60196_pass_barcode01_6e61df78_46d0b948_33.fastq.gz) and noticing that numbering starts with "0"
  2. using a command like: ls | wc -l


 How many reads are in each file? (use the "zgrep" command, or decompress the file and use commands such as "wc -l")

4,000 OR 952

A command might look like: zgrep -c "^+$"  *.gz

The highest numbered file has a read count of 952, while the rest contain exactly 4,000 reads each. This pattern may:

  1. make you suspicious, when working with data from another source, that you are missing reads if every file's count is an exact multiple of 4,000
  2. be useful as a way of filtering reads (discussed later in the tutorial)
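The reason zgrep -c "^+$" works: each fastq record is exactly four lines, and the third line is a lone "+" separator, so counting those lines counts records (note this can overcount in rare cases where a quality line happens to consist of a single "+"). A toy example with two made-up reads demonstrates the idea:

```shell
# Build a tiny two-record fastq file, compress it, and count its records
printf '@read1\nACGT\n+\nFFFF\n@read2\nTTGG\n+\nFFFF\n' | gzip > toy.fastq.gz
zgrep -c '^+$' toy.fastq.gz
```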

This may be the only place in the course where file compression options are explicitly discussed. There are multiple ways that files can be compressed, but throughout the course we only work with gzip. This lets us take advantage of zgrep (as above) and also of the ability to quickly concatenate gzipped files. You may be able to think of uses for this with paired end reads if you have sequenced the same sample on multiple runs. Be careful when doing so: most programs that use paired end information only compare the two files line by line, rather than actually checking the header information for pairs, so both files must be concatenated in the same order. Other compression programs (such as zip) do not offer this functionality.

In the case of long read sequencing, we are working with single end reads, and while there may be reasons to keep these partial files, quality assessment is more logical when done on the entire data set. Because the reads are single ended, the order in which the files appear in the combined file does not matter.

A quick reminder on interrupting the console

Recall that "control + c" will stop whatever command you are currently running. This is mentioned here to highlight the importance of the ">" mark in the next code block: without the redirect, cat would stream the combined contents to your screen instead of to a file.



cd $SCRATCH/GVA_nanopore/
cat raw_reads/*.gz > barcode01.combined.fastq.gz
 How many reads are in the combined file? (use the "zgrep" command, or decompress the file and use commands such as "wc -l")

132,952

A command might look like: zgrep -c "^+$"  barcode01.combined.fastq.gz

Note that 33 files of 4,000 reads plus the final 952-read file gives 132,952 reads, confirming that no reads were lost when the files were combined.
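The total can be sanity-checked with shell arithmetic: 33 full files of 4,000 reads plus the one partial file of 952 reads.

```shell
# 33 files x 4,000 reads, plus the 952-read partial file
echo $((33 * 4000 + 952))
```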

Quality Assessment 

Just like when you were first introduced to short read fastq files, it is very common to want a quick first impression of the data you are working with. Again we will use FastQC, but also SeqFu, which reports additional metrics for our file.

FastQC
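A minimal sketch of the FastQC step, assuming the GVA-ReadQC-long environment is active and you are in $SCRATCH/GVA_nanopore; by default FastQC writes an HTML report and a zip archive next to the input file:

```shell
# Run FastQC on the combined long read file; produces an HTML report
# (and a .zip of the underlying data) in the current directory by default
fastqc barcode01.combined.fastq.gz
```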



SeqFu
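A sketch of the SeqFu step, assuming the same environment. The stats subcommand reports read count, total bases, N50, and min/max/average read lengths, which are far more informative for variable-length long reads than a fixed-length Illumina summary:

```shell
# Summary statistics (read count, total bp, N50, length range) for the combined file
seqfu stats barcode01.combined.fastq.gz
```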

Quality Control

Adapter Trimming
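Of the tools installed above, porechop is the one designed for Oxford Nanopore adapters; it searches for known adapter sets automatically, so unlike cutadapt no adapter sequence needs to be supplied. A minimal sketch, assuming the combined file from earlier:

```shell
# Find and trim ONT adapters from read ends; reads with internal adapters
# are split by default
porechop -i barcode01.combined.fastq.gz -o barcode01.trimmed.fastq.gz
```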

Filtering
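filtlong filters long reads by length and quality. A sketch with illustrative thresholds (not recommendations for this particular data set):

```shell
# Drop reads shorter than 1 kb, then keep the best 90% of remaining bases
# ranked by quality; filtlong writes fastq to stdout, so recompress it
filtlong --min_length 1000 --keep_percent 90 barcode01.combined.fastq.gz | gzip > barcode01.filtered.fastq.gz
```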
