Overview

As discussed throughout the class, long reads have recently become cheaper and more accurate, making them more available and more useful for NGS applications. This tutorial is designed to build on the same principles (though using different tools in some cases) to familiarize yourself with, and improve, a set of long reads.

Learning Objectives

This tutorial introduces you to long read fastq files generated from Oxford Nanopore data, and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:

...

Expand
titleHow many reads are in the combined file? (use the "zgrep" command, or decompress the file and use commands such as "wc -l")

132,952


A command might look like: zgrep -c "^+$" barcode01.combined.fastq.gz
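
If you would rather use the "wc -l" approach mentioned in the question, a minimal sketch (assuming standard 4-line fastq records) might look like:

Code Block
languagebash
titleHypothetical alternative using wc -l
# divide the total line count by 4 to get the read count
echo $(( $(zcat barcode01.combined.fastq.gz | wc -l) / 4 ))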

The highest numbered file has a read count of 952 while the rest have exactly 4,000 reads. This may do the following:

  • make you suspicious, when working with data from another source, that you are missing some reads if the read counts are a multiple of 4,000 (a per-file check is sketched below)
  • be useful as a way of filtering reads (discussed later in the tutorial)
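
If you want to verify the 4,000-read pattern yourself, a minimal sketch looping over the individual gzipped files (assuming they all sit in the current directory) might look like:

Code Block
languagebash
titleCount the reads in every fastq.gz file in the current directory
for f in *.fastq.gz; do
    # "+" separator lines serve as a proxy for the number of records
    echo "$f: $(zgrep -c '^+$' "$f")"
done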

    Quality Assessment 

    Just like when you were first introduced to short read fastq files, it is very common to want to quickly get first impressions of the data you are working with. Again we will use fastQC, but also seqfu, which gives additional metrics for our file.

    fastQC

    Code Block
    titleNotice that even though there are only ~130k reads, fastqc takes much longer to run than it did on our bacterial samples as the total bases in the file are higher
    fastqc barcode01.combined.fastq.gz

    After transferring the barcode01.combined_fastqc.html file back to your computer, and being at least a little familiar with short read data, you may have questions such as:

    1. Why are the quality scores so much lower?
      1. the quality scores associated with nanopore are simply different from those of Illumina, so the "good, ok, bad" notation doesn't help here.
      2. https://pubmed.ncbi.nlm.nih.gov/34597327/ goes into much more detail on this.
    2. Why is per base sequence content so blocky at start before smoothing out and getting blocky near the end again?
      1. Nanopore barcoding is done at the first bp of each read, so if anything it is surprising that the first bases are not at even higher frequency.
    3. Why does the per sequence gc content have a left shoulder?
      1. I expect this is caused by the GC content of the adapter sequence, which is present on all reads, being higher than that of the organism overall, combined with 100% of the reads having the adapter sequence as their first bases. Additionally, the nature of a transposon generated library means a large number of reads also have the same barcode sequence on their 3' terminal end, at least for those reads which are completely sequenced by the pore.
    4. Why does the length distribution look backwards?
      1. Illumina's sequencing by synthesis design means that there are only enough reagents available to sequence a given number of cycles (bases), so there is a maximum length.
      2. Nanopore is all about the fragment length and the fragment traversing the pore, so fewer reads will be as long. In the next section you will determine what the minimum read length is to be considered 'passing filters' and thus included in the output.
    5. How can there be no overrepresented sequences if 100% of the reads are supposed to start with the same barcode sequence, and why did it not identify the adapter sequence?
      1. In both cases (like with quality score determination in 1.) this is related to fastqc having expectations centered around Illumina sequencing. Interestingly, you may notice a slight tick up in the polyA line; this suggests that failed pores may automatically read as "A". A quick way to check the barcode claim yourself is sketched below.
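
    If you want to sanity check the barcode claim, a minimal sketch (assuming the BC01 sequence AAGAAAGTTGTCGGTGTCTTTGTG reported by porechop later in this tutorial, and counting only exact matches, so error-containing reads will be missed) might look like:

    Code Block
    languagebash
    titleHypothetical check for the BC01 barcode near the read starts
    # keep only the sequence lines (line 2 of every 4-line record),
    # then count how many contain the barcode within their first 30 bases
    zcat barcode01.combined.fastq.gz | awk 'NR % 4 == 2' | cut -c1-30 | grep -c "AAGAAAGTTGTCGGTGTCTTTGTG"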

    Seqfu

    If you installed seqfu in a new conda environment at the start of this tutorial you can also use it to look at some overall sequencing stats. In my own work, I find this particularly useful for comparing across samples and trying to correlate assembly or variant calling outcomes with these metrics. Luckily the command is very simple and with a single sample it does not take too long.

    Code Block
    titleThis same command can be used to check the distributions of each of the 4k read files, or of the combined file, depending on which directory you are in.
    seqfu stats -n -v *.gz --csv

    For the combined reads from this we can see:

    1. The minimum length to pass filter is probably 100bp.
      1. We only actually detect reads at 104 bp and above, but 104 as a cutoff makes no sense.
    2. The largest read is ~103kb, about 30x larger than the average read.
    3. The information regarding the % of reads which comprise a given fraction of the total length, as discussed in the presentation.
      1. Again, auN is likely something you aren't familiar with, but https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity is a good resource for beginning to understand it (a back-of-the-envelope calculation is sketched below).
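
    If it helps to demystify the metric, auN works out to the length-weighted average read length: the sum of squared lengths divided by the total length. A minimal sketch computing it directly from the fastq file (assuming standard 4-line records) might look like:

    Code Block
    languagebash
    titleHypothetical back-of-the-envelope auN calculation with awk
    # sequence lines are line 2 of every 4-line fastq record;
    # auN = sum(len^2) / sum(len), i.e. the length-weighted mean read length
    zcat barcode01.combined.fastq.gz | awk 'NR % 4 == 2 {len = length($0); total += len; sumsq += len * len} END {printf "reads=%d total_bp=%d auN=%.1f\n", NR/4, total, sumsq/total}'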

    Quality Control

    Given the information you have about the data set, there are a few things that can be done to improve it. Take a moment to consider what we saw above in the fastqc and seqfu results, and how you might do this, before continuing.

    Adapter Trimming

    Ultimately, this is something that is not routinely done and does not appear to interfere with data analysis (citations needed). Unfortunately, our favorite trimming tool fastp does not work on nanopore data, nor do other programs like trimmomatic.

    porechop

    https://github.com/rrwick/Porechop is a tool specifically developed for removing adapter sequences from nanopore libraries. Unfortunately, it can be quite slow, and it has been abandoned by its creator because of the difficulty of keeping it up and running, as you can see on the github page. That being said, the tool is functional and it does work. It also provides an option to remove reads in which a different barcode sequence is detected in the middle of the read, which seems like a safer bet. As mentioned, this command can take quite a bit of time to complete; consider modifying the following command to work on only one of the individual 4,000-read files (a hypothetical example of this follows the suggested command below).

    Code Block
    languagebash
    titleSuggested command
    porechop -t 48 -i barcode01.combined.fastq.gz -o barcode01.combined.adapter-trimmed.fastq.gz --discard_middle
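
    To gauge runtime before committing to the whole dataset, a minimal sketch of the same command on a single chunk might look like (the input filename is a placeholder; substitute one of the individual 4,000-read files):

    Code Block
    languagebash
    titleHypothetical test run on a single 4,000-read file
    # "chunk_00.fastq.gz" is a placeholder name; use one of your actual 4k-read files
    porechop -t 48 -i chunk_00.fastq.gz -o chunk_00.adapter-trimmed.fastq.gz --discard_middle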

    From the output of the full run you can notice the following about the adapters present:

    No Format
    Trimming adapters from read ends
      Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA
           BC01_rev: CACAAAGACACCGACAACTTTCTT
               BC01: AAGAAAGTTGTCGGTGTCTTTGTG
           BC02_rev: ACAGACGACTACAAACGGAATCGA
               BC02: TCGATTCCGTTTGTAGTCGTCTGT

    Obviously there is some concern about the presence of both barcode 1 and barcode 2 sequences in a single sample. I don't have an explanation for how this can happen, either in terms of the library prep or in terms of the sequencer splitting the reads based on the presence of this signal, but I suspect it may be related to the middle adapter situation. Reads with an adapter in the middle of the read are quite rare:

    No Format
    2,459 / 132,952 reads were discarded based on middle adapters

    In total more than 14 Mbp were removed:

    No Format
    132,715 / 132,952 reads had adapters trimmed from their start (13,594,659 bp removed)
     19,374 / 132,952 reads had adapters trimmed from their end (708,769 bp removed)



    Cutadapt

    As mentioned in the fastqc results, the first few bases are directly related to the barcode sequence present, and as mentioned in many other tutorials, when you can get rid of these artifacts, it's usually a good idea. In the initial read QC tutorial we used cutadapt to remove low quality bases from the 3' end of the read, but it can just as easily be used to remove bases from the 5' end.

    Code Block
    titleCan you figure out what option you need to do this using cutadapt --help? (don't forget you also want the output to go somewhere other than your terminal)
    collapsetrue
    cutadapt -o barcode01.combined.trimmed.fastq -u 9 barcode01.combined.fastq


    Expand
    titleHow many bases were removed?

    ~1.2M based on:

    Total basepairs processed:   438,208,826 bp

    Total written (filtered):    437,012,258 bp (99.7%)
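
    If you also want to revisit the 3' quality trimming from the earlier read QC tutorial in the same pass, a minimal sketch combining both trims might look like (the -q 10 threshold is an assumption, chosen low because nanopore quality scores run lower than Illumina's):

    Code Block
    languagebash
    titleHypothetical combined 5' trim and 3' quality trim
    # -u 9 removes a fixed 9 bases from the 5' end; -q 10 quality-trims the 3' end (threshold is an assumption)
    cutadapt -q 10 -u 9 -o barcode01.combined.q10.trimmed.fastq barcode01.combined.fastq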


    Filtering

    As mentioned in several of the presentations, you don't need super high coverage for most applications (and when you do need high coverage, you often have to contend with error rates). Therefore it is often beneficial to use only the best of the best reads. filtlong (https://github.com/rrwick/Filtlong) is a program that allows you to set various thresholds to make these divisions. Those who are extra observant will notice this is the same developer who created porechop.

    What filtering needs to be done will largely depend on the downstream applications, and a single dataset may even be filtered different ways for different phases of the analysis (i.e. only the longest reads used to assemble, while the highest quality reads are used to polish an assembly, removing as many SNVs as possible; a sketch of this two-set approach follows below). One of the benefits of this program, beyond its capabilities, is that the github page is extremely descriptive, allowing easy understanding of the commands and what you need to modify. For now, focus on the example command for having no reference available: https://github.com/rrwick/Filtlong#example-commands-detailed. Since we are under 500 Mbp, you should ignore that option.

    Code Block
    titleWhat command did you come up with?
    collapsetrue
    filtlong --min_length 1000 --keep_percent 90 barcode01.combined.fastq | gzip > barcode01.combined.min1k.top90.fastq.gz


    Expand
    titleHow many reads remain after this filtering?

    83,189. Seqfu also shows an increase of ~50% in average read length.
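
    To make the two-set idea from above concrete, a minimal sketch might look like (the thresholds and weights are assumptions, not recommendations; see the filtlong github page for how --length_weight and --mean_q_weight shift read scoring):

    Code Block
    languagebash
    titleHypothetical assembly-set vs polishing-set filtering
    # favor long reads for assembly by weighting length more heavily
    filtlong --min_length 5000 --keep_percent 90 --length_weight 10 barcode01.combined.fastq | gzip > barcode01.combined.assembly-set.fastq.gz
    # favor accurate reads for polishing by weighting mean quality more heavily
    filtlong --keep_percent 50 --mean_q_weight 10 barcode01.combined.fastq | gzip > barcode01.combined.polish-set.fastq.gz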


    Next Steps

    Consider trimming in different ways, or making more or less stringent filtered sets, before moving on to the next tutorials.


    Return to the GVA 2023 course page