MultiQC - fastQC summary tool -- GVA2020

Overview

The fastQC tool was presented in the second tutorial on the first day of the class as the go to tool for quality control analysis of fastq files, but there is an underlying issue that checking each fastq file is quite daunting and evaluating each file individually can introduce its own set of artifacts or biases. The MultiQC tool represents a tool which works directly on fastQC reports to quickly generate summary reports to both identify samples that are different among a group and to make global decisions about how to treat a set of files.

Learning Objectives

In this tutorial, we will:

  1. work with some simple bash scripting from the command line (for loops) to generate multiple fastqc reports simultaneously and look at 272 plasmid samples.
  2. work with MultiQC to make decisions about read preprocessing.
  3. identify outlier files that are clearly different from the group as a whole and determine how to deal with these files.


Installing multiqc

Checking installation

You may have installed multiqc on the first day at the end of the intro to linux tutorial, so before jumping into trying to install something you already have access to lets verify you actually need to install the program. 

Check if you already installed multiqc was successfully with these 2 commands.
which multiqc
multiqc

If you have already installed multiqc correctly, the first line should return something that starts with /home1/ then has a number and your user id followed by /.local/bin/multiqc. The second line should tell you that there is an error as you didn't provide an argument for the analysis directory as well as that you are using multiqc version 1.9. If so you can then skip down to getting some data.

Installing using pip3

For a general description of pip3 see the end of the linux tutorial.

Preferred simple installation
pip3 install --user multiqc

*note that the "--user" option in the above code is required while working on LS5 because individual users do not have access to core systems. If you have python3 on your personal computer and wanted to install multiqc (or any other package available through pip) you would typically omit the "--user"  flag.

Installation may take a minute or two depending on your internet connection and you will see several progress bars. Eventually you should see a line that starts with "Successfully installed" and then a long list of packages including multiqc-1.9. The additional packages listed are packages that multiqc will use to generate its figures.

Scroll back up to see the 'which' and 'multiqc' commands used to check if the installation was sucessful.

More complicated invocation that may work in some instances when simple invocation fails
python3 -m pip install --user multiqc

If you still see something else, let me know.


Get some data and load fastqc

Copy the plasmid sequencing files found in the BioITeam directory gva_course/plasmid_qc/ to a new directory named GVA_multiqc. There are 2 main ways to do this particularly since there are so many files (544 total). 

Click here for help with copying the files recursively in a single step
cp -r $BI/gva_course/plasmid_qc/ $SCRATCH/GVA_multiqc
Click here for help with copying the files using a wildcard after making a new directory
mkdir $SCRATCH/GVA_multiqc 
cp $BI/gva_course/plasmid_qc/* $SCRATCH/GVA_multiqc
You may remember from the first tutorial on read preprocessing that fastqc is a module you can load
module load fastqc

Generating FastQC commands

Throughtout the first part of the course we focused on working with a single sample and thus were able to type commands 1 at a time. We further only had a few input files that we were dealing with in an individual tutorial thus tab completion and ls are very useful. Here we are dealing with 544 files which is more than the total number of files we dealt with in all the required tutorials combined, and nobody wants to type out 544 commands 1 at a time. Therefore, we are going to construct a single commands file with 544 lines that we can give to the launcher_creator.py script to launch all commands without having to know the name of any single file.  To do so we will use the bash 'for' command.

Before you get started make sure you are in the correct directory and that you have the 544 files you expect to have
cd $SCRATCH/GVA_multiqc
ls


For loops on the command line have 3 parts:

  1. A list of something to deal with 1 at a time. Followed by a ';'
    1. for f in *.gz; in the following example
  2. Something to do with each item in the list. this must start with the word 'do'
    1. do echo "fastqc -o fastqc_output $f "; in the following example
  3. The word "done" so bash knows to stop looking for more commands.
    1. done in the following example, but we add a final redirect (>) so rather than printing to the screen the output goes to a file (fastqc_commands in this case)


Putting it all together
for f in *.gz; do echo "fastqc -o fastqc_output $f"; done  > fastqc_commands

Use the head and wc -l to see what the output is and how much there is of it respectively.

Next we need to make the output directory for all the fastqc reports to go into and send the fastqc_commands file to the queue to execute.

submit the job to run on the que
mkdir fastqc_output
launcher_creator.py -N 1 -w 48 -n "fastqc" -t 00:15:00 -a "UT-2015-05-18" -j fastqc_commands -m "module load launcher/3.0.3"
sbatch fastqc.slurm

A note about running fastqc on the head node

In Tuesday's class someone asked if fastqc can be run on the head node. The answer was that for the single sample that we were looking at it was fine, but that if we were going to deal with large numbers of samples or total number of reads it was probably not the best idea. You could run this on an idev node, but it would take more work to set it up to actually run all 544 samples at once, so sending them to the queue system is much more efficient.

Further you may recall that there is always a trade off when sending stuff to the queue system that it may not run immediately and waiting around for it to even start running is not a great use of class time. This leads us to the recomendation of working with this tutorial on Wednesday or near the end of the day on Thursday as if you do so, no matter how long it takes to run, you can quickly move through the multiqc command and evaluating the results at the start of Thursday or Friday's Course.


Run MultiQC

Hopefully by now the job we submitted with all the fastqc commands has finished. Use the showq -u command to check. If the showq -u command still shows your job in the 'waiting' section it has not started, once it does start i would expect it to finish in ~4-5 minutes. It is possible that an error occurs and does not finish so we need to check for output from the files.

If you ls fastqc_output directory you are hit in the face with more files and directories than you have seen in any directory during this class. You immediately can notice that there is a directory and a compressed version of each of those directories, but in order to know things worked correctly, we need to make sure that we have 2 files for each of our 544 samples. The easiest way to do that in my opinion is to pipe that output to the wc -l command to count the total number of lines.

check contents and count of those contents
ls fastqc_output; ls fastqc_output | wc -l

My personal use

The above command (and its variants) is something I use fairly commonly when working with large data sets. By putting both commands on a single line separated by a semicolon I get both outputs without having a prompt interrupt the output.

Assuming you see both directories and their associated compressed files,  and a total count of 1088 run the 'multiqc -h' command to look through the options and see if you can figure out how to guild the command. 

Running multiqc on the head node is acceptable
cd fastqc_output
multiqc .

In this case (much as is the case with FastQC) while there are are a reasonable number of options that can be used, none are truly needed for evaluating fastq files. The only requirement is that you specify where the FastQC output that you want to generate a single report for is located. In the example above you changed into the directory containing those results and then specified multiqc should look in you current directory for the files. It would have been comparable to stay in the existing directory, and instead use a command of "multiqc fastqc_output". 

Evaluate MultiQC report

As the multiqc_report.html file is a html file, you will need to transfer it back to your laptop to view it. Hopefully, by now you have learned how to do this without needing the scp tutorial open to help you. If not, consider getting my attention on zoom so i can try to help clear up any confusion you may be having.

Once you have the report back on your local computer, open it and begin looking at it. Unlike the FastQC report, the multiqc report comes with detailed help information for each section you can access with the different ?help icons on the right as well as a video at the top of the page. If anything is not clear, and you'd like help clearing it up, let me know.

Optional Exercise

  1. Using information in the MultiQC report, modify the bash loop used to create the fastqc_commands file above to create a cutadapt_commands file that could modify all 544 files at once.
  2. Move over to the trimmomatic tutorial and come back to trim all adapter sequences from all files and rerun fastqc/multiqc to see what a difference trimming makes on overall quality.



Return to the Genome Variant Analysis Course 2020 Home Page