MultiQC - fastQC summary tool -- GVA2023
Overview
The fastQC tool was presented in the second tutorial on the first day of the class as the go to tool for quality control analysis of fastq files, but there is an underlying issue that checking each fastq file is quite daunting and evaluating each file individually can introduce its own set of artifacts or biases. The MultiQC tool represents a tool which works directly on fastQC reports to quickly generate summary reports to both identify samples that are different among a group and to make global decisions about how to treat a set of files.
Learning Objectives
In this tutorial, we will:
- work with some simple bash scripting from the command line (for loops) to generate multiple fastqc reports simultaneously and look at 272 plasmid samples.
- work with MultiQC to make decisions about read preprocessing.
- identify outlier files that are clearly different from the group as a whole and determine how to deal with these files.
Installing multiqc
Hopefully by now you have had enough experience installing packages via conda that it is second nature to think of going to https://anaconda.org/ searching for multiqc, finding https://anaconda.org/bioconda/multiqc, and seeing that it recommends a command of conda install -c bioconda multiqc
suggesting you can use this command and not worry about anything else. Instead, if you go to the multiqc homepage you might (or probably not, disappointingly, it is not very visible) that next to their conda installation instruction is a small warning icon which takes you to a page about setting up bioconda channels as a prerequisite to installing multiqc. Friday's class will go into detail about managing channels like this and it is not suggested that you follow the commands on that page.
This section details what you would have ff you attempt to use the conda command found on the anaconda page this is what you would see. Note that this being in red means you shouldn't type these commands, but it won't actually hurt if you do.
conda create --name GVA-ReadQC -c bioconda multiqc fastqc conda activate GVA-ReadQC fastqc --version multiqc --version
You would expect this to return "0.12.1" and "1.14 respectively for the fastqc and multiqc versions, but instead you would see the following for multiqc:
/work2/01821/ded/stampede2/miniconda3/envs/GVA-ReadQC/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/multiqc/utils/config.py:44: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. configs = yaml.load(f) /work2/01821/ded/stampede2/miniconda3/envs/GVA-ReadQC/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/multiqc/utils/config.py:50: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. sp = yaml.load(f) multiqc, version 1.0.dev0
First, the version lists "dev" in the tittle which typically means it is a temporary development version. Second, we get warnings about pyyaml loader being unsafe and depreciated likely as a matter of safety. Based on this, this version is untested in the tutorial, but the real question is how do we get the versions we want and expect.
conda create -n GVA-ReadQC -c conda-forge -c bioconda multiqc fastqc conda activate GVA-RadQC
The goal of presenting the "incorrect version" way is 3 fold:
- To suggest that there is actually minimal to no oversight or testing of the conda commands you may find.
- That different programs may have different expectations of what channels you have access to by default (anaconda seemingly expects access to conda-forge since it isn't listed in their command, while multiqc expects both conda-forge and bioconda by default as their command lists nothing but an obscure link to a page describing the defaults.
- To demonstrate how you can use error messages to figure out what might be going on.
Get some data and verify access to fastqc
Copy the plasmid sequencing files found in the BioITeam directory gva_course/plasmid_qc/ to a new directory named GVA_multiqc. There are 2 main ways to do this particularly since there are so many files (544 total).
Generating FastQC analysis
Here we present 2 different options for performing fastqc analysis on all 500+ samples. Given the very small size of these plasmid sequencing files, the second option on the idev node is probably a better choice. Before skipping down to it I suggest reading through the first option and at least generating the "fastqc_commands" file as in your own work you are likely to work with larger numbers of large fastq files, which will make option 1 the better choice.
A note about running fastqc on the head node
Previously, people have asked if fastqc can be run on the head node. The answer is that for a single sample it is usually fine, but that if we were going to deal with large numbers of samples or total number of reads it was probably not the best idea.
Option 1: job queue system
Throughout the first part of the course we focused on working with a single sample and thus were able to type commands 1 at a time. We further only had a few input files that we were dealing with in an individual tutorial thus tab completion and ls are very useful. Here we are dealing with 544 files which is more than the total number of files we dealt with in all the required tutorials combined, and nobody wants to type out 544 commands 1 at a time. Therefore, we are going to construct a single commands file with 544 lines that we can use to launch all commands without having to know the name of any single file. To do so we will use the bash 'for' command.
For loops on the command line have 3 parts:
- A list of something to deal with 1 at a time. Followed by a ';'
- for f in *.gz; in the following example
- Something to do with each item in the list. this must start with the word 'do'
- do echo "fastqc -o fastqc_output $f "; in the following example
- The word "done" so bash knows to stop looking for more commands.
- done in the following example, but we add a final redirect (>) so rather than printing to the screen the output goes to a file (fastqc_commands in this case)
Use the head
and wc -l
to see what the output is and how much there is of it respectively.
Next we need to make the output directory for all the fastqc reports to go into and send the fastqc_commands file to the queue to execute. Like our breseq tutorial, this involves the use of a .slurm file.
mkdir fastqc_output cp /corral-repl/utexas/BioITeam/gva_course/GVA.launcher.slurm fastqc.slurm nano fastqc.slurm
Again while in nano you will edit most of the same lines you edited in the in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line, leaving it alone will not hurt anything, removing it may make it more difficult for you to remember what the purpose of the line is
Line number | As is | To be |
---|---|---|
16 | #SBATCH -J jobName | #SBATCH -J multiqc |
17 | #SBATCH -n 1 | #SBATCH -n 48 |
21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00 |
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=fastqc_commands |
The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.
Again use ctl-o and ctl-x to save the file and exit.
conda activate GVA-ReadQC sbatch fastqc.slurm
Option 2: idev node
As mentioned above, we do not want this on the head node. Make sure you are on an idev node. Please get my attention if you do not know how to do this at this point, or if you don't know how to check if you are.
If you look at the fastqc -h
options you may notice that there is an option for -t
to specify multiple threads and that multiple fastq files can be supplied to a single command.
fastqc -t 48 -o fastqc_output/ *.gz
Using both the * wildcard, and 48 threads, analysis of many samples are initiated at the same time making the output somewhat difficult to read, but significantly increasing the speed at which the samples get analyzed.
Run MultiQC
If you ls fastqc_output
directory you are hit in the face with more files and directories than you have seen in any directory during this class. You immediately can notice that there is a directory and a compressed version of each of those directories, but in order to know things worked correctly, we need to make sure that we have 2 files for each of our 544 samples. The easiest way to do that in my opinion is to pipe that output to the wc -l
command to count the total number of lines.
ls fastqc_output; ls fastqc_output | wc -l
My personal use
Assuming you see both directories and their associated compressed files, and a total count of 1088 run the 'multiqc -h' command to look through the options and see if you can figure out how to guild the command.
cd fastqc_output multiqc .
In this case (much as is the case with FastQC) while there are are a reasonable number of options that can be used, none are truly needed for evaluating fastq files. The only requirement is that you specify where the FastQC output that you want to generate a single report for is located. In the example above you changed into the directory containing those results and then specified multiqc should look in you current directory for the files. It would have been comparable to stay in the existing directory, and instead use a command of "multiqc fastqc_output".
Your final product is an html file named: multiqc_report.html
Evaluate MultiQC report
As the multiqc_report.html file is a html file, you will need to transfer it back to your laptop to view it. Hopefully, by now you have learned how to do this without needing the scp tutorial open to help you. If not, consider getting my attention on zoom so i can try to help clear up any confusion you may be having.
Once you have the report back on your local computer, open it and begin looking at it. Unlike the FastQC report, the multiqc report comes with detailed help information for each section you can access with the different ?help icons on the right as well as a video at the top of the page. If anything is not clear, and you'd like help clearing it up, let me know.
Optional Exercise
- Using information in the MultiQC report, modify the bash loop used to create the fastqc_commands file above to create a cutadapt_commands file that could modify all 544 files at once.
- Move over to the fasp tutorial and come back to trim all adapter sequences from all files and rerun fastqc/multiqc to see what a difference trimming makes on overall quality.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.