Overview
The fastQC tool was presented on the first day of the class as the go-to tool for quality control analysis of fastq files. Checking every fastq file one at a time is daunting, however, and evaluating each file in isolation can introduce its own set of artifacts or biases. The MultiQC tool works directly on fastQC reports to quickly generate a summary report, which helps both to identify samples that differ from the rest of a group and to make global decisions about how to treat a set of files.
Learning Objectives
In this tutorial, we will:
- work with some simple bash scripting from the command line (for loops) to generate multiple fastqc reports simultaneously and look at 272 plasmid samples.
- work with MultiQC to make decisions about read preprocessing.
- identify outlier files that are clearly different from the group as a whole and determine how to deal with these files.
Get some data and load fastqc
Copy the plasmid sequencing files found in the BioITeam directory gva_course/plasmid_qc/ to a new directory named GVA_multiqc. There are 2 main ways to do this, particularly since there are so many files (544 total).
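One possible way to do the copy is sketched below. It assumes the BioITeam directory location is stored in a variable called BIOITEAM_DIR; that variable name and the fallback path are hypothetical placeholders, so substitute whatever is correct on your system. GVA_multiqc is created in the current directory.

```shell
# BIOITEAM_DIR is a hypothetical variable; point it at the BioITeam directory.
SRC="${BIOITEAM_DIR:-/path/to/BioITeam}/gva_course/plasmid_qc"

# Create the destination; -p avoids an error if it already exists.
mkdir -p GVA_multiqc

if [ -d "$SRC" ]; then
    # Copy every file in the source directory, then count what arrived.
    cp "$SRC"/* GVA_multiqc/
    ls GVA_multiqc | wc -l
else
    echo "Source directory not found: $SRC" >&2
fi
```

If the copy is complete, the count printed at the end should match the 544 files mentioned above.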
Use a bash for loop on the command line to generate a fastQC command for all plasmid samples
We are going to construct a single commands file with 544 lines, one fastqc command per file, without having to know the name of any individual file. To do so, we will use the bash 'for' command.
For loops on the command line have 3 parts:
- A list of items to deal with 1 at a time, followed by a ';'. Example: for f in *.gz;
- Something to do with each item in the list. This must start with the word 'do'. Example: do echo "fastqc -o fastqc_output $f &";
- The word 'done', so bash knows to stop looking for more commands. Example: done > fastqc_commands, where the final redirect (>) sends the output to a file (fastqc_commands in this case) rather than printing it to the screen.
Use the Linux commands head and wc -l to inspect the output.
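Put together, the three parts of the loop and the two checks might look like the sketch below. It assumes you are in the directory holding the .gz files and that nothing else in that directory ends in .gz.

```shell
# Write one fastqc command per .gz file in the current directory.
# echo only prints each command; nothing is executed at this point.
for f in *.gz; do echo "fastqc -o fastqc_output $f &"; done > fastqc_commands

# Show the first few generated commands.
head -n 5 fastqc_commands

# Count the lines; with only the 544 plasmid files present this should be 544.
wc -l < fastqc_commands
```

Because the loop only echoes text, it is safe to rerun: each run overwrites fastqc_commands with a fresh list.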
Next, we need to make the output directory where all the fastqc reports will go, and make the fastqc_commands file executable.
mkdir fastqc_output
chmod +
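The chmod command above is cut off. A minimal sketch of this step follows, assuming the intent was to add execute permission (+x) to fastqc_commands, as the preceding sentence suggests.

```shell
# Create the directory named by the -o option in each generated command.
# -p avoids an error if the directory already exists.
mkdir -p fastqc_output

# Assumption: the truncated chmod was meant to add execute permission.
if [ -f fastqc_commands ]; then
    chmod +x fastqc_commands
    # The permissions column should now include x.
    ls -l fastqc_commands
else
    echo "fastqc_commands not found; generate it with the for loop first" >&2
fi
```

Once the file is executable and fastqc is loaded, it can presumably be launched as ./fastqc_commands. Note that every line ends in '&', so all commands start in the background at once.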
Run MultiQC tool on all fastQC output
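A minimal sketch of this step is shown below. It assumes multiqc is installed and on your PATH and that the fastqc reports were written to fastqc_output; the report directory name multiqc_output is a hypothetical choice, not something the course prescribes.

```shell
# multiqc_output is a hypothetical name for the report directory.
report_dir=multiqc_output

if ! command -v multiqc >/dev/null 2>&1; then
    echo "multiqc not found on PATH" >&2
elif [ ! -d fastqc_output ]; then
    echo "fastqc_output not found; run the fastqc commands first" >&2
else
    # Scan fastqc_output for fastQC results and write the summary report.
    multiqc fastqc_output -o "$report_dir"
fi
```

By default MultiQC names its summary multiqc_report.html, so with these settings the file to open in a browser would be multiqc_output/multiqc_report.html.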
Evaluate MultiQC report
Optional Exercise
Using information gained in the MultiQC report, modify the bash loop used for the fastQC commands to improve the raw reads.