...

Expand
titleSurprising result?

I expect you will see something like this.

No Format
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                                

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

This is probably the least informative error message we have seen in this class thus far: it does not tell us which packages are actually causing the problem, even though it suggests that a specific conflict was found.

If we investigate further, the multiqc homepage does not recommend the "conda install -c bioconda multiqc" command listed on the anaconda page; instead it recommends "conda install -c bioconda -c conda-forge multiqc". Using this command brings us back to conda wanting to upgrade a bunch of packages, including, concerningly:


No Format
  openssl              pkgs/main::openssl-1.0.2u-h7b6447c_0 --> conda-forge::openssl-1.1.1k-h7f98852_0

As this was the dependency that caused so much trouble with samtools and bcftools, we are better off not proceeding.

As we have seen several times now, when we have difficulty getting programs installed together and need them to interact with each other, the easiest solution is often to create a new environment, specifying all desired packages at the same time.

Code Block
languagebash
titleKeeping with our naming convention we used for the breseq environment, we'll call our new environment GVA-multiqc
conda create --name GVA-multiqc -c bioconda -c conda-forge multiqc fastqc
conda activate GVA-multiqc
Expand
titleWhat versions of fastqc and multiqc does this install?
Code Block
languagebash
fastqc --version
multiqc --version

Returns "FastQC v0.11.9" and "multiqc, version 1.10.1" respectively. 
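
Before moving on, it can help to confirm that both programs are actually visible from the currently active environment. The following is a minimal sketch and not part of the original tutorial: the function name check_tools is made up for illustration, and it relies only on the shell's built-in "command -v".

```bash
# Hypothetical helper (not part of conda, fastqc, or multiqc):
# report whether each named program can be found on the current PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if any program was not found.
    return "$missing"
}

# Usage, after 'conda activate GVA-multiqc':
#   check_tools fastqc multiqc
```

If either program is reported as missing, the most likely cause is that the GVA-multiqc environment is not the active one.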



Info

Some may be interested in comparing this year's installation instructions to the pip3 installation instructions provided last year: MultiQC - fastQC summary tool -- GVA2020#Installingusingpip3, with more detailed information at: Linux and Lonestar 5 Setup -- GVA2020#pip

...

Code Block
languagebash
titleClick here for help with copying the files using a wildcard after making a new directory
collapsetrue
mkdir $SCRATCH/GVA_multiqc 
cp $BI/gva_course/plasmid_qc/* $SCRATCH/GVA_multiqc
Code Block
languagebash
titleYou may remember from our Monday tutorials that we installed fastqc via conda in our GVA2021 environment. Make sure you have access to it still.
collapsetrue
fastqc --version

...


Generating FastQC analysis

Here we present 2 different options for performing fastqc analysis on all 500+ samples. Given the very small size of these plasmid sequencing files, the second option, on the idev node, is probably the better choice. Before skipping down to it, I suggest reading through the first option and at least generating the "fastqc_commands" file: in your own work you are likely to deal with larger numbers of large fastq files, which will make option 1 the better choice.

Info
titleA note about running fastqc on the head node

Previously, people have asked if fastqc can be run on the head node. The answer is that for a single sample it is usually fine, but if you are dealing with a large number of samples or a large total number of reads, it is probably not the best idea.

Option 1: job queue system

Throughout the first part of the course we focused on working with a single sample and thus were able to type commands one at a time. We also had only a few input files to deal with in any individual tutorial, so tab completion and ls were very useful. Here we are dealing with 544 files, which is more than the total number of files we dealt with in all the required tutorials combined, and nobody wants to type out 544 commands one at a time. Therefore, we are going to construct a single commands file with 544 lines that we can use to launch all the commands without having to know the name of any single file. To do so we will use the bash 'for' command.
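
As a rough illustration of the idea, a 'for' loop can write one fastqc command per input file into a commands file. This is a minimal sketch under stated assumptions, and the tutorial's own loop may differ: the function name make_fastqc_commands is made up for illustration, the input files are assumed to end in .gz and sit in the current directory, and the output directory is assumed to be fastqc_output.

```bash
# Sketch only: write one fastqc command per gzipped fastq file in the current
# directory into a commands file. The function name is made up for illustration.
make_fastqc_commands() {
    out="${1:-fastqc_commands}"
    for fq in *.gz; do
        # Skip the literal pattern if no .gz files are present.
        [ -e "$fq" ] || continue
        echo "fastqc -o fastqc_output/ $fq"
    done > "$out"
}

# Usage, from the directory holding the fastq files:
#   make_fastqc_commands
#   wc -l fastqc_commands      # should match the number of input files
#   head -n 3 fastqc_commands  # spot-check the first few commands
```

Echoing the commands into a file rather than running them directly is what lets the queue system launch them later.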

...

Code Block
languagebash
titlesubmit the job to run on the queue
mkdir fastqc_output
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm fastqc.slurm
nano fastqc.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is there to remind you what goes on each line. Leaving it alone will not hurt anything; removing it may make it more difficult to remember the purpose of the line.

Line number    As is                                To be
16             #SBATCH -J jobName                   #SBATCH -J multiqc
17             #SBATCH -n 1                         #SBATCH -n 68
21             #SBATCH -t 12:00:00                  #SBATCH -t 0:20:00
22             ##SBATCH --mail-user=ADD             #SBATCH --mail-user=<YourEmailAddress>
23             ##SBATCH --mail-type=all             #SBATCH --mail-type=all
27             conda activate GVA2021               conda activate GVA-multiqc
31             export LAUNCHER_JOB_FILE=commands    export LAUNCHER_JOB_FILE=fastqc_commands

The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. If you do use them, make sure that each of these 2 lines starts with a single # symbol after editing.

Line 27 assumes you named your multiqc environment GVA-multiqc at the beginning of this tutorial. It is included in the table as a reminder that this line must always list the correct environment.

Again, use ctrl-o and ctrl-x to save the file and exit.
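
After saving, it can be worth printing just the edited lines and comparing them to the table, particularly to confirm that lines 22 and 23 begin with a single # symbol. The following is a minimal sketch, not part of the original tutorial: the function name show_slurm_settings is made up for illustration, and the pattern assumes the lines in your launcher file start at the left margin as shown in the table.

```bash
# Hypothetical helper: print the lines of a launcher file that this tutorial
# asks you to edit, with line numbers, so they can be compared to the table.
show_slurm_settings() {
    grep -n -E '^#+SBATCH +(-J|-n|-t|--mail-user|--mail-type)|^conda activate|^export LAUNCHER_JOB_FILE' "$1"
}

# Usage (pass whatever name you gave your copy of the launcher file):
#   show_slurm_settings fastqc.slurm
```

Any mail line that still starts with ## is treated as a comment and will be ignored by the queue system.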

...

Code Block
languagebash
titlesubmit the job to run on the queue
mkdir -p fastqc_output

sbatch fastqc.slurm
Info
titleA note about running fastqc on the head node

In Tuesday's class someone asked if fastqc can be run on the head node. The answer was that for the single sample that we were looking at it was fine, but that if we were going to deal with large numbers of samples or total number of reads it was probably not the best idea. You could run this on an idev node, but it would take more work to set it up to actually run all 544 samples at once, so sending them to the queue system is much more efficient.

Further, you may recall that there is always a trade-off when sending jobs to the queue system: they may not run immediately, and waiting around for a job to even start is not a great use of class time. This leads us to the recommendation of working with this tutorial on Wednesday or near the end of the day on Thursday; if you do so, no matter how long the job takes to run, you can quickly move through the multiqc command and evaluate the results at the start of Thursday's or Friday's class.

Run MultiQC

Hopefully by now the job we submitted with all the fastqc commands has finished. Use the showq -u command to check. If the showq -u command still shows your job in the 'waiting' section, it has not started; once it does start, I would expect it to finish in ~4-5 minutes. It is possible that an error occurs and the job does not finish, so we need to check for the output files.

...


Option 2: idev node

Warning

As mentioned above, we do not want to run this on the head node. Make sure you are on an idev node. Please get my attention if you do not know how to do this at this point, or if you don't know how to check whether you are.
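
One quick way to check where you are is to look at the hostname. The following is only a sketch and not part of the original tutorial: it ASSUMES that head (login) node hostnames begin with "login" and that anything else is a compute node such as an idev session. The function name node_kind is made up for illustration, so confirm the naming on your own system before relying on it.

```bash
# Sketch only. ASSUMPTION: login (head) node hostnames begin with "login";
# any other hostname is treated as a compute node such as an idev session.
node_kind() {
    case "${1:-$(hostname)}" in
        login*) echo "head node" ;;
        *)      echo "compute node" ;;
    esac
}

# Usage:
#   node_kind     # classifies the machine you are currently on
```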

If you look at the fastqc -h options you may notice that there is an option for -t to specify multiple threads and that multiple fastq files can be supplied to a single command. 

Code Block
languagebash
titleThis allows a single command to quickly analyze all samples
fastqc -t 68 -o fastqc_output/ *.gz

Using both the * wildcard and what we are considering the optimal 68 threads, analysis of many samples is initiated at the same time, making the output somewhat difficult to read but significantly increasing the speed at which the samples get analyzed.


Run MultiQC

If you ls the fastqc_output directory, you are hit in the face with more files and directories than you have seen in any directory during this class. You can immediately notice that there is a directory and a compressed version of each of those directories, but in order to know things worked correctly, we need to make sure that we have 2 files for each of our 544 samples. The easiest way to do that, in my opinion, is to pipe the ls output to the wc -l command to count the total number of lines.
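
With 2 entries expected per input, 544 inputs should give 2 * 544 = 1088 lines. The following is a minimal sketch of that comparison, not part of the original tutorial; the function name check_fastqc_outputs is made up for illustration, and it simply counts every entry in the directory.

```bash
# Hypothetical helper: compare the number of entries in an output directory
# with the 2-per-input expectation described above.
check_fastqc_outputs() {
    outdir="$1"
    n_inputs="$2"
    expected=$((2 * n_inputs))
    found=$(ls "$outdir" | wc -l | tr -d ' ')
    echo "expected $expected entries, found $found"
    # Succeed only when the counts agree.
    [ "$found" -eq "$expected" ]
}

# Usage for this tutorial's 544 inputs (2 * 544 = 1088):
#   check_fastqc_outputs fastqc_output 544
```

A lower count than expected usually means some fastqc commands failed or have not finished yet.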

...

As the multiqc_report.html file is an html file, you will need to transfer it back to your laptop to view it. Hopefully, by now you have learned how to do this without needing the scp tutorial open to help you. If not, consider getting my attention on zoom so I can try to help clear up any confusion you may be having.

...

  1. Using information in the MultiQC report, modify the bash loop used to create the fastqc_commands file above to create a cutadapt_commands file that could modify all 544 files at once.
  2. Move over to the trimmomatic tutorial and come back to trim all adapter sequences from all files and rerun fastqc/multiqc to see what a difference trimming makes on overall quality.



Return to the Genome Variant Analysis Course 2021 Home Page