Page Comparison

...

If all went well, you should now see a mqc_report.html file and a mqc_report_data directory. Your newly-generated mqc_report.html report file in should look like this (note the new title and header): http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/02_custom.mqc_report.html.

Tips for working with the MultiQC configuation file

...

First stage some mm10 bowtie2 alignment data:xx

Code Block

language	bash

code

xx

...

mkdir -p $SCRATCH/byteclub/multiqc/02_bowtie
cd $SCRATCH/byteclub/multiqc/02_bowtie
ln -s -f /work/01063/abattenh/projects/byteclub/multiqc/fastqc
rsync -avrP /work/01063/abattenh/projects/byteclub/multiqc/bowtie2/ bowtie2/

Take a look at the contents of the bowtie2 directory. It contains typical output files from running Anna's align_bowtie2_illumina.sh alignment script.

MultiQC will look at all files in this directory looking for report formats it understands. Here, reports that MultiQC will recognize as-is include:

<prefix>.flagstat.txt - output from running samtools flagstat
<prefix>.idxstats.txt - output from running samtools idxstats
<prefix>.dupinfo.txt - output from running Picard MarkDuplicates

Note that output from samtools flagstat and samtools idxstats will only be recognized by MultiQC if the files names include the words flagstat and idxstats. Fortunately, Anna's script created files with those names!

Now run multiqc again using the previous MultiQC configuration created above.

Code Block

language	bash

mkdir -p $SCRATCH/byteclub/multiqc/02_bowtie
cd $SCRATCH/byteclub/multiqc/02_bowtie
cp ../01_fastq/multiqc_config.yaml .
multiqc .

If all went well, you should now see a mqc_report.html file that looks like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/03_bowtie.mqc_report.html, with new sections for Picard and Samtools reports.

Fix the Picard MarkDuplicates sample name

Notice there is something odd going on in the new General Statistics section. We see M Reads Mapped entries for samples called brain_50k_nuclei and brain_5k_nuclei, but % Dups entries for samples named brain_50k_nuclei.sort and brain_5k_nuclei.sort.

To see where a General Statistics column comes from, hover over the column header. Doing this tells us that the the M Reads Mapped figures came from the samtools flagstat report, while the % Dups comes from Picard MarkDuplicates.

Take a look at one of the <prefix>.dupinfo.txt files to see what might be going on. Below I've added line breaks to the command line info for clarity.

Code Block

title	brain_5k_nuclei.dupinfo.txt

## htsjdk.samtools.metrics.StringHeader
# picard.sam.markduplicates.MarkDuplicates INPUT=[brain_5k_nuclei.sort.bam] 
OUTPUT=brain_5k_nuclei.sort.dup.bam 
METRICS_FILE=brain_5k_nuclei.dupinfo.txt ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 
SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false 
REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false 
DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES 
PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates 
READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> 
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 
MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
GA4GH_CLIENT_SECRETS=client_secrets.json
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Jul 05 23:20:57 CDT 2017

## METRICS CLASS        picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS       UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION  ESTIMATED_LIBRARY_SIZE
brain_5k_nuclei 0       28666117        0       16562322        0       12504025        902024  0.436 195     23118972

## HISTOGRAM    java.lang.Double
BIN     VALUE
1.0     1.016471
...

This is a standard metrics file produced by running Picard MarkDuplicates with a METRICS_FILE option specified. Note the INPUT bam filename: brain_5k_nuclei.sort.bam. MultiQC parses this report looking for the INPUT file, and uses it (minus the .bam suffix) as the sample name. So brain_5k_nuclei.sort.bam becomes sample name brain_5k_nuclei.sort.

The solution is to modify the metrics so that it has brain_5k_nuclei.bam as its INPUT option. This sounds straightforward but there are a couple of gocha's.

We don't want to modify the original metrics file, as it is a faithful record of what was executed. So we want to make a copy of the metrics file with the modified INPUT parameter.
We want MultiQC to ignore the original metrics file, and only use our "fixed" metrics log.
- But MultiQC will, by default, look at all reports in the specified analysis directory (and its sub-directories), and will then find two metrics reports.

For the first part of the solution, we'll create a modified version of the metrics files, but not in the alignment directory (after all, here we're using a link to Anna's directory, and we don't want to mess with its contents), but in a new for_multiqc directory.

Code Block

language	bash

mkdir -p ~/playtime/multiqc/atacseq/for_multiqc
cd ~/playtime/multiqc/atacseq/for_multiqc
for f in ../bowtie2/*.dupinfo.txt; do
  bn=`basename $f`
  pfx=${bn%%.dupinfo.txt}
  echo "$f - $pfx"
  cat $f | sed 's/[.]sort//g' > ${pfx}.dupmetrics.txt
done

Executing ls -1 ~/playtime/multiqc/atacseq/for_multiqc should now show 2 files:

Code Block
brain_50k_nuclei.fixed.dupmetrics.txt brain_5k_nuclei.fixed.dupmetrics.txt

The final piece of the puzzle is to tell MultiQC to ignore the original <prefix>.dupinfo.txt files by modifying the multiqc_config.yaml file, adding a fn_ignore_files list entry.

Code Block

title	multiqc_config.yaml

# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

After making this config file modification, you can now run multiqc again:

Code Block

language	bash

cd ~/playtime/multiqc/atacseq; rm -rf mqc_report*; multiqc .

The resulting report will look like this: ftp://gapdh.icmb.utexas.edu/misc/multiqc/04.mqc_report.fixed_align_info.html

Controlling report section order

You may have noticed that MultiQC lists its report sections in fairly random order. If you're Type-A like me, you'll want sections ordered to reflect the sequence of analyses performed. This can be controlled by adding a list of top_module entries.

Code Block

title	multiqc_config.yaml

# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

And again...

Code Block

language	bash

cd ~/playtime/multiqc/atacseq; rm -rf mqc_report*; multiqc .

Producing a report like this: ftp://gapdh.icmb.utexas.edu/misc/multiqc/05.mqc_report.changed_section_order.html

About MultiQC custom data

When MultiQC does not know about data produced by a program it doesn't know about, it has a mechanisms for adding custom report sections. The simple way to do this is declaratively, (i.e., via configuation parameters) as described below. (You can also write a Python module for very fine-grained control, but that is a lot more work.)

Format the data appropriately
- MultiQC supports a number of data file formats (yaml, comma-separated values, etc.)
  - I recommend using simple tab-delimited text files, with MultiQC's preferred .tsv extension.
- data can be provided as one file per sample (where the sample name is part of the file name)
  - or as a single table-like file containing data for all samples
Add two required custom data section entries in the multiqc_config.yaml configuration file
- a sp (search path) section for finding report data
  - specifying a wildcard pattern, if data is supplied as one file per sample
  - or a single file name for a consolidated data file
- each report has a user-named section under a single custom_data section.
  - the required id attribute must be unique, and ties the custom_data, sp and custom_content sections
  - other important attributes include description, file_format, and plot_type.
  - a pconfig sub-section contains plot configuration options
Specify the ordering of the custom report section (optional)
- add a custom_content section order list entry

See http://multiqc.info/docs/#configuration for more details about the structure of custom data sections in multiqc_config.yaml.

MultiQC supports these custom plot types:

bargraph
linegraph
scatter
table
generalstats
beeswarm
heatmap

Plot-type-specific plotting options can be specified in each custom data report's pconfig section. See http://multiqc.info/docs/#plotting-functions for more information. While this section is written for Python programming, the options listed in each plot type's "config" block can be used in the plot's plot's pconfig section (modulo the Python versus YAML formatting differences).

Adding a custom linegraph

Here we'll create a linegraph report of insert sizes for the bowtie2 alignments.

We start with the <prefix>.insertsz.txt files produced by Anna's align_bowtie2_illumina.sh script. That file has both positive and "negative" insert sizes with read counts for all mates, proper pairs, and R1s and R2s individually. For a data file compatible with MultiQC's linegraph, we want only two columns: the positive insert size and count of proper pairs with that size (negative sizes are redundant since each proper pair will have one positive and one negative size entry of the same magnitude).

So a bit of command line reformatting is needed to produce files for MultiQC, which we will save in our for_multiqc directory.

Code Block

language	bash

mkdir -p ~/playtime/multiqc/atacseq/for_multiqc
cd ~/playtime/multiqc/atacseq/for_multiqc
for f in ../bowtie2/*.insertsz.txt; do
  bn=`basename $f`
  pfx=${bn%%.insertsz.txt}
  echo "$f - $pfx"
  tail -n +2 $f | grep -v -P '^-' | cut -f 1,3 > ${pfx}.bowtie2_isizes.tsv
done

Next we edit the multiqc_config.yaml configuration file to add appropriate custom data sections:

Code Block

title	multiqc_config.yaml

# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

# --------------------------------
# Custom data
# --------------------------------
custom_content:
  order:
    - bowtie2_isize_section

custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'

sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'

Then the usual...

Code Block

language	bash

cd ~/playtime/multiqc/atacseq; rm -rf mqc_report*; multiqc .

Resulting in a report that includes our inset size distribution data the custom data section we configured: ftp://gapdh.icmb.utexas.edu/misc/multiqc/06.mqc_report.custom_linegraph.html

References

Main MultiQC links

...

Versions Compared

Old Version 14

New Version 15

Key

Tips for working with the MultiQC configuation file

Fix the Picard MarkDuplicates sample name

Controlling report section order

About MultiQC custom data

Adding a custom linegraph

References

Main MultiQC links