...

There is just one data file for genome coverage. Unlike the per-sample files, it has a header, with an arbitrary tag for the categories in the 1st column, then dataset names and their counts in subsequent columns. (I've re-formatted the data for readability, but remember that all .tsv file data must be tab-separated.)

Code Block
titlecombined_genomecov.tsv
count       5k_nuclei    50k_nuclei
(a) none    2140984435   2175228345
(b) 1-2     237947623    351105871
(c) 3-10    308665107    186361275
(d) 11-50   38729079     17356704
(e) 51-100  3473642      780078
(f) 100+    1071888      39501
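
Because the listing above is space-aligned for display, it can help to confirm that the real file is tab-separated before handing it to MultiQC. The following is a minimal sketch, not part of the original workflow: check_tsv is a hypothetical helper name, and it only checks that the header has at least two tab-separated fields and that every line has the same field count as the header.

```bash
# check_tsv FILE (hypothetical helper, not a MultiQC command):
# succeeds only if the header has 2+ tab-separated fields and every
# line in FILE has the same number of fields as the header.
check_tsv() {
    awk -F'\t' '
        NR == 1 {
            expected = NF
            if (NF < 2) {
                print "line 1: header has no tab-separated columns"
                bad = 1
            }
        }
        NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use:
#   check_tsv combined_genomecov.tsv && echo "looks tab-separated"
```

A file that was pasted with spaces instead of tabs shows up as a single-field header, so it fails the first check rather than silently passing.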

Here we edit the multiqc_config.yaml configuration file to add appropriate custom data sections:

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

# --------------------------------
# Custom data
# --------------------------------
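custom_content:
  order:
    - bowtie2_isize_section
    - bowtie2_mapq_section
    - genome_coverage_section
custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'
    bowtie2_mapq:
        id: 'bowtie2_mapq_section'
        section_name: 'Mapping quality'
        description: 'distribution for aligned reads before filtering'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'bowtie2_mapq_plot'
            title: 'Mapping quality scores'
            ymax: 60000000
    genome_coverage:
        id: 'genome_coverage_section'
        section_name: 'Genome coverage'
        description: 'of mapped inserts (bedtools genomecov -fs), grouped into coverage count categories'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'genome_coverage_plot'
            title: 'Position coverage by coverage count category'
            logswitch: True
            stacking: null

sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'
    bowtie2_mapq_section:
        fn: '*.mapq_histogram.tsv'
    genome_coverage_section:
        fn: 'combined_genomecov.tsv'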

Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP /work/01063/abattenh/projects/byteclub/multiqc/06_custom_linegraph/ 02_bowtie/

Then the usual...

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc/02_bowtie; rm -rf mqc_report*; multiqc .

The resulting report (http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/06_custom_linegraph.mqc_report.html) now includes our insert size distribution data in the custom data section we configured, a new section called Bowtie2 insert size.

What's cool is that this "sawtooth" insert size distribution occurs because of the way transposons insert into the major groove of DNA at regular intervals. So this graph shows Igor that his ATAC-seq proof-of-concept experiment worked!

Making MultiQC run faster and be less confused

By default, MultiQC scans all files in the analysis directory you specify. This can take quite a while for complex directory hierarchies with many files that will not be used by MultiQC.

Additionally, MultiQC can get confused when the same (or similar) data is found in different files, or in different directories.

To address these issues, it is a good practice to copy everything you want MultiQC to process into a single directory, then either specify just that directory on the multiqc command line (e.g. multiqc for_multiqc), or exclude other directories in the multiqc_config.yaml file.

For example, here we can stage all the reports we want MultiQC to process in our for_multiqc directory:

Code Block
languagebash
cd ~/playtime/multiqc/atacseq/for_multiqc
ln -s -f ../fastqc
cp -p ../bowtie2/*.flagstat.txt  .
cp -p ../bowtie2/*.idxstats.txt  .

Your for_multiqc directory should now contain:

Code Block
brain_50k_nuclei.bowtie2_isizes.tsv
brain_50k_nuclei.dupmetrics.txt
brain_50k_nuclei.flagstat.txt
brain_50k_nuclei.idxstats.txt
brain_5k_nuclei.bowtie2_isizes.tsv
brain_5k_nuclei.dupmetrics.txt
brain_5k_nuclei.flagstat.txt
brain_5k_nuclei.idxstats.txt
fastqc
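
Before running MultiQC, a quick check along these lines can confirm that the staging step worked. It is a sketch under assumptions: check_staged is a hypothetical helper, and the patterns passed to it are simply the file names from the listing above written as shell globs.

```bash
# check_staged DIR PATTERN... (hypothetical helper):
# reports every glob PATTERN that matches nothing directly inside DIR
# and returns non-zero if any pattern had no match.
check_staged() {
    local dir=$1 pattern f found missing=0
    shift
    for pattern in "$@"; do
        found=0
        # An unmatched glob stays literal, so it is enough to test
        # whether the first expansion actually exists.
        for f in "$dir"/$pattern; do
            if [ -e "$f" ]; then
                found=1
            fi
            break
        done
        if [ "$found" -eq 0 ]; then
            echo "no files match: $dir/$pattern" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use, run from the directory that contains for_multiqc:
#   check_staged for_multiqc '*.bowtie2_isizes.tsv' '*.dupmetrics.txt' \
#       '*.flagstat.txt' '*.idxstats.txt' fastqc
```

Quoting the patterns keeps the calling shell from expanding them, so each one is matched inside the staging directory rather than wherever the command happens to be run.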

Then:

Code Block
languagebash
cd ~/playtime/multiqc/atacseq; rm -rf mqc_report*
multiqc for_multiqc
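
To get a rough idea of how much staging narrows the search, you can compare how many files sit under the whole analysis directory with how many sit under for_multiqc. This is only a sketch: count_files is a hypothetical helper, and file count is just a proxy for scan time. It uses find -L so that files behind symbolic links, such as the linked fastqc directory, are counted, on the assumption that MultiQC follows the link as well.

```bash
# count_files DIR (hypothetical helper):
# prints the number of regular files under DIR, following symlinks.
count_files() {
    # tr strips the padding some wc implementations add around the number
    find -L "$1" -type f | wc -l | tr -d ' '
}

# Example use, run from the analysis directory:
#   count_files .              # everything MultiQC scans by default
#   count_files for_multiqc    # only the staged reports
```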

You can also exclude the bowtie2 directory entirely by listing it under fn_ignore_dirs. Our final multiqc_config.yaml file then looks like this:

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'
fn_ignore_dirs:
    - bowtie2

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

# --------------------------------
# Custom data
# --------------------------------
custom_content:
  order:
    - bowtie2_isize_section
    - bowtie2_mapq_section
    - genome_coverage_section
    - iyer_seq_history_section

custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'
    bowtie2_mapq:
        id: 'bowtie2_mapq_section'
        section_name: 'Mapping quality'
        description: 'distribution for aligned reads before filtering'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'bowtie2_mapq_plot'
            title: 'Mapping quality scores'
            ymax: 60000000
    genome_coverage:
        id: 'genome_coverage_section'
        section_name: 'Genome coverage'
        description: 'of mapped inserts (bedtools genomecov -fs), grouped into coverage count categories'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'genome_coverage_plot'
            title: 'Position coverage by coverage count category'
            logswitch: True
            stacking: null
    iyer_seq_history:
        id: 'iyer_seq_history_section'
        section_name: 'Iyer lab sequencing'
        description: '- history of alignments by type'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'iyer_seq_history_plot'

sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'
    bowtie2_mapq_section:
        fn: '*.mapq_histogram.tsv'
    genome_coverage_section:
        fn: 'combined_genomecov.tsv'
    iyer_seq_history_section:
        fn: 'iyer_sequencing_history.tsv'

# file suffixes to remove when generating sample names...
extra_fn_clean_exts:
    - type: 'replace'
      pattern: '.mapq_histogram.tsv'
    - type: 'replace'
      pattern: '.genomecov.tsv'

...

Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP /work/01063/abattenh/projects/byteclub/multiqc/07_custom_bargraph/ 02_bowtie/

Then the usual...

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc/02_bowtie; rm -rf mqc_report*; multiqc .

The resulting report includes our new Mapping quality and Genome coverage sections and should look like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/07_custom_bargraph.mqc_report.html.

As before, stage all the reports we want MultiQC to process in the for_multiqc directory:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc
ln -s -f ../fastqc
cp -p ../bowtie2/*.flagstat.txt  .
cp -p ../bowtie2/*.idxstats.txt  .

Your for_multiqc directory should now contain everything we want MultiQC to use:

Code Block
brain_50k_nuclei.bowtie2_isizes.tsv
brain_50k_nuclei.dupmetrics.txt
brain_50k_nuclei.flagstat.txt
brain_50k_nuclei.idxstats.txt
brain_50k_nuclei.mapq_histogram.tsv
brain_5k_nuclei.bowtie2_isizes.tsv
brain_5k_nuclei.dupmetrics.txt
brain_5k_nuclei.flagstat.txt
brain_5k_nuclei.idxstats.txt
brain_5k_nuclei.mapq_histogram.tsv
combined_genomecov.tsv
fastqc

Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP /work/01063/abattenh/projects/byteclub/multiqc/08_final/ 02_bowtie/

Run MultiQC again, but this time just point it at the for_multiqc directory:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc/02_bowtie
rm -rf mqc_report*
multiqc for_multiqc

Alternatively, you could exclude the bowtie2 directory entirely by listing it under fn_ignore_dirs in multiqc_config.yaml.

In either case, the final report should look just as it did for the previous section: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/08_final.mqc_report.html.

References

...