Page Comparison

...

Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

...

We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:

the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

...

The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

Code Block

2-micron        251     1523    +       1       R0010W
2-micron        1886    3008    -       1       R0020C
2-micron        3270    3816    +       1       R0030W
2-micron        5307    6198    -       1       R0040C
chrI            334     792     +       2       YAL069W,YAL068W-A
chrI            1806    2169    -       1       YAL068C
chrI            2479    2707    +       1       YAL067W-A
chrI            7234    9016    -       1       YAL067C
chrI            10090   10399   +       1       YAL066W
chrI            11564   11951   -       1       YAL065C

As we specified:

Output column 4 has the region's strand.
Column 5 is the count of merged regions

...

Column 6 is a collapsed, comma-separated list of the merged gene names

...

Exercise: Compare the number of regions in the merged and before-merge gene files.

...

Expand

title	Answer

Output column 5 has the gene count.

Code Block

language	bash

cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.gene contributed.

Or being fancy:

Code Block

language	bash

cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n \
  | grep -v ' 1$' | awk 'BEGIN{ct=0}{ct=ct+$1}END{print ct}'

Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

...

To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

...

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri     # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

...

Expand

title	Answer

Code Block

language	bash

module load samtools

cd $SCRATCH/core_ngs/bedtools_multicov
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

Code Block

title	samtools flagstat output

3347559 + 0 in total (QC-passed reads + QC-failed reads)
24317 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3347559 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c

Code Block

title	distribution of mapping qualities

To compute average mapping quality:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Expand

title	Setup (if needed)

Code Block

language	bash

idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS  # or -A TRA23004
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

...

Expand

title	Answers

Code Block

language	bash

wc -l yeast_mrna_gene_counts.bed

6485 records were written, one for each feature in the merged.sc_genes.bed file.

The overlap count was added as the last field in each output record (. So here it is field 7 , since the input annotation file had 6 columns).

Exercise: How many features have non-zero overlap counts?

...

A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

Go to the UCSC Genome Browser: https://genome.ucsc.edu/
Select Genomes from the top menu bar
Select Human from POPULAR SPECIES
- under Human Assembly select Feb 2009 (GrCh37/hg19)
- select GO
In the hg19 browser page, the Layered H3K27Ac track is a signal track
- the x-axis is the genome position
- the y-axis represents the count of ChIP-seq reads that overlap each position
  - where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)

...

Code Block

language	bash
title	Prepare for bedtools coverage

idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5     # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

module load biocontainers
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
cd $SCRATCH/core_ngs/bedtools_genomecov 
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools_genomecov
bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph

wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines

...

Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:

...

Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed(see Tips and tricks#perlpatternsubstitution):

Code Block

language	bash

module load ucsc_tools

cd $SCRATCH/core_ngs/bedtools_genomecov
bedGraphToBigWig  # look at its usage

# create the needed chromosome sizes file from our BAM header
module load samtools
samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
  perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt

cat sc_chrom_sizes.txt

# displays:
chrI    230218
chrII   813184
chrIII  316620
chrIV   1531933
chrV    576874
chrVI   270161
chrVII  1090940
chrVIII 562643
chrIX   439888
chrX    745751
chrXI   666816
chrXII  1078177
chrXIII 924431
chrXIV  784333
chrXV   1091291
chrXVI  948066
chrM    85779

...

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C  # may or may not need this...
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw

...

Versions Compared

Old Version 87

New Version Current

Key