...
Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
...
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:
- the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
...
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
Code Block |
---|
2-micron 251 1523 + 1 R0010W 2-micron 1886 3008 - 1 R0020C 2-micron 3270 3816 + 1 R0030W 2-micron 5307 6198 - 1 R0040C chrI 334 792 + 2 YAL069W,YAL068W-A chrI 1806 2169 - 1 YAL068C chrI 2479 2707 + 1 YAL067W-A chrI 7234 9016 - 1 YAL067C chrI 10090 10399 + 1 YAL066W chrI 11564 11951 - 1 YAL065C |
As we specified:
- Output column 4 has the region's strand.
- Column 5 is the count of merged regions
...
- Column 6 is a collapsed, comma-separated list of the merged gene names
...
Exercise: Compare the number of regions in the merged and before-merge gene files.
...
Expand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||
Output column 5 has the gene count.
Produces this histogram:
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.gene contributed. Or being fancy:
|
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
...
To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).
...
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri # or -A TRA23004 # or idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
...
Expand | |||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities:
To compute average mapping quality:
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...
Expand | |||||
---|---|---|---|---|---|
| |||||
|
...
Expand | |||||
---|---|---|---|---|---|
| |||||
6485 records were written, one for each feature in the merged.sc_genes.bed file. The overlap count was added as the last field in each output record (. So here it is field 7 , since the input annotation file had 6 columns). |
Exercise: How many features have non-zero overlap counts?
...
A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).
- Go to the UCSC Genome Browser: https://genome.ucsc.edu/
- Select Genomes from the top menu bar
- Select Human from POPULAR SPECIES
- under Human Assembly select Feb 2009 (GrCh37/hg19)
- select GO
- In the hg19 browser page, the Layered H3K27Ac track is a signal track
- the x-axis is the genome position
- the y-axis represents the count of ChIP-seq reads that overlap each position
- where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)
...
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5 # or -A TRA23004 # or idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 module load biocontainers module load bedtools mkdir -p $SCRATCH/core_ngs/bedtools_genomecov cd $SCRATCH/core_ngs/bedtools_genomecov cp $CORENGS/catchup/bedtools_merge/merged*bed . cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools_genomecov bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines |
...
Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:
...
Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed(see Tips and tricks#perlpatternsubstitution):
Code Block | ||
---|---|---|
| ||
module load ucsc_tools cd $SCRATCH/core_ngs/bedtools_genomecov bedGraphToBigWig # look at its usage # create the needed chromosome sizes file from our BAM header module load samtools samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \ perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt cat sc_chrom_sizes.txt # displays: chrI 230218 chrII 813184 chrIII 316620 chrIV 1531933 chrV 576874 chrVI 270161 chrVII 1090940 chrVIII 562643 chrIX 439888 chrX 745751 chrXI 666816 chrXII 1078177 chrXIII 924431 chrXIV 784333 chrXV 1091291 chrXVI 948066 chrM 85779 |
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C # may or may not need this...
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw |
...