Use bedtools coverage to create a signal track
A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).
The bedtools coverage function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -d (per-base depth) option produces output that can be made into a bedGraph. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.
Make sure you're in an idev session, then prepare a directory for this exercise.
Code Block |
---|
language | bash |
---|
title | Prepare for bedtools coverage |
---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools_coverage
cp $CORENGS/catchup/bedtools_merge/merged*bed $SCRATCH/core_ngs/bedtools_coverage/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_coverage/ |
Then calling bedtools coverage is easy. The "A" file will be our gene regions, and the "B" file will be the yeast RNAseq reads. We also use the -d (per-base depth) and -s (force "strandedness") options.
Code Block |
---|
|
cds; cd core_ngs/bedtools_coverage
bedtools coverage -s -d -a merged.good.sc_genes.bed -b yeast_mrna.sort.filt.bam > yeast_mrna.gene_coverage.txt
wc -l yeast_mrna.gene_coverage.txt # 8,829,317 lines! |
It will complain a bit because our genes file includes the yeast plasmid "2-micron" but the RNAseq BAM doesn't include that contig. We'll ignore that warning.
The bedtools coverage output is a bit strange. It lists each region in the A file, followed by information from the B reads. Here the column order will be gene_chrom gene_start gene_end gene_name gene_score gene_strand offset_in_the_gene_region read_overlap count.
Let's look at coverage for gene YAL067C:
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | grep -P 'YAL067C' | head -50 |
Will look like this:
Code Block |
---|
chrI 7234 9016 YAL067C 1 - 1 0
chrI 7234 9016 YAL067C 1 - 2 0
chrI 7234 9016 YAL067C 1 - 3 0
chrI 7234 9016 YAL067C 1 - 4 0
chrI 7234 9016 YAL067C 1 - 5 0
chrI 7234 9016 YAL067C 1 - 6 0
chrI 7234 9016 YAL067C 1 - 7 0
chrI 7234 9016 YAL067C 1 - 8 0
chrI 7234 9016 YAL067C 1 - 9 0
chrI 7234 9016 YAL067C 1 - 10 0
chrI 7234 9016 YAL067C 1 - 11 0
chrI 7234 9016 YAL067C 1 - 12 0
chrI 7234 9016 YAL067C 1 - 13 0
chrI 7234 9016 YAL067C 1 - 14 0
chrI 7234 9016 YAL067C 1 - 15 0
chrI 7234 9016 YAL067C 1 - 16 0
chrI 7234 9016 YAL067C 1 - 17 1
chrI 7234 9016 YAL067C 1 - 18 1
chrI 7234 9016 YAL067C 1 - 19 1
chrI 7234 9016 YAL067C 1 - 20 1
chrI 7234 9016 YAL067C 1 - 21 1
chrI 7234 9016 YAL067C 1 - 22 1
chrI 7234 9016 YAL067C 1 - 23 1
chrI 7234 9016 YAL067C 1 - 24 1
chrI 7234 9016 YAL067C 1 - 25 1
chrI 7234 9016 YAL067C 1 - 26 1
chrI 7234 9016 YAL067C 1 - 27 1
chrI 7234 9016 YAL067C 1 - 28 1
chrI 7234 9016 YAL067C 1 - 29 1
chrI 7234 9016 YAL067C 1 - 30 1
chrI 7234 9016 YAL067C 1 - 31 1
chrI 7234 9016 YAL067C 1 - 32 1
chrI 7234 9016 YAL067C 1 - 33 1
chrI 7234 9016 YAL067C 1 - 34 1
chrI 7234 9016 YAL067C 1 - 35 1
chrI 7234 9016 YAL067C 1 - 36 1
chrI 7234 9016 YAL067C 1 - 37 1
chrI 7234 9016 YAL067C 1 - 38 2
chrI 7234 9016 YAL067C 1 - 39 2
chrI 7234 9016 YAL067C 1 - 40 2
chrI 7234 9016 YAL067C 1 - 41 3
chrI 7234 9016 YAL067C 1 - 42 3
chrI 7234 9016 YAL067C 1 - 43 3
chrI 7234 9016 YAL067C 1 - 44 3
chrI 7234 9016 YAL067C 1 - 45 4
chrI 7234 9016 YAL067C 1 - 46 4
chrI 7234 9016 YAL067C 1 - 47 4
chrI 7234 9016 YAL067C 1 - 48 4
chrI 7234 9016 YAL067C 1 - 49 4
chrI 7234 9016 YAL067C 1 - 50 4 |
A proper bedGraph file has only 4 columns: chrom start end value and does not need to include positions with 0 reads, so we'll convert the bedtools coverage output to bedGraph using awk. We re-sort the output so that plus and minus strand positions are adjacent.
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{FS=OFS="\t"}
{if ($8>0) {print $1,$2-1+$7,$2+$7,$8}}' | \
sort -k1,1 -k2,2n -k3,3n > yeast_mrna.gene_coverage.almost.bedGraph
wc -l yeast_mrna.gene_coverage.almost.bedGraph # 5,710,186 -- better, but still big |
While we probably could consider this file to have bedGraph format, it's preferable to combine adjacent per-base coordinates with the same count into larger regions, e.g.
Code Block |
---|
# per-base counts
chrI 7271 7272 2
chrI 7272 7273 2
chrI 7273 7274 2
chrI 7274 7275 3
chrI 7275 7276 3
chrI 7276 7277 3
chrI 7277 7278 3
# corresponding region counts
chrI 7271 7274 6
chrI 7274 7278 12 |
Here's some awk to do this:
Code Block |
---|
|
cat yeast_mrna.gene_coverage.almost.bedGraph | awk '
BEGIN{FS=OFS="\t"; chr=""; start=-1; end=-1; count=0}
{if (chr != $1) { # new contig; finish previous
if (count > 0) { print chr,start,end,count }
chr=$1; start=$2; end=$3; count=$4
} else if (($2==end || $2==end+1) && ($4==count)) { # same or adjacent position with same count
end=$3;
} else { # new region on same contig; finish prev
if (count > 0) { print chr,start,end,count}
start=$2; end=$3; count=$4
}
}
END{ # finish last
if (count > 0) { print chr,start,end,count }
}' > yeast_mrna.gene_coverage.bedGraph
wc -l yeast_mrna.gene_coverage.bedGraph # 1,048,510 -- much better! |
Make sure the total counts match!
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{tot=0}{tot=tot+$8}END{print tot}' # should be 86703686
cat yeast_mrna.gene_coverage.almost.bed | awk '
BEGIN{tot=0}{tot=tot+$4}END{print tot}' # should also be 86703686
cat yeast_mrna.gene_coverage.bedGraph | awk '
BEGIN{tot=0}{tot=tot+$4*($3-$2)}END{print tot}' # should also be 86703686 |
Now our yeast_mrna.gene_coverage.bedGraph file is a proper bedGraph, whose first lines look like this:
Code Block |
---|
chrI 7250 7271 1
chrI 7271 7274 2
chrI 7274 7278 3
chrI 7278 7310 4
chrI 7310 7317 3
chrI 7317 7349 2
chrI 7349 7353 1
chrI 7500 7556 1
chrI 8851 8891 1
chrI 11919 11951 1 |
x
A brief introduction to bedtools
...
Expand |
---|
title | click here to see the code and output |
---|
|
Code Block |
---|
language | bash |
---|
title | solution code |
---|
| module load bedtools
bedtools bamtofastq -i yeast_pairedend_sort.mapped.q1.bam -fq yeast_pairedend_sort.mapped.q1.fastq #takes 1-2 minutes
more yeast_pairedend_sort.mapped.q1.fastq |
Here is an example of two sequences (and their corresponding quality scores):
Code Block |
---|
language | bash |
---|
title | two lines of a fastq file |
---|
| @HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659
TACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
+
CCCFFFFFHHHHHJJJJJIIJJJJIJJIJJIJJIIIIJJJIJJIJJIJJIJJJJJJJJJJIIGGIGEGAEHFEFFEFFFDEEE@CCEDCDDD>ACBBDCA@
@HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862
TAGGGTAAGTTTGAGATGGTATATACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAA
+
?B@DF2ADHFHHFHJIIIGCIHIGGIJJJJGHIIIGIJEHHIGGGAHEGGFGHIECGIJIIIJIIIIIJJJJJJE>EHDHEEEBCDD?CDDBDDDDDDCDB |
As we discussed earlier, the top line is the identifier for the sequence produced, the second line defines which bases were produced, the third line indicates the strand the sequence is aligned to, and the fourth line indicates the ASCII based quality scores for each character in the second line. |
...
When we originally examined the bed files produced from our BAM file, we can see many reads that overlap over the same interval. While this level of detail is useful, for some analyses, we can collapse each read into a single line, and indicate how many reads occurred over that genomic interval. We can accomplish this using bedtools merge.
Code Block |
---|
|
bedtools merge [OPTIONS] -i experiment.bed > experiment.merge.bed |
Bedtools merge also directs the output to standard out, to make sure to point the output to a file or a program. While we haven't discussed the options for each bedtools function in detail, here they are very important. Many of the options define what to do with each column (-c) of the output (-o). This defines what type of operation to perform on each column, and in what order to output the columns. Standard bed6 format is chrom, start, stop, name, score, strand and controlling column operations allows you to control what to put into each column of output. The valid operations defined by the -o operation are as follows:
Expand |
---|
title | valid operations using the -o option |
---|
|
- sum, min, max, absmin, absmax,
- mean, median,
- collapse (i.e., print a delimited list (duplicates allowed)),
- distinct (i.e., print a delimited list (NO duplicates allowed)),
- count
- count_distinct (i.e., a count of the unique values in the column)
|
For this exercise, we'll be summing the number of reads over a region to get a score column, using distinct to choose a name, and using distinct again to keep track of the strand. For the -c options, define which columns to operate on, in the order you want the output. In this case, to keep the standard bed format, we'll list as -c 5,4,6 and -o count_distinct,sum,distinct, to keep the proper order of name, score, strand. Strandedness can also be controlled with merge, using the -s option.
...
Code Block |
---|
language | bash |
---|
title | using awk for column reordering |
---|
|
#after the creation of the first file
cat yeast_pairedend_sort.mapped.q1.merge.bed | awk '{print $1 "\t" $2 "\t" $3 "\t" $5 "\t" $6 "\t" $4}' > yeast_pairedend_sort.mapped.q1.merge.reorder.bed
#piped in-line
bedtools merge -s -c 4,5 -o count_distinct,sum -i yeast_pairedend_sort.mapped.q1.bed | awk '{print $1 "\t" $2 "\t" $3 "\t" $5 "\t" $6 "\t" $4}' > yeast_pairedend_sort.mapped.q1.merge.bed |
...
Expand |
---|
title | click here to see my output on copying files |
---|
|
Code Block |
---|
| -rwxrwxr-x 1 awh394 G-801021 1.9M May 22 20:41 gencode.v19.genes.sort.merge.final
-rwxrwxr-x 1 awh394 G-801021 646K May 22 20:41 gencode.v19.proteincoding.genes.sort.merge.final |
|
...
Code Block |
---|
language | bash |
---|
title | copy some gencode files over |
---|
|
cd $SCRATCH/core_ngs
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs.test/closest/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs.test/closest/gencode.v19.genes.sort.merge.final . |
...
Expand |
---|
language | bash |
---|
title | click here for the bedtools subtract code and output |
---|
|
My output is commented in this code block. Code Block |
---|
| cd subtract
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.not.notproteincodingproteincoding.genes.bed
wc -l gencode.v19.not.proteincoding.genes.bed
#23483 gencode.v19.not.notproteincodingproteincoding.genes.bed
more gencode.v19.not.proteincoding.genes.bed
#chr1 11869 14412 DDX11L1 . +
#chr1 14363 29806 WASH7P . -
#chr1 29554 31109 MIR1302-11 . +
#chr1 34554 36081 FAM138A . -
#chr1 52473 54936 OR4G4P . +
#chr1 62948 63887 OR4G11P . + |
|
...
We can do this filtering on the hg19_rnaseq_mirnaseq_intersect.bed
file we just created using bedtools intersect.
Code Block |
---|
|
cd $SCRATCH/core_ngs/intersect/
cat hg19_rnaseq_mirnaseq_intersect.bed | awk 'BEGIN{FS="\t";OFS="\t";}{if ($6 == "+"){print}}' | more
|
You could also insist on columns 6 and 12 both being the plus strand as such:
Code Block |
---|
|
cd $SCRATCH/core_ngs/intersect/
cat hg19_rnaseq_mirnaseq_intersect.bed | awk 'BEGIN{FS="\t";OFS="\t";}{if ($6 == "+" && $12 == "+"){print}}' | more
|
...