...
Login to ls6, start and idev session, then load the BioContainers bedtools module, and then check its version.
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri # or -A TRA23004 # or idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 module load biocontainers module load bedtools bedtools --version # should be bedtools v2.27.1 |
Input format considerations
- Most BEDTools functions now accept either BAM or BED files as input.
- BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
- When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
- All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).
...
Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded (; however note that most RNA-seq libraries these days, including ones prepared by GSAF, are antisense stranded).
If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.
...
- seqname - The name of the chromosome or contig.
- source - Name of the program that generated this feature, or other data source (e.g. public database)
- feature_type - Type of the feature, for example:
- chromosome
- CDS (coding sequence), exon
- gene, transcript
- start_codon, stop_codon
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A numeric value. Often but not always an integer. Meaning differs and not usually important.
- strand - Defined as + (forward), - (reverse), or . (no relevant strand)
- frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
|
...
One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at piping a histogram)
Expand | |||||
---|---|---|---|---|---|
| |||||
|
Read more about what's going on here at piping a histogram.
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/bedtools cat sacCer3.R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \ sort | uniq -c | sort -k1,1nr | more |
...
Code Block |
---|
chrI 334 649 YAL069W 315 + YAL069W Dubious chrI 537 792 YAL068W-A 255 255 + YAL068W-A Dubious chrI 1806 2169 YAL068C 363 - PAU8 Verified chrI 2479 2707 YAL067W-A 228 + YAL067W-A Uncharacterized chrI 7234 9016 YAL067C 1782 - SEO1 Verified chrI 10090 10399 YAL066W 309 + YAL066W Dubious chrI 11564 11951 YAL065C 387 - YAL065C Uncharacterized chrI 12045 12426 YAL064W-B 381 + YAL064W-B Uncharacterized chrI 13362 13743 YAL064C-A 381 - YAL064C-A Uncharacterized chrI 21565 21850 YAL064W 285 + YAL064W Verified chrI 22394 22685 YAL063C-A 291 - YAL063C-A Uncharacterized chrI 23999 27968 YAL063C 3969 - FLO9 Verified chrI 31566 32940 YAL062W 1374 + GDH3 Verified chrI 33447 34701 YAL061W 1254 + BDH2 Uncharacterized chrI 35154 36303 YAL060W 1149 + BDH1 Verified chrI 36495 36918 YAL059C-A 423 - YAL059C-A Dubious chrI 36508 37147 YAL059W 639 + ECM1 Verified chrI 37463 38972 YAL058W 1509 + CNE1 Verified chrI 38695 39046 YAL056C-A 351 - YAL056C-A Dubious chrI 39258 41901 YAL056W 2643 2643 + GPB2 Verified |
Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.
...
Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
...
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:
- the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
...
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
Code Block |
---|
2-micron 251 1523 + 1 R0010W 2-micron 1886 3008 - 1 R0020C 2-micron 3270 3816 + 1 R0030W 2-micron 5307 6198 - 1 R0040C chrI 334 792 + 2 YAL069W,YAL068W-A chrI 1806 2169 - 1 YAL068C chrI 2479 2707 + 1 YAL067W-A chrI 7234 9016 - 1 YAL067C chrI 10090 10399 + 1 YAL066W chrI 11564 11951 - 1 YAL065C |
As we specified:
- Output column 4 has the region's strand.
- Column 5 is the count of merged regions
...
- Column 6 is a collapsed, comma-separated list of the merged gene names
...
Exercise: Compare the number of regions in the merged and before-merge gene files.
...
Expand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||
Output column 5 has the gene count.
Produces this histogram:
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.gene contributed. Or being fancy:
|
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
...
To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).
...
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri # or -A TRA23004 # or idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
...
Expand | |||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities:
To compute average mapping quality:
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...
Expand | |||||
---|---|---|---|---|---|
| |||||
|
...
Expand | |||||
---|---|---|---|---|---|
| |||||
6485 records were written, one for each feature in the merged.sc_genes.bed file. The overlap count was added as the last field in each output record (. So here it is field 7 , since the input annotation file had 6 columns). |
Exercise: How many features have non-zero overlap counts?
...
A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).
- Go to the UCSC Genome Browser: https://genome.ucsc.edu/
- Select Genomes from the top menu bar
- Select Human from POPULAR SPECIES
- under Human Assembly select Feb 2009 (GrCh37/hg19)
- select GO
- In the hg19 browser page, the Layered H3K27Ac track is a signal track
- the x-axis is the genome position
- the y-axis represents the count of ChIP-seq reads that overlap each position
- where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)
...
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5 # or -A TRA23004 # or idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 module load biocontainers module load bedtools mkdir -p $SCRATCH/core_ngs/bedtools_genomecov cd $SCRATCH/core_ngs/bedtools_genomecov cp $CORENGS/catchup/bedtools_merge/merged*bed . cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools_genomecov bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines |
...
Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:
...
Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed(see Tips and tricks#perlpatternsubstitution):
Code Block | ||
---|---|---|
| ||
module load ucsc_tools cd $SCRATCH/core_ngs/bedtools_genomecov bedGraphToBigWig # look at its usage # create the needed chromosome sizes file from our BAM header module load samtools samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \ perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt cat sc_chrom_sizes.txt # displays: chrI 230218 chrII 813184 chrIII 316620 chrIV 1531933 chrV 576874 chrVI 270161 chrVII 1090940 chrVIII 562643 chrIX 439888 chrX 745751 chrXI 666816 chrXII 1078177 chrXIII 924431 chrXIV 784333 chrXV 1091291 chrXVI 948066 chrM 85779 |
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C # may or may not need this...
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw |
...