Page Comparison

...

Expand

title	click here to see my output on copying files

Code Block

language	bash

-rwxrwxrwxrwxrwxr-x 1 awh394 G-801021 1.9M May 1722 0920:3041 gencode.v19.genes.sort.merge.final
-rwxrwxr-rwxrwxrwxx 1 awh394 G-801021 646K May 1722 0920:3041 gencode.v19.proteincoding.genes.sort.merge.final

Exercise 6: Identify the closest protein coding genes (on the same strand) for the human rnaseq file using closest, then sort by the distance column (largest, then smallest distance first).

Expand

language	bash
title	click here for the bedtools closest code and output

My output is commented in this code block.

Code Block

language	bash

cd intersectclosest 
module load bedtools #if you haven't loaded it up yet this session
sort -k1,1 -k2,2n human_rnaseq_bwa_sort.mapped.q1.merge.bed > human_rnaseq_bwa.mapped.q1.merge.sort.bed #need to sort both files to the same order
bedtools closest -s -d -a human_rnaseq_bwa_sort.mapped.q1.merge.sort.bed -b gencode.v19.proteincoding.genes.sort.merge.final > hg19_rnaseq_protcode_closest.bed 

wc -l hg19_rnaseq_protcode_closest.bed
#XXX#7134 hg19_rnaseq_protcode_closest.bed  #same more hg19_rnaseq_protcode_closest.bed

sort -k12,12nrlength as the original file

more hg19_rnaseq_protcode_closest.bed
> hg19_rnaseq_protcode_closest.sort.bed

more hg19_rnaseq_protcode_closest.sort.bed

...

#chr1    880458    880529    1    37    +    chr1    860260    879955    SAMD11    .    +    504
#chr1    881549    881650    1    37    -    chr1    879584    894689    NOC2L     .    -    0
#chr1    887884    887985    1    37    +    chr1    860260    879955    SAMD11    .    +    7930
#chr1    892309    892410    1    37    -    chr1    879584    894689    NOC2L     .    -    0
#chr1    892475    892576    1    23    +    chr1    895967    901095    KLHL17    .    +    3392

#sort by the distance to a gene, longest distances first
sort -k13,13nr hg19_rnaseq_protcode_closest.bed | more 

#sort by the distance to a gene, shortest distances first
sort -k13,13n hg19_rnaseq_protcode_closest.bed | more

This is a nice way to examine your reads over annotated protein-coding genes. Note the strand specificity - only reads on the correct strand will be reported when there is a + strand gene and a - strand gene over the same location.

bedtools subtract: removing features from your bed file

Bedtools subtract takes an A file and a B file, then searches for features in B that overlap A. When/if these features are identified, the overlapping portion is removed from A and the remaining portion of A is reported. If a feature in B overlaps all of a feature in A, that feature will not be reported.

Code Block

language	bash
title	bedtools subtract options

bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here. However, you can use cat to join together features you'd like to subtract from your A file.

Expand

title	bedtools closest options

f: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
F: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
e: Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
S: Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand
A: Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed.
N: Same as -A except when used with -f, the amount is the sum of all features (not any single feature)

Let's do a little set-up for the next exercise:

Code Block

language	bash
title	copy some gencode files over

cds
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.genes.sort.merge.final .

Exercise 7: remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries

Expand

language	bash
title	click here for the bedtools subtract code and output

My output is commented in this code block.

Code Block

language	bash

cd subtract 
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -s -d -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.not.proteincoding.genes.bed

wc -l gencode.v19.not.proteincoding.genes.bed
#XXX gencode.v19.not.proteincoding.genes.bed

more gencode.v19.not.proteincoding.genes.bed

###commentary on the output and exerciseA, that feature will not be reported.

Code Block

language	bash
title	bedtools subtract options

bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here. However, you can use cat to join together features you'd like to subtract from your A file.

Expand

title	bedtools closest options

f: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
F: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
e: Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
S: Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand
A: Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed.
N: Same as -A except when used with -f, the amount is the sum of all features (not any single feature)

Let's do a little set-up for the next exercise:

Code Block

language	bash
title	copy some gencode files over

cd $SCRATCH/core_ngs
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.genes.sort.merge.final .

Exercise 7: remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries

This allows you to identify which gene regions are not protein coding, and are likely pseudogenes, but could also be miRNAs, snRNAs or other genes that aren't translated into a peptide sequence.

Expand

language	bash
title	click here for the bedtools subtract code and output

My output is commented in this code block.

Code Block

language	bash

cd subtract 
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.notproteincoding.genes.bed

wc -l gencode.v19.not.proteincoding.genes.bed
#23483 gencode.v19.notproteincoding.genes.bed

more gencode.v19.not.proteincoding.genes.bed
#chr1    11869    14412    DDX11L1    .    +
#chr1    14363    29806    WASH7P    .    -
#chr1    29554    31109    MIR1302-11    .    +
#chr1    34554    36081    FAM138A    .    -
#chr1    52473    54936    OR4G4P    .    +
#chr1    62948    63887    OR4G11P    .    +

While the above example is not super useful in all cases, one might use the above workflow to remove genes that aren't of interest from a larger set.

A little bit of filtering, using awk

...

Versions Compared

Old Version 103

New Version 104

Key

bedtools subtract: removing features from your bed file

A little bit of filtering, using awk