...
Expand |
---|
title | click here to see my output on copying files |
---|
|
Code Block |
---|
| -rwxrwxrwxrwxrwxr-x 1 awh394 G-801021 1.9M May 1722 0920:3041 gencode.v19.genes.sort.merge.final
-rwxrwxr-rwxrwxrwxx 1 awh394 G-801021 646K May 1722 0920:3041 gencode.v19.proteincoding.genes.sort.merge.final |
|
Exercise 6: Identify the closest protein coding genes (on the same strand) for the human rnaseq file using closest, then sort by the distance column (largest, then smallest distance first).
Expand |
---|
language | bash |
---|
title | click here for the bedtools closest code and output |
---|
|
My output is commented in this code block. Code Block |
---|
| cd intersectclosest
module load bedtools #if you haven't loaded it up yet this session
sort -k1,1 -k2,2n human_rnaseq_bwa_sort.mapped.q1.merge.bed > human_rnaseq_bwa.mapped.q1.merge.sort.bed #need to sort both files to the same order
bedtools closest -s -d -a human_rnaseq_bwa_sort.mapped.q1.merge.sort.bed -b gencode.v19.proteincoding.genes.sort.merge.final > hg19_rnaseq_protcode_closest.bed
wc -l hg19_rnaseq_protcode_closest.bed
#XXX#7134 hg19_rnaseq_protcode_closest.bed #same more hg19_rnaseq_protcode_closest.bed
sort -k12,12nrlength as the original file
more hg19_rnaseq_protcode_closest.bed
> hg19_rnaseq_protcode_closest.sort.bed
more hg19_rnaseq_protcode_closest.sort.bed |
|
...
#chr1 880458 880529 1 37 + chr1 860260 879955 SAMD11 . + 504
#chr1 881549 881650 1 37 - chr1 879584 894689 NOC2L . - 0
#chr1 887884 887985 1 37 + chr1 860260 879955 SAMD11 . + 7930
#chr1 892309 892410 1 37 - chr1 879584 894689 NOC2L . - 0
#chr1 892475 892576 1 23 + chr1 895967 901095 KLHL17 . + 3392
#sort by the distance to a gene, longest distances first
sort -k13,13nr hg19_rnaseq_protcode_closest.bed | more
#sort by the distance to a gene, shortest distances first
sort -k13,13n hg19_rnaseq_protcode_closest.bed | more
|
|
This is a nice way to examine your reads over annotated protein-coding genes. Note the strand specificity - only reads on the correct strand will be reported when there is a + strand gene and a - strand gene over the same location.
bedtools subtract: removing features from your bed file
Bedtools subtract takes an A file and a B file, then searches for features in B that overlap A. When/if these features are identified, the overlapping portion is removed from A and the remaining portion of A is reported. If a feature in B overlaps all of a feature in A, that feature will not be reported.
Code Block |
---|
language | bash |
---|
title | bedtools subtract options |
---|
|
bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF> |
Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here. However, you can use cat to join together features you'd like to subtract from your A file.
Expand |
---|
title | bedtools closest options |
---|
|
- f: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
- F: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
- r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
e: Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand. S: Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand - A: Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or
-f amount), the entire feature is removed. N: Same as -A except when used with -f, the amount is the sum of all features (not any single feature)
|
Let's do a little set-up for the next exercise:
Code Block |
---|
language | bash |
---|
title | copy some gencode files over |
---|
|
cds
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.genes.sort.merge.final . |
Exercise 7: remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries
Expand |
---|
language | bash |
---|
title | click here for the bedtools subtract code and output |
---|
|
My output is commented in this code block. Code Block |
---|
| cd subtract
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -s -d -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.not.proteincoding.genes.bed
wc -l gencode.v19.not.proteincoding.genes.bed
#XXX gencode.v19.not.proteincoding.genes.bed
more gencode.v19.not.proteincoding.genes.bed
|
|
###commentary on the output and exerciseA, that feature will not be reported.
Code Block |
---|
language | bash |
---|
title | bedtools subtract options |
---|
|
bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF> |
Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here. However, you can use cat to join together features you'd like to subtract from your A file.
Expand |
---|
title | bedtools closest options |
---|
|
- f: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
- F: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
- r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
e: Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand. S: Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand - A: Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or
-f amount), the entire feature is removed. N: Same as -A except when used with -f, the amount is the sum of all features (not any single feature)
|
Let's do a little set-up for the next exercise:
Code Block |
---|
language | bash |
---|
title | copy some gencode files over |
---|
|
cd $SCRATCH/core_ngs
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.genes.sort.merge.final . |
Exercise 7: remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries
This allows you to identify which gene regions are not protein coding, and are likely pseudogenes, but could also be miRNAs, snRNAs or other genes that aren't translated into a peptide sequence.
Expand |
---|
language | bash |
---|
title | click here for the bedtools subtract code and output |
---|
|
My output is commented in this code block. Code Block |
---|
| cd subtract
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.notproteincoding.genes.bed
wc -l gencode.v19.not.proteincoding.genes.bed
#23483 gencode.v19.notproteincoding.genes.bed
more gencode.v19.not.proteincoding.genes.bed
#chr1 11869 14412 DDX11L1 . +
#chr1 14363 29806 WASH7P . -
#chr1 29554 31109 MIR1302-11 . +
#chr1 34554 36081 FAM138A . -
#chr1 52473 54936 OR4G4P . +
#chr1 62948 63887 OR4G11P . + |
|
While the above example is not super useful in all cases, one might use the above workflow to remove genes that aren't of interest from a larger set.
A little bit of filtering, using awk
...