Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleclick here to see my output on copying files
Code Block
languagebash
-rwxrwxrwxrwxrwxr-x 1 awh394 G-801021 1.9M May 1722 0920:3041 gencode.v19.genes.sort.merge.final
-rwxrwxr-rwxrwxrwxx 1 awh394 G-801021 646K May 1722 0920:3041 gencode.v19.proteincoding.genes.sort.merge.final

Exercise 6: Identify the closest protein coding genes (on the same strand) for the human rnaseq file using closest, then sort by the distance column (largest, then smallest distance first).

Expand
languagebash
titleclick here for the bedtools closest code and output

My output is commented in this code block.

Code Block
languagebash
cd intersectclosest 
module load bedtools #if you haven't loaded it up yet this session
sort -k1,1 -k2,2n human_rnaseq_bwa_sort.mapped.q1.merge.bed > human_rnaseq_bwa.mapped.q1.merge.sort.bed #need to sort both files to the same order
bedtools closest -s -d -a human_rnaseq_bwa_sort.mapped.q1.merge.sort.bed -b gencode.v19.proteincoding.genes.sort.merge.final > hg19_rnaseq_protcode_closest.bed 

wc -l hg19_rnaseq_protcode_closest.bed
#XXX#7134 hg19_rnaseq_protcode_closest.bed  #same more hg19_rnaseq_protcode_closest.bed

sort -k12,12nrlength as the original file

more hg19_rnaseq_protcode_closest.bed
> hg19_rnaseq_protcode_closest.sort.bed

more hg19_rnaseq_protcode_closest.sort.bed

...

#chr1    880458    880529    1    37    +    chr1    860260    879955    SAMD11    .    +    504
#chr1    881549    881650    1    37    -    chr1    879584    894689    NOC2L     .    -    0
#chr1    887884    887985    1    37    +    chr1    860260    879955    SAMD11    .    +    7930
#chr1    892309    892410    1    37    -    chr1    879584    894689    NOC2L     .    -    0
#chr1    892475    892576    1    23    +    chr1    895967    901095    KLHL17    .    +    3392

#sort by the distance to a gene, longest distances first
sort -k13,13nr hg19_rnaseq_protcode_closest.bed | more 

#sort by the distance to a gene, shortest distances first
sort -k13,13n hg19_rnaseq_protcode_closest.bed | more

This is a nice way to examine your reads over annotated protein-coding genes.  Note the strand specificity - only reads on the correct strand will be reported when there is a + strand gene and a - strand gene over the same location.

bedtools subtract:  removing features from your bed file

Bedtools subtract takes an A file and a B file, then searches for features in B that overlap A.  When/if these features are identified, the overlapping portion is removed from A and the remaining portion of A is reported.  If a feature in B overlaps all of a feature in A, that feature will not be reported.

Code Block
languagebash
titlebedtools subtract options
bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here.  However, you can use cat to join together features you'd like to subtract from your A file.

Expand
titlebedtools closest options
  • f:    Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
  • F:   Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
  • r:    Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
  • e:   Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.

  • S:   Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand

  • A:   Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed.
  • N:   Same as -A except when used with -f, the amount is the sum of all features (not any single feature)

Let's do a little set-up for the next exercise:

Code Block
languagebash
titlecopy some gencode files over
cds
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/gencode.v19.genes.sort.merge.final .

Exercise 7:  remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries

Expand
languagebash
titleclick here for the bedtools subtract code and output

My output is commented in this code block.

Code Block
languagebash
cd subtract 
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -s -d -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.not.proteincoding.genes.bed

wc -l gencode.v19.not.proteincoding.genes.bed
#XXX gencode.v19.not.proteincoding.genes.bed

more gencode.v19.not.proteincoding.genes.bed

###commentary on the output and exerciseA, that feature will not be reported.

Code Block
languagebash
titlebedtools subtract options
bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

Note that bedtools subtract is performed on two files, and unlike some of the other utilities we've used, you can't use multiple B features here.  However, you can use cat to join together features you'd like to subtract from your A file.

Expand
titlebedtools closest options
  • f:    Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
  • F:   Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
  • r:    Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
  • e:   Require that the minimum fraction be satisfied for A _OR_ B. In other words, if -e is used with -f 0.90 and -F 0.10 this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.**-s** Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.

  • S:   Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand

  • A:   Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed.
  • N:   Same as -A except when used with -f, the amount is the sum of all features (not any single feature)

Let's do a little set-up for the next exercise:

Code Block
languagebash
titlecopy some gencode files over
cd $SCRATCH/core_ngs
mkdir subtract
cd subtract
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.proteincoding.genes.sort.merge.final .
cp /scratch/01786/awh394/core_ngs/closest/gencode.v19.genes.sort.merge.final .

Exercise 7:  remove the protein-coding genes from a gencode list of genes using subtract, then give a count of the non-protein-coding gene entries

This allows you to identify which gene regions are not protein coding, and are likely pseudogenes, but could also be miRNAs, snRNAs or other genes that aren't translated into a peptide sequence.

Expand
languagebash
titleclick here for the bedtools subtract code and output

My output is commented in this code block.

Code Block
languagebash
cd subtract 
module load bedtools #if you haven't loaded it up yet this session
bedtools subtract -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.notproteincoding.genes.bed

wc -l gencode.v19.not.proteincoding.genes.bed
#23483 gencode.v19.notproteincoding.genes.bed

more gencode.v19.not.proteincoding.genes.bed
#chr1    11869    14412    DDX11L1    .    +
#chr1    14363    29806    WASH7P    .    -
#chr1    29554    31109    MIR1302-11    .    +
#chr1    34554    36081    FAM138A    .    -
#chr1    52473    54936    OR4G4P    .    +
#chr1    62948    63887    OR4G11P    .    +

While the above example is not super useful in all cases, one might use the above workflow to remove genes that aren't of interest from a larger set.

A little bit of filtering, using awk

...