...

Exercise 7: Remove the protein-coding genes from a GENCODE list of genes using subtract, then count the non-protein-coding gene entries

This lets you identify which gene regions are not protein coding. These are likely pseudogenes, but could also be miRNAs, snRNAs, or other genes that aren't translated into a peptide sequence.

bedtools subtract code and output (my output is shown as comments in the code block):

cd subtract
module load bedtools  # if you haven't loaded it yet this session
# Remove the portions of the -a features (all genes) that overlap -b features (protein-coding genes)
bedtools subtract -a gencode.v19.genes.sort.merge.final -b gencode.v19.proteincoding.genes.sort.merge.final > gencode.v19.notproteincoding.genes.bed

# Count the remaining (non-protein-coding) entries
wc -l gencode.v19.notproteincoding.genes.bed
#23483 gencode.v19.notproteincoding.genes.bed

# Page through the result
more gencode.v19.notproteincoding.genes.bed
#chr1    11869    14412    DDX11L1    .    +
#chr1    14363    29806    WASH7P    .    -
#chr1    29554    31109    MIR1302-11    .    +
#chr1    34554    36081    FAM138A    .    -
#chr1    52473    54936    OR4G4P    .    +
#chr1    62948    63887    OR4G11P    .    +

...

As a final note, yesterday we taught you about a lot of Unix utilities, including uniq, sort and cut. One last utility I'd like to add, which is very useful for manipulating these types of tab-delimited files, is awk. Awk isn't just a command, but rather a little text-manipulation language in its own right (which we briefly used above to rearrange the columns in a file). While awk can do many different things, here we'll primarily use it to filter tab-delimited files based on the values present in those files, for example to keep only entries on a given chromosome, or entries greater than or less than a given score. If your dataset is large, this type of filtering can be invaluable! Below is an example of a simple awk script:
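To illustrate the kind of filtering described here, this is a minimal sketch and not the original script from this page. It builds a small tab-delimited file from the six entries shown in the Exercise 7 output, then filters it with awk. The file names demo.bed and demo.plus.bed are hypothetical, and the 2000 bp cutoff is an arbitrary choice for the demonstration.

```shell
# Build a tiny 6-column BED file (hypothetical name: demo.bed).
# printf reuses the format string for each group of six arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 11869 14412 DDX11L1 . + \
  chr1 14363 29806 WASH7P . - \
  chr1 29554 31109 MIR1302-11 . + \
  chr1 34554 36081 FAM138A . - \
  chr1 52473 54936 OR4G4P . + \
  chr1 62948 63887 OR4G11P . + > demo.bed

# Keep only entries on chr1 that are on the plus strand.
# $1 is the chromosome column and $6 is the strand column;
# a pattern with no action prints the whole matching line.
awk -F'\t' '$1 == "chr1" && $6 == "+"' demo.bed > demo.plus.bed
cat demo.plus.bed

# Print the name and length of entries longer than 2000 bp
# (length = end - start, i.e. column 3 minus column 2).
awk -F'\t' -v OFS='\t' '($3 - $2) > 2000 {print $4, $3 - $2}' demo.bed
```

With this input, the first filter keeps the four plus-strand entries, and the second reports DDX11L1, WASH7P and OR4G4P. The same pattern works on a score column: swap in the column number that holds your score and the threshold you care about.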

...