...
Exercise 7: remove the protein-coding genes from a GENCODE list of genes using subtract, then give a count of the non-protein-coding gene entries
This lets you identify which gene regions are not protein-coding. These are likely pseudogenes, but could also be miRNAs, snRNAs, or other genes that aren't translated into a peptide sequence.
My output is commented in this code block.
...
As a final note, yesterday Nathan taught you about a lot of Unix utilities, including uniq, sort and cut. One last utility I'd like to add, which is very useful for manipulating these types of tab-delimited files, is awk. Awk isn't just a command, but rather a little text-manipulation language in its own right (we briefly used it above to rearrange the columns in a file). While awk can do many different things, here we'll primarily use it to filter tab-delimited files based on the values present in those files, for example to keep only the entries on a given chromosome, or only those with a score greater than or less than a given value. If your dataset is large, this type of filtering can be invaluable! Below is an example of a simple awk script:
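To make the idea of value-based filtering concrete, here is a minimal sketch on a tiny made-up file. The file name example.bed, its three entries, and the output file names are all hypothetical and only for illustration; the score is assumed to sit in column 5, as in the standard BED layout.

```shell
# Build a small, made-up BED file (tab-delimited: chrom, start, end, name, score, strand).
printf 'chr1\t100\t200\tgeneA\t50\t+\n'   > example.bed
printf 'chr2\t300\t400\tgeneB\t900\t-\n' >> example.bed
printf 'chr1\t500\t600\tgeneC\t700\t+\n' >> example.bed

# Keep only the entries on chr1.
# -F'\t' tells awk to split columns on tabs; a condition with no action prints the whole line.
awk -F'\t' '$1 == "chr1"' example.bed > chr1_only.bed

# Keep only the entries whose score (column 5) is greater than 500.
awk -F'\t' '$5 > 500' example.bed > high_score.bed

# Show what passed each filter.
cat chr1_only.bed
cat high_score.bed
```

Because each awk call prints matching lines unchanged, the filtered files keep the same tab-delimited layout and can be passed straight to other tools.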
...