Where can I find a (or the canonical) collection of "reference protein-coding sequences" for mouse (and/or S. pombe)?
For context, I am trying to build the Oracle Set described in the recent Nature Biotechnology paper on Trinity (http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.1883.html). From the paper:
"We next estimated the upper sensitivity limit for which annotated transcripts can possibly be perfectly reconstructed given a particular data set of sequences. Any assembly approach based on a particular k-length oligomer is limited to those sequences that are represented by the exact k-mer composition of the RNA-Seq read set. To determine this empirical upper sensitivity limit, we built a k-mer dictionary from all the reads and identified all known reference protein-coding sequences that are reconstructable to full length given the read set, as those sequences that can be populated by adjacent and overlapping k-mers across their entire length. We call this set of sequences the 'Oracle Set'. Because this set also contains transcript sequences that are covered by k-mers, but not entire reads, some transcripts will appear reconstructable but are not. Conversely, the Oracle Set reflects only annotated known genes and known isoforms, which are likely an underestimate, especially in mammals [16]. Nevertheless, the Oracle Set provides a useful sensitivity benchmark."
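To make clear what I am after, here is a minimal sketch of how I currently read that procedure. It is my own interpretation, not the authors' code. The function names (`kmers`, `build_kmer_set`, `oracle_set`) are made up for illustration, `k` is left as a parameter, and I am assuming the reads are on the same strand as the reference sequences, so reverse complements are not added to the dictionary. The reference sequences would come from whatever collection the answer to my question points to.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(reads, k):
    """Collect all k-mers seen in any read (the 'k-mer dictionary')."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return seen


def oracle_set(references, reads, k):
    """Return names of reference sequences whose every k-mer occurs in the reads.

    references: dict mapping a sequence name to its nucleotide string.
    Assumption: reads and references are on the same strand; sequences
    shorter than k are skipped because they contain no k-mer to check.
    """
    read_kmers = build_kmer_set(reads, k)
    return {
        name
        for name, seq in references.items()
        if len(seq) >= k and all(km in read_kmers for km in kmers(seq, k))
    }


if __name__ == "__main__":
    # Toy data, not real sequences.
    reads = ["ACGTAC", "GTACGG"]
    references = {"t1": "ACGTACGG", "t2": "ACGTTT"}
    print(oracle_set(references, reads, k=4))
```

In the toy data, `t1` is not contained in any single read but is still fully covered by k-mers from the two reads, which is the "appears reconstructable" caveat the paper mentions. For real data the reads and references would be parsed from FASTQ/FASTA files instead of being hard-coded.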
Thanks in advance for any help! (And I apologize if this is a stupid question - I am just starting out in bioinformatics research.)