Reference Protein-Coding Sequence
12.9 years ago
Not Durrett ▴ 10

Hello,

Where can I find a (or the canonical) collection of "reference protein-coding sequences" for mouse (and/or S. pombe)?

For context, I am trying to build the Oracle Set described in the recent Nature Biotechnology paper on Trinity (http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.1883.html). From the paper:

"We next estimated the upper sensitivity limit for which annotated transcripts can possibly be perfectly reconstructed given a particular data set of sequences. Any assembly approach based on a particular k-length oligomer is limited to those sequences that are represented by the exact k-mer composition of the RNA-Seq read set. To determine this empirical upper sensitivity limit, we built a k-mer dictionary from all the reads and identified all known reference protein-coding sequences that are reconstructable to full length given the read set, as those sequences that can be populated by adjacent and overlapping k-mers across their entire length. We call this set of sequences the 'Oracle Set'. Because this set also contains transcript sequences that are covered by k-mers, but not entire reads, some transcripts will appear reconstructable but are not. Conversely, the Oracle Set reflects only annotated known genes and known isoforms, which are likely an underestimate, especially in mammals [16]. Nevertheless, the Oracle Set provides a useful sensitivity benchmark."
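
As I read that passage, the procedure could be sketched roughly as below. This is only a minimal sketch under my own assumptions, not the paper's code: the function names are hypothetical, reads and reference sequences are plain strings already loaded into memory, k is left as a parameter, and reverse complements are ignored (real RNA-Seq data would probably need them).

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_dictionary(reads, k):
    """Collect the set of all k-mers that occur in any read."""
    dictionary = set()
    for read in reads:
        dictionary.update(kmers(read.upper(), k))
    return dictionary


def oracle_set(references, reads, k):
    """Return the names of reference sequences whose every k-mer occurs in the reads.

    references: dict mapping a sequence name to its nucleotide string.
    Assumption: only the forward strand is considered.
    """
    dictionary = build_kmer_dictionary(reads, k)
    selected = []
    for name, seq in references.items():
        seq = seq.upper()
        # A sequence shorter than k has no k-mers, so it is skipped.
        if len(seq) >= k and all(km in dictionary for km in kmers(seq, k)):
            selected.append(name)
    return selected


# Toy data (made up for illustration only).
toy_references = {"tx1": "ACGTACGG", "tx2": "ACGTTTTT"}
toy_reads = ["ACGTAC", "GTACGG"]
toy_result = oracle_set(toy_references, toy_reads, k=4)
print(toy_result)
```

In the toy data, tx1 is kept even though no single read spans it; it is covered only by k-mers from two overlapping reads, which is the caveat the paper mentions about transcripts that appear reconstructable.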

Thanks in advance for any help! (And I apologize if this is a stupid question - I am just starting out in bioinformatics research.)

reference sequence protein rna • 2.5k views

12.9 years ago

2 Links for you:

Hey, why a negative vote on this one? This is a perfectly valid answer!

indeed a bit peculiar ;)

12.9 years ago
Lyco ★ 2.3k

I haven't read this particular paper, but most people mean the RefSeq database when they say 'reference sequence', especially for mammalian sequences. So RefSeq would be the best bet for the mouse reference sequences. Alternatively, 'reference sequence' can mean the sequence published by the associated genome project; this is often the case for bacteria or simple eukaryotes. For S. pombe, this would probably be the version at the Sanger Centre.

12.9 years ago

Thanks (very late, I know) to both of you for the answers - just what I was looking for.

[I know that it is inappropriate to post this as an answer. I asked my question on a public lab computer without logging in, and I am not able to comment as a new user.]

[argh, can't delete it now.]

-OP

If this answers your question, you could 'close' this thread by accepting one of the answers.
