Question

Reference Protein-Coding Sequence

1


12.9 years ago

Not Durrett ▴ 10

Hello,

Where can I find a (or the canonical) collection of "reference protein-coding sequences" for mouse and/or S. pombe?

For context, I am trying to build the Oracle Set described in the recent Nature Biotechnology paper on Trinity (http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.1883.html). From the paper:

We next estimated the upper sensitivity limit for which annotated transcripts can possibly be perfectly reconstructed given a particular data set of sequences. Any assembly approach based on a particular k-length oligomer is limited to those sequences that are represented by the exact k-mer composition of the RNA-Seq read set. To determine this empirical upper sensitivity limit, we built a k-mer dictionary from all the reads and identified all known reference protein-coding sequences that are reconstructable to full length given the read set, as those sequences that can be populated by adjacent and overlapping k-mers across their entire length. We call this set of sequences the 'Oracle Set'. Because this set also contains transcript sequences that are covered by k-mers, but not entire reads, some transcripts will appear reconstructable but are not. Conversely, the Oracle Set reflects only annotated known genes and known isoforms, which are likely an underestimate, especially in mammals [16]. Nevertheless, the Oracle Set provides a useful sensitivity benchmark.
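To make the quoted procedure concrete, here is a minimal Python sketch of the k-mer check as I read it. The function names, the toy sequences, and the default k=25 are illustrative assumptions, not code or parameters taken from the paper. It also ignores reverse complements and sequencing errors, which a real run would probably have to handle.

```python
def build_kmer_set(reads, k):
    """Collect every k-mer that occurs in any read (the 'k-mer dictionary')."""
    kmers = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def is_reconstructable(transcript, kmers, k):
    """True if every overlapping k-mer of the transcript is in the read k-mer set."""
    transcript = transcript.upper()
    if len(transcript) < k:
        # Assumption: a sequence shorter than k cannot be checked, so reject it.
        return False
    return all(
        transcript[i:i + k] in kmers
        for i in range(len(transcript) - k + 1)
    )


def oracle_set(references, reads, k=25):
    """Names of reference sequences fully covered by read k-mers.

    `references` maps a sequence name to its nucleotide string.
    """
    kmers = build_kmer_set(reads, k)
    return {
        name
        for name, seq in references.items()
        if is_reconstructable(seq, kmers, k)
    }


if __name__ == "__main__":
    # Toy data with a small k so the example is easy to follow by hand.
    reads = ["ACGTAC", "GTACGG"]
    references = {"tx1": "ACGTACGG", "tx2": "ACGTTT"}
    print(oracle_set(references, reads, k=4))
```

In the toy data, tx1 is covered by k-mers from two different reads although no single read spans it, which is the caveat the paper mentions about transcripts that appear reconstructable. tx2 is rejected because the k-mer CGTT never occurs in a read.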

Thanks in advance for any help! (And I apologize if this is a stupid question - I am just starting out in bioinformatics research.)

reference sequence protein rna • 2.5k views


score 3 · Answer 1 · 2011-06-14


12.9 years ago

Michael Schubert ★ 7.1k

Two links for you:

the RefSeq FAQ
the FTP URL for mouse



Hey, why a negative vote on this one? This is a perfectly valid answer!

Reply 12.9 years ago by Lyco ★ 2.3k


Indeed, a bit peculiar ;)

Reply 12.9 years ago by Michael Schubert ★ 7.1k

score 2 · Answer 2 · 2011-06-14

I haven't read this particular paper, but most people mean the RefSeq database when they say 'reference sequence', especially for mammalian sequences. So RefSeq is the best bet for the mouse reference sequences. Alternatively, 'reference sequence' can mean the sequence published by the associated genome project; this is often the case for bacteria or simple eukaryotes. For S. pombe, this would probably be the version at the Sanger Centre.

score 0 · Answer 3 · 2011-06-22


12.9 years ago

Not Durrett • 0

Thanks (very late, I know) to both of you for the answers - just what I was looking for.

[I know that it is inappropriate to post this as an answer. I asked my question on a public lab computer without logging in, and I am not able to comment as a new user.]

[argh, can't delete it now.]

-OP



If this answers your question, you could 'close' this thread by accepting one of the answers.

Reply 12.9 years ago by Lyco ★ 2.3k