Question

Database of genes with manually curated/fixed sequences?

4

8.6 years ago

Yannick Wurm ★ 2.5k

When working with gene sequences from an understudied species, it can be challenging to know whether the gene prediction model is correct. Methods for determining this include comparisons to other "known" gene predictions as well as RNAseq data - manually or with the help of tools like GeneValidator (caveat: I am a coauthor).

Such approaches rest on having high quality databases to compare to. Many of us know that the SwissProt database is of high quality because the gene predictions it contains are manually examined and fixed (i.e., curated) by expert users. But it only contains (relatively) few genes from few organisms.

Additional curation occurs as part of most new genome projects: dozens to hundreds of gene predictions are similarly manually curated by PhD students, postdocs, staff scientists and professors. However, as far as I know, the knowledge of which gene predictions were curated and which are raw and uncurated is lost in the supplementary materials of each paper, because the manually and automatically determined gene predictions are merged into a single official gene set before submission to NCBI. This is potentially a huge loss. Or am I missing something?

In other words, is there a database that centralizes curated gene predictions? Or a "tag" by which to identify manually curated gene predictions present in NCBI nr?

Thanks! Yannick

homology database gene-prediction similarity • 2.6k views

updated 19 months ago by Ram 43k • written 8.6 years ago by Yannick Wurm ★ 2.5k

Ram · Answer 1 · 2015-09-21

4

8.6 years ago

Andrzej Zielezinski 11k

RefSeq, which is part of NCBI nr, provides a clear distinction between predicted and curated protein/nucleotide sequences. This information is encoded in the record's accession number. RefSeq accession numbers have the format 2 letters + underscore + 6 or more digits (e.g. NM_123456). Accession numbers that begin with the letter X [XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein)] are model sequences predicted by NCBI's genome annotation pipeline. By contrast, the corresponding accession numbers that start with the letter N [NM_, NR_, and NP_] are curated (manually reviewed by NCBI staff or collaborators).
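As a rough illustration of the prefix rule described above, here is a minimal Python sketch that sorts accession strings into "curated" or "predicted". The function name `classify_refseq` and the lookup table are made up for this example, and the table assumes that NM_/NR_/NP_ are the curated counterparts of XM_/XR_/XP_. Any other prefix, and anything that does not look like a RefSeq accession, is returned as `None` rather than guessed at.

```python
import re

# Hypothetical lookup table: only the mRNA / non-coding RNA / protein
# prefixes discussed in the answer. Other RefSeq prefixes are deliberately
# left out, so they fall through to None.
REFSEQ_PREFIXES = {
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA"),
    "NP": ("curated", "protein"),
    "XM": ("predicted", "mRNA"),
    "XR": ("predicted", "non-coding RNA"),
    "XP": ("predicted", "protein"),
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def classify_refseq(accession):
    """Return (status, molecule) for a known prefix, otherwise None."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # not in the 2-letters + underscore + digits format
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ["NM_123456", "XP_000001.2", "NC_000001", "P12345"]:
        print(acc, classify_refseq(acc))
```

Something like this could be used to keep only the N-prefixed hits from a BLAST search against nr when picking reference sequences to compare a gene prediction against.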

updated 19 months ago by Ram 43k • written 8.6 years ago by Andrzej Zielezinski 11k

0

Thanks Andrzej - this is very helpful. I hadn't realized this distinction existed. Do you know whether the two types of gene predictions (manually curated and automatic) that people submit when sequencing a new genome would be similarly differentiated?

updated 19 months ago by Ram 43k • written 8.6 years ago by Yannick Wurm ★ 2.5k