Question: Database of genes with manually curated/fixed sequences?
Yannick Wurm (Queen Mary University London) wrote, 2.7 years ago:

When working with gene sequences from an understudied species, it can be challenging to know whether the gene prediction model is correct. Methods for determining this include comparisons to other "known" gene predictions as well as RNAseq data - manually or with the help of tools like GeneValidator (caveat: I am a coauthor).

Such approaches rest on having high quality databases to compare to. Many of us know that the SwissProt database is of high quality because the gene predictions it contains are manually examined and fixed (i.e., curated) by expert users. But it only contains (relatively) few genes from few organisms.

Additional curation occurs as part of most new genome projects: dozens to hundreds of gene predictions are similarly manually curated by PhD students, postdocs, staff scientists and professors. However, as far as I know, the knowledge of which gene predictions were curated and which are raw and uncurated is lost in the supplementary materials of each paper, because the manually and automatically determined gene predictions are merged into a single official gene set before submission to NCBI. This is potentially a huge loss. Or am I missing something?

In other words: is there a database that centralizes curated gene predictions? Or a "tag" by which to identify manually curated gene predictions present in NCBI nr?

Thanks! Yannick

a.zielezinski wrote, 2.7 years ago:

RefSeq, which is part of NCBI nr, provides a clear distinction between predicted and curated protein/nucleotide sequences. This information is encoded in the record's accession number. RefSeq accession numbers consist of 2 letters + underscore + 6 or more digits (e.g., NM_123456). Accession numbers that begin with the letter X [XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein)] are models predicted by NCBI’s genome annotation pipeline. In contrast, the corresponding accession numbers that begin with the letter N [NM_, NR_, and NP_] are generally curated (manually reviewed by NCBI staff or collaborators).
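
As a rough illustration of the prefix rule described above, here is a minimal Python sketch that sorts accessions into "curated" and "predicted". The function name `classify_refseq`, the lookup table and its labels are my own for illustration and are not part of any NCBI tool. The pattern deliberately accepts any number of digits and an optional ".version" suffix, which is an assumption on my part; prefixes not listed in the table simply return None.

```python
import re

# Prefix table built from the rule described in the answer above.
# The labels ("curated"/"predicted") are illustrative, not NCBI terminology.
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NR": ("non-coding RNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XR": ("non-coding RNA", "predicted"),
    "XP": ("protein", "predicted"),
}

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return (molecule, status) for a listed prefix, otherwise None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # does not look like a RefSeq-style accession
    return REFSEQ_PREFIXES.get(match.group(1))  # None for unlisted prefixes


if __name__ == "__main__":
    for acc in ["NM_123456", "XP_123456.1", "P12345"]:
        print(acc, classify_refseq(acc))
```

Running something like this over the sequence identifiers in a BLAST hit table would give a quick first split between hits to curated and predicted records, although it says nothing about how thoroughly any individual record was reviewed.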




Thanks Andrzej - this is very helpful. I hadn't realized this distinction. Do you know whether the 2 types of submissions made by people sequencing a new genome would be similarly differentiated?

Reply written 2.7 years ago by Yannick Wurm