Question: Database of genes with manually curated/fixed sequences?
2.7 years ago
Yannick Wurm
Queen Mary University London
Yannick Wurm wrote:

When working with gene sequences from an understudied species, it can be challenging to know whether the gene prediction model is correct. Methods for determining this include comparisons to other "known" gene predictions as well as to RNAseq data, either manually or with the help of tools like GeneValidator (caveat: I am a coauthor).

Such approaches rest on having high-quality databases to compare against. Many of us know that the SwissProt database is of high quality because the gene predictions it contains are manually examined and fixed (i.e., curated) by expert users. But it contains (relatively) few genes from few organisms.

Additional curation occurs as part of most new genome projects: dozens to hundreds of gene predictions are similarly manually curated by PhD students, postdocs, staff scientists and professors. However, as far as I know, the knowledge of which gene predictions were curated and which are raw and uncurated is lost in the supplementary materials of every paper, because the manually and automatically determined gene predictions are merged into a single official gene set before submission to NCBI. This is potentially a huge loss. Or am I missing something?

That is, is there a database that centralizes curated gene predictions? Or a "tag" by which to identify manually curated gene predictions present in NCBI nr?

Thanks! Yannick

2.7 years ago
a.zielezinski wrote:

RefSeq, which is part of NCBI nr, provides a clear distinction between predicted and curated protein/nucleotide sequences. This information is stored in the record's accession number. RefSeq accession numbers have the format 2 letters + underscore + 6 or more digits (e.g., NM_123456). Accession numbers that begin with the letter X [XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein)] are molecule models predicted by NCBI's genome annotation pipeline. In contrast, the corresponding accession numbers that start with the letter N [NM_ (mRNA), NR_ (non-coding RNA), and NP_ (protein)] are curated (manually reviewed by NCBI staff or collaborators).
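
For anyone who wants to apply this to a list of BLAST hits, here is a minimal Python sketch of the prefix check. The function name classify_refseq and the category labels are made up for illustration, and only the six mRNA/non-coding RNA/protein prefixes are handled; any other RefSeq-style prefix is simply reported as "other" rather than guessed at.

```python
import re

# Prefixes for transcript and protein records. Anything else that looks
# like a RefSeq accession is reported as "other" (assumption: we do not
# try to interpret the remaining prefixes here).
PREDICTED = {"XM": "mRNA", "XR": "non-coding RNA", "XP": "protein"}
CURATED = {"NM": "mRNA", "NR": "non-coding RNA", "NP": "protein"}

# Two uppercase letters, an underscore, digits, and an optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return (status, molecule) for a RefSeq-style accession string.

    status is one of "predicted", "curated", "other" or "unrecognized";
    molecule is None unless the prefix is one of the six listed above.
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return ("unrecognized", None)
    prefix = match.group(1)
    if prefix in PREDICTED:
        return ("predicted", PREDICTED[prefix])
    if prefix in CURATED:
        return ("curated", CURATED[prefix])
    return ("other", None)


if __name__ == "__main__":
    for acc in ["NM_123456", "XP_012345678.1", "NC_000001.11", "P12345"]:
        print(acc, classify_refseq(acc))
```

This only inspects the accession string; it does not query NCBI, so it says nothing about whether a given record actually exists.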



Thanks Andrzej, this is very helpful. I hadn't realized this distinction existed. Do you know whether the two types of gene predictions (manually curated and automatic) that people sequencing a new genome submit would be similarly differentiated?

