Database of genes with manually curated/fixed sequences?
8.4 years ago
Yannick Wurm ★ 2.5k

When working with gene sequences from an understudied species, it can be challenging to know whether the gene prediction model is correct. Methods for determining this include comparisons to other "known" gene predictions as well as RNAseq data - manually or with the help of tools like GeneValidator (caveat: I am a coauthor).

Such approaches rest on having high quality databases to compare to. Many of us know that the SwissProt database is of high quality because the gene predictions it contains are manually examined and fixed (i.e., curated) by expert users. But it only contains (relatively) few genes from few organisms.

Additional curation occurs as part of most new genome projects: dozens to hundreds of gene predictions are similarly manually curated by PhD students, postdocs, staff scientists and professors. However, as far as I know, the knowledge of which gene predictions were curated and which are raw and uncurated is lost in the supplementary materials of every paper, because the manually and automatically determined gene predictions are merged into a single official gene set before submission to NCBI. This is potentially a huge loss. Or am I missing something?

In other words: is there a database that centralizes curated gene predictions? Or a "tag" by which to identify manually curated gene predictions present in NCBI nr?

Thanks! Yannick

homology database gene-prediction similarity • 2.5k views
8.4 years ago

RefSeq, which is a part of NCBI nr, provides a clear distinction between predicted and curated protein/nucleotide sequences. This information is stored in the record's accession number. RefSeq accession numbers have the format 2 letters + underscore + a numeric identifier of six or more digits (e.g., NM_123456). Accession numbers that begin with the letter X [XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein)] are molecule models predicted by NCBI's genome annotation pipeline. In contrast, the corresponding accession numbers that start with the letter N [NM_ (mRNA), NR_ (non-coding RNA), and NP_ (protein)] are curated (manually reviewed by NCBI staff or collaborators).
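
As a rough illustration of the prefix rule described above, here is a minimal Python sketch that sorts RefSeq-style accessions into "predicted" and "curated". The function name `classify_refseq` and the lookup tables are my own, made up for this example; it only covers the transcript and protein prefixes mentioned in this answer, and anything else (including genomic RefSeq prefixes and non-RefSeq accessions) is reported as "unknown".

```python
import re

# Prefixes discussed in the answer; other RefSeq prefixes exist
# and are deliberately not handled in this sketch.
PREDICTED = {"XM": "mRNA", "XR": "non-coding RNA", "XP": "protein"}
CURATED = {"NM": "mRNA", "NR": "non-coding RNA", "NP": "protein"}

# Two uppercase letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return (status, molecule_type) for a RefSeq-style accession.

    status is "predicted", "curated" or "unknown"; molecule_type is
    None when the status is "unknown". (Hypothetical helper.)
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return ("unknown", None)
    prefix = match.group(1)
    if prefix in PREDICTED:
        return ("predicted", PREDICTED[prefix])
    if prefix in CURATED:
        return ("curated", CURATED[prefix])
    return ("unknown", None)


if __name__ == "__main__":
    # Made-up accessions, for illustration only.
    for acc in ["NM_123456", "XP_001234567.1", "AB_123456", "not-an-accession"]:
        print(acc, classify_refseq(acc))
```

Something like this could be used to filter BLAST hits against nr so that only the curated RefSeq entries are kept as references when checking a gene model.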


Thanks Andrzej - this is very helpful. I didn't realize this distinction. Do you know whether the 2 types of submissions people sequencing a new genome make would be similarly differentiated?

