11 months ago
chscho • 0

Hi all

I'm confused about the definition of "hypothetical" in PGAP RefSeq genome annotations. Until now I assumed that a hypothetical protein is one with evidence only for an ORF, but no experimental evidence for a protein product. While scrolling through the GenBank file of E. coli K-12 substr. MG1655 (NC_000913), I stumbled over 5 entries named "hypothetical protein", and I wondered how these entries made it into the RefSeq annotation. Looking more closely at one of these 5 entries, I was even more puzzled: the gene "uraA" (2618871..2620160) has a protein product labeled "hypothetical protein", but it also has a protein_id (NP_416992.1) as well as references to Swiss-Prot (P0AGM7) and many other databases. So I assumed that this protein is still a predicted ORF without experimental evidence. But the Swiss-Prot entry P0AGM7 shows experimental evidence at the protein level (annotation score 5 of 5). How should I read this?

Thank you all in advance for helping me understand what's going on here.

chscho

This seems rather confusing. My guess is that NCBI's original record has not been updated with the latest annotation information that you can see via the external protein links.

COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to U00096.
            On Nov 3, 2013 this sequence version replaced NC_000913.2.
            Changes to proteins and annotation made on September 24, 2018.
            Current U00096 annotation updates are derived from EcoCyc
            Suggestions for updates can be sent to
            These updates are being generated from a
            collaboration that includes EcoCyc, the University of Wisconsin,
            UniProtKB/Swiss-Prot, and the National Center for Biotechnology
            Information (NCBI).

True, but why would an entry starting with "U" (what are these entries anyway?) be updated in preference to a reference genome entry with an "NC" accession? Furthermore, according to UniProt, the protein structure was already solved by X-ray crystallography back in 2011. And the fact that this "hypothetical protein" has an "NP_" accession really bugs me: if you perform a proteogenomic experiment to identify novel proteins (e.g. with a six-frame translation database), the definition of a "novel" protein (= not found in RefSeq) starts to fall apart.
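
For anyone who wants to check this on their own copy of the record instead of scrolling through it: below is a minimal sketch that lists the CDS features whose /product is exactly "hypothetical protein", together with their protein_id and db_xref values. It uses only the Python standard library and assumes the usual GenBank flat-file layout (feature keys indented by five spaces, qualifiers starting with "/" on their own lines). The function name `hypothetical_cds` is made up for this example, and the parser is deliberately simplistic, so a qualifier value whose continuation line happens to start with "/" would be misread.

```python
import re


def hypothetical_cds(genbank_text):
    """List CDS features whose /product is exactly 'hypothetical protein'."""
    features = []
    current = None
    in_features = False
    for line in genbank_text.splitlines():
        if not in_features:
            # Skip everything up to the FEATURES header line.
            in_features = line.startswith("FEATURES")
            continue
        if not line.startswith(" "):
            break  # a non-indented line (e.g. ORIGIN) ends the feature table
        # A new feature has its key right after five leading spaces.
        match = re.match(r"^ {5}(\S+)\s+(\S.*)$", line)
        if match:
            current = {"key": match.group(1), "location": match.group(2), "lines": []}
            features.append(current)
        elif current is not None:
            current["lines"].append(line.strip())

    hits = []
    for feature in features:
        if feature["key"] != "CDS":
            continue
        quals = {}
        name = None
        for text in feature["lines"]:
            if text.startswith("/"):
                name, _, value = text[1:].partition("=")
                quals.setdefault(name, []).append(value.strip('"'))
            elif name is not None:
                # Continuation line of a wrapped qualifier value.
                quals[name][-1] = (quals[name][-1] + " " + text).strip('"')
        if quals.get("product") == ["hypothetical protein"]:
            hits.append({
                "gene": quals.get("gene", [None])[0],
                "location": feature["location"],
                "protein_id": quals.get("protein_id", [None])[0],
                "db_xref": quals.get("db_xref", []),
            })
    return hits
```

Calling `hypothetical_cds(open("NC_000913.gbk").read())` (the file name is just a placeholder for wherever you saved the record) should return one dictionary per matching CDS, so you can see at a glance which "hypothetical" entries nevertheless carry an NP_ accession and a Swiss-Prot cross-reference.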


