Question

NCBI - incomplete protein annotations

15 months ago

timothy.kirkwood ▴ 140

Hello,

I've been looking at some genomes on NCBI (Nucleotide database) and have come across some incomplete CDS annotations. However, I'm slightly confused as to how they got picked up, given that they seem to lack start/stop codons. For example, the CDSs with locus tags SD37_RS10635 and RS41900 in GCF_000943515.2 (NZ_CP016174) don't seem to have a start or stop codon: they begin/end with tgc/cgc and ccc/gaa respectively. When I visualise them in SnapGene (arrows = predicted CDS), they look like this:

When I used BLASTX, it seemed to corroborate the annotated sequences for these proteins, in that the BLAST hits align along the annotated region and not along the parts of the SnapGene-predicted CDS that fall outside the annotation. However, I'm not sure how convincing this is. It shows only that the annotated region is the part most similar to existing database entries; the query could still be a novel variant of a database protein.

In other cases (where the protein was in a biosynthetic gene cluster (BGC) and could be compared to homologs in homologous BGCs with the same synteny), it looks like the annotation is wrong and one protein has been broken into several pieces. For example, SD37_RS42595 looks like it should be part of a bigger protein (orange arrow), and when I BLAST that I get the protein I would expect (a PKS) with complete query coverage:

[Image: SD37_RS42595]

Can anyone answer (i) why CDSs without start and stop codons are being annotated and (ii) how far they would trust the annotations? How would you normally deal with this: would you take the annotated sequence as the true protein, or extend/truncate the annotated protein at either end until it matches the SnapGene CDS for the annotated protein?
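For anyone wanting to screen annotated CDSs for the issue described above, here is a minimal sketch of a boundary check. It assumes the sequence is given on the coding strand, and it only tests the three most common bacterial start codons (NCBI translation table 11 permits additional, rarer initiators). The function name `check_cds_ends` and the toy sequences are made up for illustration; they are not the actual SD37 sequences.

```python
# Assumption: only the three most common bacterial start codons are tested.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds_ends(cds):
    """Report boundary-codon properties of a coding-strand CDS sequence."""
    seq = cds.upper()
    # Codons read in the frame of the first base; a trailing partial codon is ignored.
    in_frame = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Leave out the final codon only when it really is the last full triplet.
    body = in_frame[:-1] if len(seq) % 3 == 0 else in_frame
    return {
        "length_multiple_of_3": len(seq) % 3 == 0,
        "starts_with_start": seq[:3] in START_CODONS,
        "ends_with_stop": seq[-3:] in STOP_CODONS,
        "internal_stops": sum(codon in STOP_CODONS for codon in body),
    }


# Toy sequences (hypothetical, for illustration only).
examples = {
    "toy_complete": "ATGAAACCCTGA",
    "toy_partial": "tgcaaacccgaa",
}
for name, seq in examples.items():
    print(name, check_cds_ends(seq))
```

A feature that fails both the start and stop checks but has no internal stops is consistent with a partial or pseudogene annotation rather than a conventional start-to-stop ORF call.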

Cheers!

ncbi blast refseq • 1.3k views

ADD COMMENT • link 15 months ago by timothy.kirkwood ▴ 140

I don't know the exact answer to either of your questions.

Yet consider this: 3 of the 64 possible codons are stop codons. In the simplest statistical terms, a random sequence hits a stop codon about once every 21 codons (64/3) in any given frame, so an open reading frame (ORF) longer than that already starts to look non-random. It is not as simple as that, because codon usage is not uniform, but ORFs considerably longer than 21 codons are unlikely to arise by chance, and increasingly so the longer the ORF is. So they could have some kind of cut-off where any ORF longer than, say, 100 codons is automatically annotated, whether it has start and stop codons or not.
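The back-of-the-envelope argument above can be made concrete with a short calculation. This is a sketch under the same simplifying assumption as the comment (every codon is equally likely and independent, so 3/64 of positions are stops); the function names are illustrative. The `p_stop` parameter is there because real genomes deviate from this: in GC-rich genomes the AT-rich stop codons are rarer, so long stop-free stretches occur by chance more often.

```python
def expected_stop_spacing(p_stop=3 / 64):
    """Mean number of codons between stops in one reading frame."""
    return 1 / p_stop


def prob_stop_free(n_codons, p_stop=3 / 64):
    """Chance that n consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


print(f"expected spacing: {expected_stop_spacing():.1f} codons")
for n in (21, 50, 100, 300):
    print(f"{n:>4} codons stop-free by chance: {prob_stop_free(n):.4g}")
```

Under the uniform model, a 21-codon stretch is stop-free by chance roughly 36% of the time, while a 100-codon stretch is stop-free less than 1% of the time, which is why a length cut-off around 100 codons is a plausible heuristic.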

ADD REPLY • link 15 months ago by Mensur Dlakic ★ 27k

Hi Mensur, thanks for the reply. Would the biological assumption here be that the CDS is using a start/stop codon of which we are unaware? Because I'm not sure how it could be considered real without these features.

ADD REPLY • link 15 months ago by timothy.kirkwood ▴ 140

timothy.kirkwood the best option is to send this question to the NCBI help desk. As Mensur Dlakic points out, NCBI could be using a set of criteria to call these in their annotation pipeline. These may be documented publicly (I could not easily find them, but that does not mean they are not there), and if not, you will get an official answer.

ADD REPLY • link 15 months ago by GenoMax 141k

Thanks GenoMax, I followed your advice and added the NCBI reply to my post in case other people are interested.

ADD REPLY • link 15 months ago by timothy.kirkwood ▴ 140

score 1 · Accepted Answer · 2023-01-19

I asked NCBI help desk and here is the reply, for anyone else with similar concerns:

Dear Tim,

You ask some very shrewd questions. NCBI is continuing to refine the structure prediction algorithm within PGAP (Prokaryotic Genome Annotation Pipeline). Working as a biocurator with a deep interest in finding ways to improve PGAP, collaborating with our software developers, I have been trying to find ways to suggest improvements.

As a rule, our detection of coding region features (CDS) with frameshifts, caused by either mutation or sequencing/assembly artifact, is excellent, and our labeling of features as "/pseudo" is very likely to be correct. For certain cases of programmed frameshift (release factor 2, or IS element transposases), we identify the programmed nature of the frameshift and predict a full length real protein correctly. Our detection of "/pseudo" features with internal stop codons is also quite good.

We have greater difficulties distinguishing pseudogenes with truncated coding regions from valid real proteins that simply have novel domain architecture. We feel the algorithm works quite well for some heavily studied lineages, including E. coli, Salmonella, and other Gram-negative pathogens from the Enterobacteriaceae. PGAP has more difficulty in GC-rich taxa such as Streptomyces or Amycolatopsis. Changes over the last couple of years have shown improved preservation of long, multidomain proteins such as NRPS and PKS, but clearly some problems remain.

I will perform a more detailed analysis of our structural annotation for the Amycolatopsis, in order both to report to you how reliable our "partial-in-the-middle" pseudogene reports are in this species, and to document for our developers what the failure mode looks like, and how we might fix it, when our assertion that a protein is "/pseudo" is most probably in error. One option we are exploring is making algorithmic changes, including setting a minimum threshold of percent identity for allowing PGAP to judge that a difference in architecture represents a degraded gene rather than simply an alternative architecture. Another is an expansion of the set of proteins used for identification of homologs informative on gene structure.

I'd like to thank you for writing to us, as feedback on the concerns of experienced users assists us in finding and implementing ways of improving our pipeline.

EDIT - they sent a follow up email:

Once PGAP decides that a feature is a pseudogene, not a functional gene with start and stop codons in the expected locations, the obligation to run from start to stop is dropped. We try to use homology to available proteins or HMMs to estimate the size of the pseudogene feature.
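Since pseudogene features are not required to run from start to stop, it can help to separate them out before judging CDS boundaries. Below is a minimal sketch that splits CDS rows of a GFF3 annotation by the `pseudo=true` attribute mentioned in the NCBI reply. The function names, the `locus_tag` attribute lookup, and the two example rows (`EX_0001`, `EX_0002`) are assumptions for illustration, not real records from this genome.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def split_cds_by_pseudo(gff3_text):
    """Return (regular, pseudo) lists of CDS identifiers from GFF3 text."""
    regular, pseudo = [], []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: rows carry a locus_tag; fall back to ID otherwise.
        tag = attrs.get("locus_tag", attrs.get("ID", "unknown"))
        (pseudo if attrs.get("pseudo") == "true" else regular).append(tag)
    return regular, pseudo


# Hypothetical two-row annotation, for illustration only.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "\t".join(["contig1", "example", "CDS", "100", "400", ".", "+", "0",
               "ID=cds-1;locus_tag=EX_0001;product=hypothetical protein"]),
    "\t".join(["contig1", "example", "CDS", "500", "760", ".", "-", "0",
               "ID=cds-2;locus_tag=EX_0002;pseudo=true;partial=true"]),
])

regular, pseudo = split_cds_by_pseudo(EXAMPLE_GFF3)
print("regular CDS:", regular)
print("pseudo CDS:", pseudo)
```

Features landing in the pseudo list are the ones where a missing start or stop codon is expected by design, so they are the candidates to review by hand against homologs.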

PGAP uses GeneMarkS-2+ for ab initio prediction, by the way, not SnapGene (a tool I do not use).

I attached a file showing my results from a very detailed review. The purpose was dual, partly to answer your questions and partly to continue work I have been doing, analyzing PGAP's performance and looking for ways to improve accuracy. We can expect accuracy to improve over time as we:

- add new protein family models, especially protein profile hidden Markov models (HMMs)
- expand the set of reference proteins in our "Naming Set", currently about 14,000,000 proteins, most of which date from a collection built more than 8 years ago
- make adjustments to the structural annotation algorithm in PGAP

"pseudo=true" assertions made based on frameshifts and internal stop codons proved very reliable in the Amycolatopsis genome.

"pseudo=true" assertions made based on "partial-in-the-middle" findings were somewhat less reliable. Examples seen included known problems such as:

- too low a tolerance for tail-to-tail overlaps in GC-rich organisms
- partial-in-the-middle determinations based on homologs distant enough that domain architectures might be expected to differ

While we are working to address these issues, my finding is that pseudogene assignments are mostly correct, very much so for frameshifts and internal stops, but true even for most apparent truncations.

I recognize that for any one genome, it could be advantageous to see every protein translation that might be real. For NCBI, enabling that would create some costs. To instantiate every possible translation as a protein would flood databases of unique proteins with multiple broken forms for every one proper form. PGAP plays a gatekeeper role, and occasionally stops the prediction of a real protein. We are very much aware this happens and constantly looking for ways to improve our discrimination of real from pseudo.