Question: Regarding the length of hypothetical proteins
13 months ago by
jot87c wrote:

WGS projects provide a lot of data. NCBI holds proteome data for many organisms, and most of the proteins are annotated as hypothetical. Their lengths range from 50 to 13,000 aa. During a literature search I found that in most research papers the hypothetical proteins are selected at random and then annotated. I want to annotate all hypothetical proteins of a particular pathogen, but many of them are only 50 to 200 aa long. What is an appropriate minimum length for hypothetical proteins that are worth annotating further: 150 aa or >200 aa?


Thank you so much, jrj.healey.

Actually, I downloaded the hypothetical protein data for a protozoan from NCBI; it has about 4,500 hypothetical proteins in total. Of these, 628 are shorter than 50 aa, ranging from 33 to 49 aa. Short proteins can exist, but 628 of them?

By "randomly selected" I mean that, during my literature search, I found that most research papers on hypothetical protein annotation do not mention protein length: the hypothetical proteins were selected for annotation at random, i.e. without any stated length criterion.

written 13 months ago by jot87c

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment belongs under @jrj.healey's answer.

written 13 months ago by genomax

Sorry to contribute to messing up the organisation of the thread, but I'll try to keep the comments all together at least.

My suggestion to you, @jot87c, would be to do some manual curation of the proteins first to see whether you trust them. Some of them will probably be false positives, so run something like a low-stringency PSI-BLAST to see whether any have well-known homologues; that will tell you whether they are likely to be real.
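As a rough illustration of the PSI-BLAST suggestion, here is a minimal Python sketch that only assembles a `psiblast` command line (from NCBI BLAST+) without running it. The file names, the database name `swissprot`, and the helper `build_psiblast_command` are assumptions made up for this example; the permissive E-value of 10 and three iterations are just one way to read "low stringency", not a recommended setting.

```python
import shlex


def build_psiblast_command(query_fasta, database, out_path,
                           iterations=3, evalue=10.0):
    """Return the argument list for a permissive PSI-BLAST search.

    All paths and the database name are placeholders (assumptions);
    point them at your own query file and a locally formatted database.
    """
    return [
        "psiblast",
        "-query", query_fasta,               # FASTA file with query protein(s)
        "-db", database,                     # protein BLAST database to search
        "-num_iterations", str(iterations),  # number of PSI-BLAST rounds
        "-evalue", str(evalue),              # permissive reporting threshold
        "-outfmt", "6",                      # tabular output, easy to parse
        "-out", out_path,
    ]


cmd = build_psiblast_command(
    "hypothetical_proteins.faa",  # hypothetical file name
    "swissprot",                  # hypothetical local database name
    "psiblast_hits.tsv",          # hypothetical output file
)
# Print the command so it can be inspected before being run in a shell.
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to pass to `subprocess.run(cmd, check=True)` once BLAST+ and a database are actually installed; proteins with no hits even at this loose threshold are the ones to look at most sceptically.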

Stop focussing on just the length of the protein - it's not really that useful much of the time - you need to be cleverer.

You also need to consider what question you're asking. Does it actually matter if a small fraction of your proteins are false positives? What's the actual research aim?

written 13 months ago by jrj.healey

As far as I understand, you have sequences of several or even many hypothetical proteins from some pathogen?

To annotate them you need to find their known orthologous proteins.

Forget about their length and number for now.

Try the OMA orthology database.


Paste in any of your protein sequences and run the search; OMA will return homologous proteins with similar sequences.

Some of them will be characterized proteins that you will be able to study.

Then try other proteins.

Hopefully this helps.

modified 13 months ago • written 13 months ago by natasha.sernova
13 months ago by
United Kingdom
jrj.healey wrote:

Hypothetical proteins in annotated genomes are detected with algorithms of varying degrees of sophistication. I don’t know what you mean by “randomly selected”.

Their lengths are often already taken into consideration, so you don’t really need to filter by length. They are rarer, but there are some very short proteins, shorter even than 50 amino acids. There are also some colossal proteins, so both extremes could easily be valid lengths.

If you know your bacteria of interest well, maybe you can judiciously throw out some extreme proteins...

No point throwing away data unless you have to though.
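Before discarding anything, it may help simply to look at how the lengths are distributed. The following is a minimal sketch, assuming the proteins are in plain FASTA format; the cut-offs of 50 and 200 aa are taken from the question, while the record names and the helpers `read_fasta_lengths` and `bin_lengths` are made up for illustration.

```python
from collections import Counter


def read_fasta_lengths(lines):
    """Yield (header, length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        else:
            length += len(line)
    if header is not None:
        yield header, length


def bin_lengths(lengths, edges=(50, 200)):
    """Count lengths falling below each edge (edges must be ascending)."""
    counts = Counter()
    for n in lengths:
        for edge in edges:
            if n < edge:
                counts[f"<{edge}"] += 1
                break
        else:
            # Not below any edge: at least as long as the largest one.
            counts[f">={edges[-1]}"] += 1
    return counts


# Tiny made-up input; replace with open("your_proteins.faa") in practice.
example = """>hyp_1
MKTAYIAKQR
>hyp_2
MSE
""".splitlines()

lengths = [n for _, n in read_fasta_lengths(example)]
print(bin_lengths(lengths))
```

A tally like this only shows where the short proteins sit; whether a given bin is trustworthy is still better judged from homology evidence than from length alone.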
