Question

Regarding the length of hypothetical proteins

5.9 years ago

jot87c • 0

WGS projects provide a lot of data. NCBI hosts proteome data for many organisms, and most of the proteins are hypothetical. The lengths of these proteins range from 50 to 13,000 aa. During my literature search I found that in most research papers hypothetical proteins are randomly selected and then annotated. I want to annotate all the hypothetical proteins of a particular pathogen, but many of them are only 50 to 200 aa long. What is an appropriate minimum length for hypothetical proteins that can be taken forward for annotation: 150 aa, or >200 aa?

hypothetical protein annotation • 1.1k views

ADD COMMENT • link 5.9 years ago by jot87c • 0

Thank you so much, JRJ.Healey.

I downloaded the hypothetical protein data for a protozoan from NCBI; it has about 4,500 hypothetical proteins in total. Of these, 628 are shorter than 50 aa, ranging from 33 aa to 49 aa. Small proteins can exist, but 628 of them?

By "randomly selected" I mean this: during my literature search I found that most research papers on hypothetical protein annotation do not mention protein length, and the hypothetical proteins appear to be selected for annotation at random, i.e. without any stated length criterion.
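For anyone wanting to reproduce this kind of count, here is a minimal sketch that tallies sequence lengths from FASTA-formatted text. The function names `read_fasta` and `length_summary`, the toy sequences, and the cutoff values are all illustrative assumptions, not part of any NCBI tool.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(records: Iterable[Tuple[str, str]], cutoff: int = 50) -> Dict[str, int]:
    """Count records and report how many are shorter than `cutoff` residues."""
    # A trailing '*' (stop symbol) is not counted as a residue.
    lengths = [len(seq.rstrip("*")) for _, seq in records]
    return {
        "total": len(lengths),
        "below_cutoff": sum(1 for n in lengths if n < cutoff),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
    }


# Toy input standing in for a downloaded protein FASTA file.
demo = """>p1 hypothetical protein
MKTAYIAK
QRQISFVK
>p2 hypothetical protein
MSL*
"""
print(length_summary(read_fasta(demo.splitlines()), cutoff=10))
```

For a real file, pass an open file handle instead of `demo.splitlines()`, for example `length_summary(read_fasta(open("hypothetical_proteins.faa")))`, where the file name is a placeholder.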

ADD REPLY • link 5.9 years ago by jot87c • 0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment belongs under @jrj.healey's answer.

ADD REPLY • link 5.9 years ago by GenoMax 141k

Sorry to contribute to messing up the organisation of the thread, but I'll try to keep the comments all together at least.

My suggestion to you, @jot87c, would be to do some manual curation of the proteins first to see whether you trust them or not. Some of them will probably be false positives, so run something like a low-stringency PSI-BLAST to see whether they show similarity to any well-known proteins; that will tell you whether they're likely to be real.
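As a rough illustration of the PSI-BLAST suggestion, the sketch below only assembles a `psiblast` command line (NCBI BLAST+) and prints it. The file names, the database name `swissprot`, and the E-value and iteration settings are placeholder assumptions; what counts as "low stringency" depends on your data, and a local BLAST database must already exist for the command to run.

```python
import shlex
from typing import List


def build_psiblast_command(
    query_fasta: str,
    database: str,
    out_tsv: str,
    evalue: float = 10.0,   # permissive E-value threshold (assumed setting)
    iterations: int = 3,    # number of PSI-BLAST rounds (assumed setting)
) -> List[str]:
    """Return an argument list for BLAST+ psiblast with tabular output."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-num_iterations", str(iterations),
        "-outfmt", "6",     # tab-separated hit table
        "-out", out_tsv,
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_psiblast_command("protein_0001.faa", "swissprot", "protein_0001_hits.tsv")
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` once BLAST+ is installed.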

Stop focussing on just the length of the protein - it's not really that useful much of the time - you need to be cleverer.

You also need to consider what question you're asking. Does it actually matter if a small fraction of your proteins are false positives? What's the actual research aim?

ADD REPLY • link 5.9 years ago by Joe 21k

As far as I understand, you have sequences of several, or even many, hypothetical proteins from some pathogen?

To annotate them you need to find their known orthologous proteins.

Forget about their length or number for the moment.

Try the following orthology database:

https://omabrowser.org/oma/home/

Change 'IDENTIFIER' to 'PROTEIN SEQUENCE'.

Insert one of your protein sequences.

Run it, and OMA will give you some homologous proteins with similar sequences.

Some of them will be real proteins that you will be able to study.

Then try other proteins.

Hopefully this helps.

ADD REPLY • link 5.9 years ago by natasha.sernova ★ 4.0k

score 2 · Answer 1 · 2018-05-17

Hypothetical proteins in annotated genomes are detected with algorithms of varying degrees of sophistication. I don’t know what you mean by “randomly selected”.

Their lengths are often already taken into consideration, so you don’t really need to filter by length. It’s rarer, but there are some very short proteins, shorter even than 50 amino acids. There are also some colossal proteins, so the lengths you are seeing could easily be valid.

If you know your bacteria of interest well, maybe you can judiciously throw out some extreme proteins...

No point throwing away data unless you have to though.
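One way to act on this without discarding anything is to label each protein by length and carry the label through the rest of the analysis. The sketch below assumes hypothetical cutoffs of 50 and 5,000 residues and a made-up function name `flag_by_length`; neither comes from an established standard.

```python
from typing import Iterable, List, Tuple


def flag_by_length(
    records: Iterable[Tuple[str, str]],
    short_cutoff: int = 50,     # assumed threshold, adjust per organism
    long_cutoff: int = 5000,    # assumed threshold, adjust per organism
) -> List[Tuple[str, int, str]]:
    """Return (record_id, length, flag) for every record; nothing is dropped."""
    rows = []
    for record_id, seq in records:
        n = len(seq)
        if n < short_cutoff:
            flag = "short"
        elif n > long_cutoff:
            flag = "long"
        else:
            flag = "typical"
        rows.append((record_id, n, flag))
    return rows


# Toy records standing in for parsed FASTA entries.
toy = [("p1", "M" * 33), ("p2", "M" * 320), ("p3", "M" * 13000)]
for row in flag_by_length(toy):
    print("\t".join(str(field) for field in row))
```

Proteins flagged "short" or "long" can then be reviewed by hand, or checked against homology search results, before deciding whether any should be excluded.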