WGS projects provide a lot of data. NCBI holds proteome data for many organisms, and most of the proteins are hypothetical. Protein lengths range from 50 to 13,000 aa. During my literature search I found that in most research papers the hypothetical proteins chosen for annotation are selected at random. I want to annotate all the hypothetical proteins of a particular pathogen, but many of them are only 50 to 200 aa long. What is an appropriate minimum length for hypothetical proteins that are to be annotated further: 150 aa, or more than 200 aa?
Hypothetical proteins in annotated genomes are detected with algorithms of varying degrees of sophistication. I don’t know what you mean by “randomly selected”.
Their lengths are often already taken into consideration by those algorithms, so you don’t really need to filter by length. It’s rarer, but some real proteins are very short, shorter even than 50 amino acids, and some are colossal, so lengths at either extreme could easily be valid.
If you know your bacteria of interest well, maybe you can judiciously throw out some extreme proteins...
There’s no point throwing away data unless you have to, though.
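If you want to see how the hypothetical proteins in your proteome are distributed by length before deciding anything, a rough sketch like the one below may help. It assumes a protein FASTA file whose headers contain the phrase "hypothetical protein" (common in NCBI downloads, but check your own file). The function names and the bin edges of 50, 200 and 1000 aa are made up for illustration and are not recommended cutoffs.

```python
from bisect import bisect_right
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hypothetical_lengths(handle):
    """Lengths of records whose header mentions 'hypothetical protein'.

    Assumption: the annotation is written into the FASTA header.
    A trailing '*' (stop symbol) is not counted as a residue.
    """
    return [
        len(seq.rstrip("*"))
        for header, seq in read_fasta(handle)
        if "hypothetical protein" in header.lower()
    ]


def length_bins(lengths, edges=(50, 200, 1000)):
    """Count lengths per bin; the edges are illustrative, not advice."""
    bounds = [0, *edges]
    labels = [f"{lo}-{hi - 1}" for lo, hi in zip(bounds, bounds[1:])]
    labels.append(f">={edges[-1]}")
    counts = dict.fromkeys(labels, 0)
    for n in lengths:
        # bisect_right puts a length equal to an edge into the higher bin
        counts[labels[bisect_right(edges, n)]] += 1
    return counts


# Tiny made-up example standing in for a real proteome file
demo = StringIO(
    ">p1 hypothetical protein [Some organism]\nMKT\nAAA\n"
    ">p2 DNA gyrase subunit A\nMMMM\n"
    ">p3 Hypothetical protein\n" + "M" * 250 + "\n"
)
print(length_bins(hypothetical_lengths(demo)))
```

To run it on real data, pass an open file handle instead of the StringIO demo, for example open("proteome.faa"), where the file name is a placeholder. The point is only to look at the distribution; nothing here discards any sequence.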