19 months ago
liorglic ★ 1.2k
I am trying to develop a procedure for assessing the reliability of proteins derived from a genome annotation analysis. One thing I'd like to do is search the annotated proteins for protein domains, the idea being that proteins containing known domains are more likely to be "reliable". I was thinking of using the InterPro DB for that, specifically InterProScan for running the search. My questions are:
- Does this idea make sense to you?
- Should I limit my search in some way? For example, maybe only search for "functional" domains (e.g. "Ribonuclease H-like superfamily", and not "Retrotransposon gag domain"), or only specific member DBs. What would you recommend for this purpose?
- Are there any specific terms that I should be wary of, e.g. "Domain of unknown function"?
- Anything else you would add or do differently in this analysis?
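For concreteness, the filtering described in the questions above could be sketched roughly as follows. This assumes InterProScan's tab-separated output (protein ID in column 1, member DB in column 4, signature accession and description in columns 5 and 6, and, when InterPro lookup is enabled, InterPro accession and description in columns 12 and 13). The function name `informative_hits` and the exclusion patterns are hypothetical choices for illustration, not a recommended list.

```python
import re

# Hypothetical exclusion patterns: terms one might not want to count as
# evidence for a "reliable" protein. Adjust to the project at hand.
EXCLUDE = re.compile(r"unknown function|retrotransposon", re.IGNORECASE)


def informative_hits(tsv_text, member_dbs=None, exclude=EXCLUDE):
    """Map each protein ID to the set of domain accessions that survive filtering.

    tsv_text   -- contents of an InterProScan TSV output file
    member_dbs -- optional set of member DB names to keep (e.g. {"Pfam"})
    exclude    -- compiled regex; hits whose description matches are dropped
    """
    hits = {}
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        if len(cols) < 6:
            continue  # skip blank or malformed lines
        # Pad so the InterPro columns exist even without InterPro lookup.
        cols += ["-"] * (13 - len(cols))
        protein, analysis = cols[0], cols[3]
        sig_acc, sig_desc = cols[4], cols[5]
        ipr_acc, ipr_desc = cols[11], cols[12]
        if member_dbs is not None and analysis not in member_dbs:
            continue
        if exclude.search(sig_desc + " " + ipr_desc):
            continue
        # Prefer the InterPro accession; fall back to the member DB signature.
        accession = ipr_acc if ipr_acc != "-" else sig_acc
        hits.setdefault(protein, set()).add(accession)
    return hits
```

Proteins missing from the returned dictionary have no hit that passed the filters, and those are the ones I would flag as less "reliable" under this idea.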
Simply detecting a well-known protein domain is probably not a good indication of annotation quality. If you extend this to comparing the domain composition of your annotated proteins to that of known proteins, then it becomes a form of sequence similarity measurement. If there are already known proteins for this genome or for related species, you could look more directly for sequence similarity between your annotations and previously annotated proteins. There are plenty of genome annotation papers out there; look at how they estimate the quality of their annotations.
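As a rough illustration of what comparing domain composition could look like, here is a minimal sketch that scores the overlap between the domain set of an annotated protein and the domain sets of reference proteins using the Jaccard index. The choice of Jaccard, the function names, and the input layout (dictionaries of protein ID to a set of domain accessions) are all assumptions made for the example; a direct sequence similarity search against known proteins is likely the more informative check.

```python
def domain_jaccard(domains_a, domains_b):
    """Jaccard index of two domain sets: shared domains / all domains."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0  # no domains on either side: treat as no evidence
    return len(a & b) / len(a | b)


def best_reference_match(query_domains, reference):
    """Return (reference_id, score) for the reference protein whose domain
    set overlaps most with the query; (None, 0.0) if nothing overlaps.

    reference -- dict mapping reference protein ID to a set of accessions
    """
    best_id, best_score = None, 0.0
    for ref_id, ref_domains in reference.items():
        score = domain_jaccard(query_domains, ref_domains)
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id, best_score
```

Note that this ignores domain order and copy number, so two proteins with the same domains in a different arrangement score as identical.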