We are seeking to annotate nitrogen-fixing genes within metagenomes, similar to what Hotpep and dbCAN do for carbohydrate-active enzymes.
First and foremost, is there a premade tool capable of doing this? I have searched but have not been able to find anything.
If not, is it a viable option to generate a profile HMM database for nitrogen-fixing genes scraped from UniProtKB and then use HMMER to annotate them within our metagenomes?
We would like to combine this with a DIAMOND database built from the same UniProt protein sequences, and use DIAMOND for the same purpose of annotating nitrogen-fixing genes. We would then cross-reference the two tools and keep only the hits that appear in both.
Is this methodology valid, and is UniProt a good place to scrape the protein sequences for such a database?
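To make the cross-referencing step concrete, here is a minimal Python sketch of what we have in mind. It assumes the HMMER results come from `hmmsearch --tblout` (metagenome protein ID in the first column, full-sequence E-value in the fifth) and the DIAMOND results are in the default 12-column tabular format (`--outfmt 6`, query ID in the first column, E-value in the eleventh). The function names, the file arguments, and the 1e-5 E-value cutoff are placeholders chosen for illustration, not recommended settings.

```python
import sys
from typing import Iterable, Set


def hmmer_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Set[str]:
    """Collect protein IDs from hmmsearch --tblout lines that pass the cutoff."""
    hits = set()
    for line in lines:
        # Skip blank lines and the '#' header/footer comments HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        # Column 1 = target (metagenome protein), column 5 = full-sequence E-value.
        if float(fields[4]) <= max_evalue:
            hits.add(fields[0])
    return hits


def diamond_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Set[str]:
    """Collect query IDs from DIAMOND tabular (outfmt 6) lines that pass the cutoff."""
    hits = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        # Column 1 = query (metagenome protein), column 11 = E-value.
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def consensus_hits(hmmer_lines: Iterable[str], diamond_lines: Iterable[str],
                   max_evalue: float = 1e-5) -> Set[str]:
    """Keep only proteins reported by both tools."""
    return hmmer_hits(hmmer_lines, max_evalue) & diamond_hits(diamond_lines, max_evalue)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python consensus.py hmmer.tbl diamond.tsv
    with open(sys.argv[1]) as hmmer_file, open(sys.argv[2]) as diamond_file:
        for protein_id in sorted(consensus_hits(hmmer_file, diamond_file)):
            print(protein_id)
```

Taking the intersection favours precision over recall: a divergent homologue that only the profile HMM detects would be dropped, so the union of both sets may be worth inspecting as well.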
Hi Michael,
Thank you for the detailed reply. Is AmigiGO a typo perhaps, as searching for it does not yield a GO annotation style tool? Of the approaches you mentioned, which do you feel would be best to capture the entire nitrogen-fixing gene landscape of a metagenome?
Lastly, on the EBI page you linked there are 594 results returned. Am I correct to interpret these as 594 different nitrogenous enzyme models, whereas the ~443k results in UniProtKB are each an entry from a specific bacterium?
Dear Robert, I meant AmiGO; I have corrected the typo.
To get a complete picture of the nitrogen fixing genes, I would rank the methods as follows (top = most comprehensive)
You might even combine 1. & 3. Unfortunately, running a full InterProScan will require a cluster or multiple CPUs and will most likely take the longest. Which method to choose depends on what else you want to do with the annotation.
If you only intend to identify your gene family of interest, it might be best to take a DIY approach and build a custom InterProScan pipeline: identify the databases in the InterProScan installation and replace all tool-specific databases with custom databases that contain only the models of interest. That would speed up the search, but I haven't tried it, and it would require some more in-depth knowledge of the different database formats.
Finally, I retrieved 194 results for my search "nitrogenase". Those are models of different proteins or domains, extracted from different databases. UniProtKB entries contain a single AA sequence or isoforms, normally from a single species.
Thank you, I will attempt your suggestions. Thankfully I have access to an HPC, so I can run the full InterProScan without issue. Why, if InterProScan is so sensitive, do tools such as Hotpep have such popularity for identifying carbohydrate-active enzymes?