I am in the process of mapping my metagenomic sequence reads to multiple profile HMM models. I am struggling to find the proper values to filter them and configure the threshold. At this point, I tried some values as a threshold for both e-value and score but I am not sure what exactly yields a good result. I have been searching for literature related to the proper threshold values for metagenomic sequences as I am only seeing those for specific genomes. Would be really grateful for anyone's guidance.
I see! Thank you so much for this. Although just to clarify, my main objective is to identify sequence matches using HMMScan between metagenomic sequence reads and recombinase protein families in profile HMMS for composition profiling, so in this regard, an E-value of E=0.001 should suffice right?
There is a big difference between searching
metagenomic sequences
andmetagenomic sequence reads
. I really don't know how to help you with the latter, nor do I understand why you would want to search directly through the sequencing reads. What do you do: translate all the reads in 6 reading frames and then search against that huge database of peptides?Is there anything preventing you from assembling those reads, predicting the genes and proteins for the whole assembly, then doing a proper profile HMM search? That will increase the sensitivity many orders of magnitude. I assume even if you find some recombinase hits using your approach, you would eventually want to recover their whole sequences. That will get you back to the assembly step, so you might as well do this properly.
I did a quick search against a random assembled metagenome on my computer using
PF07508.hmm
and found 344 hits withE < 1e-10
. That was a "small" metagenome with only about 1.6 million total proteins, and the search took about 1 second. A larger metagenome (~19 million proteins) took a whole 9 seconds and identified 1530 good matches.Currently, this is indeed what I am doing "translate all the reads in 6 reading frames and then search against that huge database of peptides". I am having reservations about whether an assembly approach is actually needed since I am only doing a simple profiling (compositional analysis). I am trying to make this as less computationally extensive as possible.
With all due respect, you are already wasting more time by asking for help and waiting for responses than what would take to assemble. Very large metagenomes take maybe a day or two to assemble, but will give you an unambiguous collection of proteins. I am guessing you already spent more than a day using this "less computationally extensive" approach.
Besides, how will you get the same information as I outlined above using your approach? Let's say that you have 5000 pieces of recombinases in your analysis. Does that mean a total of 3000 or 500 recombinases in your metagenome? Some of those short hits will belong to the same protein, and others won't. How will you figure out that part without assembling them? Finally, many recombinases have domains (helix-turn-helix DNA binding domain comes to mind) that are present in other proteins, so from these short peptides you will likely get a large number of false positives that just happen to have those shared domains.