Filtering HMMSCAN Domtblout Output when using Metagenomic Sequence as Input
1
0
Entering edit mode
7 weeks ago
Jane123 • 0

I am in the process of mapping my metagenomic sequence reads to multiple profile HMM models. I am struggling to find the proper values to filter them and configure the threshold. At this point, I tried some values as a threshold for both e-value and score but I am not sure what exactly yields a good result. I have been searching for literature related to the proper threshold values for metagenomic sequences as I am only seeing those for specific genomes. Would be really grateful for anyone's guidance.

hmmscan sequences domtblout metagenomic • 609 views
ADD COMMENT
1
Entering edit mode
7 weeks ago
Mensur Dlakic ★ 29k

I have been searching for literature related to the proper threshold values for metagenomic sequences as I am only seeing those for specific genomes.

There is no different threshold for metagenomic sequences compared to other sequences. The way this is done depends on what you are trying to achieve. If you just want to find related sequences, using E=0.001 as a cut-off will do the trick. That doesn't necessarily mean that a sequence match with that value is functionally identical, but it will be related. If you want to find orthologs, smaller E-values will be needed, and those cut-offs are variable for each protein family.

No matter what, using bit-score thresholds is not a good idea.

ADD COMMENT
0
Entering edit mode

I see! Thank you so much for this. Although just to clarify, my main objective is to identify sequence matches using HMMScan between metagenomic sequence reads and recombinase protein families in profile HMMS for composition profiling, so in this regard, an E-value of E=0.001 should suffice right?

ADD REPLY
0
Entering edit mode

There is a big difference between searching metagenomic sequences and metagenomic sequence reads. I really don't know how to help you with the latter, nor do I understand why you would want to search directly through the sequencing reads. What do you do: translate all the reads in 6 reading frames and then search against that huge database of peptides?

Is there anything preventing you from assembling those reads, predicting the genes and proteins for the whole assembly, then doing a proper profile HMM search? That will increase the sensitivity many orders of magnitude. I assume even if you find some recombinase hits using your approach, you would eventually want to recover their whole sequences. That will get you back to the assembly step, so you might as well do this properly.

I did a quick search against a random assembled metagenome on my computer using PF07508.hmm and found 344 hits with E < 1e-10. That was a "small" metagenome with only about 1.6 million total proteins, and the search took about 1 second. A larger metagenome (~19 million proteins) took a whole 9 seconds and identified 1530 good matches.

ADD REPLY
0
Entering edit mode

Currently, this is indeed what I am doing "translate all the reads in 6 reading frames and then search against that huge database of peptides". I am having reservations about whether an assembly approach is actually needed since I am only doing a simple profiling (compositional analysis). I am trying to make this as less computationally extensive as possible.

ADD REPLY
0
Entering edit mode

I am trying to make this as less computationally extensive as possible.

With all due respect, you are already wasting more time by asking for help and waiting for responses than what would take to assemble. Very large metagenomes take maybe a day or two to assemble, but will give you an unambiguous collection of proteins. I am guessing you already spent more than a day using this "less computationally extensive" approach.

Besides, how will you get the same information as I outlined above using your approach? Let's say that you have 5000 pieces of recombinases in your analysis. Does that mean a total of 3000 or 500 recombinases in your metagenome? Some of those short hits will belong to the same protein, and others won't. How will you figure out that part without assembling them? Finally, many recombinases have domains (helix-turn-helix DNA binding domain comes to mind) that are present in other proteins, so from these short peptides you will likely get a large number of false positives that just happen to have those shared domains.

ADD REPLY

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6