Protein Domain Prediction To Protein Domain Architecture Conversion?
9.8 years ago
anandksrao • 0

I am using HHBLITS to predict protein domains. I now want to take the collection of predicted domains for a protein in the HHBLITS output and convert it to a protein domain architecture. As you can see from the example output below for one protein scanned against PfamA using HHBLITS, there are multiple hits, some of them overlapping and therefore conflicting. How do I go about resolving these conflicts / overlaps, and on what basis do I parse such an output and convert it to a string of protein domains, i.e. a protein domain architecture?

Here is an excerpt from the HHBLITS author regarding how to solve this problem of overlap / conflict - "The probability that a pair of residues is correctly aligned is the product of the probability for the database match to be homologous (given by the values in the Probab column of the hit list) times the posterior probability of the residue pair to be correctly aligned given the database match is correct in the first place. The posterior probabilities are specified by the confidence numbers in the last line of the alignment blocks. A 0 corresponds to 0-10%, a 9 to 90-100%. Therefore, an obvious solution is to prune the alignments in the overlapping region such that the sum of total probabilities is maximized. There is no script yet that does this automatically."

I don't have much of a clue what that means, let alone how to implement it in Perl or some other language! Could someone guide me through this please?

Thanks all for your help. - AksR

No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
1 PF04379 DUF525:  Protein of un  99.3 3.2E-17 8.3E-21  123.8   0.0   87  311-409     2-88  (90)
2 PF00646 F-box:  F-box domain;   85.7   0.018 5.5E-06   32.6   0.0   47    1-47      1-47  (48)
3 PF09346 SMI1_KNR4:  SMI1 / KNR  83.5   0.025 8.3E-06   35.6   0.0   27  114-140     1-27  (130)
4 PF12937 F-box-like:  F-box-lik  78.7   0.057 1.7E-05   30.9   0.0   38   10-47      8-45  (47)
5 PF05743 UEV:  UEV domain;  Int  38.5     1.2 0.00033   30.4   0.0   43  269-312    36-85  (121)
6 PF03360 Glyco_transf_43:  Glyc  30.9       2 0.00052   32.3   0.0   47  247-293    56-102 (207)
7 PF05247 FlhD:  Flagellar trans  23.0     3.7 0.00089   28.7   0.0   39  168-206    11-50  (104)
8 PF08745 UPF0278:  UPF0278 fami  18.5     5.4  0.0013   30.5   0.0   42  112-153    61-108 (205)
9 PF09336 Vps4_C:  Vps4 C termin  18.0       5  0.0013   24.3   0.0   24  103-126    34-57  (62)
10 PF10959 DUF2761:  Protein of u  15.6     7.8  0.0017   27.0   0.0   16  376-391    54-69  (95)

9.8 years ago
Ketil 4.1k

I'm not familiar with this tool, so I don't know where or how you find the line with posterior probabilities. I guess he means that you can extract the non-overlapping subset of the domains with the maximum sum of probabilities. This should be a straightforward recurrence, and if necessary, it can be implemented efficiently by dynamic programming. (Alternatively, you could interpret "pruning" to mean shrinking the domain matches until they no longer overlap, but I'm not sure matching half a domain makes a lot of sense.)

To elaborate a bit on the recurrence, it will go something like this:

max_subset [x1, x2, x3, ...] = max (
    x1 ++ max_subset [those of x2, x3, ... that don't overlap x1],
    max_subset [x2, x3, ...])

I.e. you either include the first element, and add it to the maximal subset of all elements that don't overlap it, or you take the maximal subset of all elements but the first element.
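
As a rough illustration, the recurrence could be written as a dynamic program in Python along the following lines. This is only a sketch under several assumptions that are not from the HHBLITS documentation: the names `HITS`, `weight` and `domain_architecture` are made up for this example, the hits are typed in by hand from the hit list in the question rather than parsed from the output file, and each hit is scored as Probab/100 times the length of its query range, i.e. the per-residue posterior is treated as 1 because the confidence line is not shown in the excerpt. The optional `min_prob` cutoff is likewise an assumption, added so that low-probability hits can be dropped before resolving overlaps.

```python
from bisect import bisect_right

# (Pfam name, Probab in percent, query start, query end),
# typed in by hand from the hit list in the question.
HITS = [
    ("DUF525", 99.3, 311, 409),
    ("F-box", 85.7, 1, 47),
    ("SMI1_KNR4", 83.5, 114, 140),
    ("F-box-like", 78.7, 10, 47),
    ("UEV", 38.5, 269, 312),
    ("Glyco_transf_43", 30.9, 247, 293),
    ("FlhD", 23.0, 168, 206),
    ("UPF0278", 18.5, 112, 153),
    ("Vps4_C", 18.0, 103, 126),
    ("DUF2761", 15.6, 376, 391),
]


def weight(hit):
    """Assumed score: Probab/100 times the number of query residues covered."""
    _name, prob, start, end = hit
    return prob / 100.0 * (end - start + 1)


def domain_architecture(hits, min_prob=0.0):
    """Return the names of the best-scoring non-overlapping hits, N- to C-terminal."""
    # Sort by query end so that all hits compatible with hit i form a prefix.
    hits = sorted((h for h in hits if h[1] >= min_prob), key=lambda h: h[3])
    ends = [h[3] for h in hits]
    # best[i] = (score, chosen hits) using only the first i hits.
    best = [(0.0, [])]
    for i, hit in enumerate(hits):
        # k = number of earlier hits that end before this hit starts.
        k = bisect_right(ends, hit[2] - 1, 0, i)
        take = best[k][0] + weight(hit)   # include this hit
        skip = best[i][0]                 # leave this hit out
        if take > skip:
            best.append((take, best[k][1] + [hit]))
        else:
            best.append(best[i])
    return [h[0] for h in best[-1][1]]


if __name__ == "__main__":
    print(" - ".join(domain_architecture(HITS, min_prob=50.0)))
```

With the assumed cutoff of 50, this prints `F-box - SMI1_KNR4 - DUF525`: F-box-like is dropped because it overlaps the higher-scoring F-box hit. Without a cutoff, low-probability hits such as FlhD are also kept wherever they do not conflict with anything, so some threshold on Probab is probably sensible. To follow the author's suggestion more closely, `weight` would have to sum the per-residue confidence values from the alignment blocks instead of assuming a posterior of 1.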