Hello everyone,
I have the following variable domain, which I clipped from the heavy chain of a full immunoglobulin.
> 4KQ3:H|PDBID|CHAIN|SEQUENCE
KKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGSIIPWFGTTNYAQKFQGRVTITADESTSTAY
MELSSLRSEDTAVYYCARDSEYYFDHWGQGTLVTV
I verified that this sequence matches Pfam's V-set family by running hmmscan, which reports an E-value of 6.3e-14 for the domain match (results here).
I then searched for homologs of this sequence by running jackhmmer against UniRef100 on my local machine, which worked (535,422 targets above the default threshold).
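For anyone who wants to compare hit counts between searches, here is a minimal sketch of how they could be tallied, assuming each search was run with HMMER's `--tblout` option and that the per-target table has the usual layout (comment lines starting with `#`, whitespace-separated columns, full-sequence E-value in the fifth column). The function names and the `max_evalue` parameter are illustrative choices, not part of HMMER, and this is not necessarily how the count above was obtained.

```python
from typing import Iterable, List, Tuple


def parse_tblout(lines: Iterable[str]) -> List[Tuple[str, float, float]]:
    """Return (target name, full-sequence E-value, bit score) per hit.

    Assumes the standard --tblout column order: target name, accession,
    query name, accession, E-value, score, bias, ... description.
    """
    hits = []
    for line in lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The first 18 columns are fixed; the rest is free-text description.
        fields = line.split(None, 18)
        if len(fields) < 18:
            continue  # not a complete data row
        hits.append((fields[0], float(fields[4]), float(fields[5])))
    return hits


def count_hits(lines: Iterable[str], max_evalue: float) -> int:
    """Count hits whose full-sequence E-value is at or below max_evalue."""
    return sum(1 for _, evalue, _ in parse_tblout(lines) if evalue <= max_evalue)
```

It could be used as `count_hits(open("search.tbl"), max_evalue=0.01)`, where `search.tbl` is a hypothetical output file name and the cutoff is whatever threshold you want to compare at.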
My problem, and the reason I am writing, is what happens when I run the same search against Pfam's V-set family alignments: the search returns no results. Examples of the V-set alignments are provided below so you can see what I'm searching against:
A snippet of the V-set Full alignment in FASTA format (just to serve as an example):
>..........................................................sl
e---TA.VQI.K.PG.E.T.L..S.L.S.C..........RGS.................
GF.....-T....F......S.........S.Y........Q........V.......H.
.W..I.RQ.QTGK.......................PLE..WM...G..Y..VYTDg...
...........................sGDGY.A..A..S....fK....G..R..T...
.K....ITK.........D.N..S.I.S.......M...A....Y..L..K...L.SG..
..V.T...A.E..D..S..A.V..Y..Y.CA.......-............-------..
.---..-------rraq...........................................
My thinking was that HMMER fails to produce results because this target database has too many deletion/insertion states. So instead, I searched against the Representative Proteomes (15%) alignment. Here is a random snippet as an example of what this search space looks like:
>----V-VTVTAQEPAYL-HC---------RIP-----------EG--------SNHMV-A
--WTRASDQ--------------------A--L-L-TA----------------------
------------GQHSFTS---DPRFQVSR-------KSDTDW-I--LIL-RR--AD-L-
S-D-TG-CYLCE-----V-----------NTE----------------
This dataset has significantly fewer gaps, yet my search still identified only 2 hits above the default threshold. That leads me to believe that neither the sparsity of the data nor the gaps/insertions/deletions are the issue.
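To put a number on "fewer gaps" rather than judging the two snippets above by eye, a small script along these lines could report the gap fraction of each aligned record. This is only a sketch under some assumptions: the alignment files are aligned FASTA, both `-` and `.` count as gap characters, and lowercase letters are residues. The helper names are mine. The `degap` function additionally shows how the aligned records could be turned back into plain sequences, in case one wants to test the search against an ungapped copy of the same data.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: both '-' and '.' mark gaps in these aligned FASTA files.
GAP_CHARS = frozenset("-.")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_fraction(aligned: str) -> float:
    """Fraction of alignment columns in this record that are gap characters."""
    if not aligned:
        return 0.0
    return sum(ch in GAP_CHARS for ch in aligned) / len(aligned)


def degap(aligned: str) -> str:
    """Drop gap characters and uppercase the remaining residues."""
    return "".join(ch for ch in aligned if ch not in GAP_CHARS).upper()
```

Running `gap_fraction` over every record of both files and comparing the averages would show how different the two search spaces really are in gap content.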
Questions
Does my approach of searching for homologs of the variable-domain sequence against Pfam's V-set alignments seem logical? hmmscan did confirm that this sequence belongs to the family. Why did the search not return more hits?
While I have read the HMMER documentation and the papers on Pfam, I can't help but wonder: am I using these target sequence datasets incorrectly? How are the Pfam alignments used in research?
For further context, my research interest is in building an evolution-based set of variable-domain homologs in order to infer conserved portions of the sequences. Thank you for any help, guidance, or opinions!