Question

Sequence search of a single variable domain against Pfam V-set returns no results?

4.7 years ago

baverso • 0

Hello everyone,

I have the following variable domain, which I clipped from the heavy chain of a full immunoglobulin.

> 4KQ3:H|PDBID|CHAIN|SEQUENCE
KKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGSIIPWFGTTNYAQKFQGRVTITADESTSTAY
MELSSLRSEDTAVYYCARDSEYYFDHWGQGTLVTV

I verified that the sequence above matches Pfam's V-set family by running hmmscan, which reports an E-value of 6.3e-14 for the domain match (results here).

Given the above, I searched for homologs of this sequence by running jackhmmer against UniRef100 on my local machine, which succeeded (535,422 targets above the default threshold).

My problem, and the reason I am writing, arises when I search against Pfam's V-set family alignments: the search returns no results. Examples of the V-set alignments are provided below so you can see what I'm searching against:

A snippet of the V-set Full alignment in FASTA format (just to serve as an example):

>..........................................................sl
e---TA.VQI.K.PG.E.T.L..S.L.S.C..........RGS.................
GF.....-T....F......S.........S.Y........Q........V.......H.
.W..I.RQ.QTGK.......................PLE..WM...G..Y..VYTDg...
...........................sGDGY.A..A..S....fK....G..R..T...
.K....ITK.........D.N..S.I.S.......M...A....Y..L..K...L.SG..
..V.T...A.E..D..S..A.V..Y..Y.CA.......-............-------..
.---..-------rraq...........................................

My thinking was that HMMER fails to produce results because this target database has too many deletion/insertion states. So instead I searched against the Representative Proteomes (15%) alignment. Here is a random snippet as an example of what that search space looks like:

>----V-VTVTAQEPAYL-HC---------RIP-----------EG--------SNHMV-A
--WTRASDQ--------------------A--L-L-TA----------------------
------------GQHSFTS---DPRFQVSR-------KSDTDW-I--LIL-RR--AD-L-
S-D-TG-CYLCE-----V-----------NTE----------------

This dataset has significantly fewer gaps, yet my search identified only 2 hits above the default threshold. That leads me to believe that neither the sparsity of the data nor the gaps/insertions/deletions are the issue.

Questions

Does my approach of searching for homologs of the variable sequence against Pfam's V-set seem logical? hmmscan did confirm that this sequence belongs to it. Why did the search not return more hits?

I have read the HMMER documentation and the Pfam papers, but I can't help wondering whether I am using these target sequence datasets incorrectly. How are the Pfam alignments used in research?

For further context, my research interest is in building an evolution-based set of homologs of variable domains in order to infer conserved portions of the sequences. Thank you for any help, guidance, or opinions!

pfam sequence alignment hmm • 882 views

Updated 4.7 years ago by GenoMax • written 4.7 years ago by baverso

score 2 · Accepted Answer · 2019-08-24

Most (all?) sequence search tools expect a database of ungapped FASTA sequences in all capital letters.

See here how to remove gaps from your alignments. The HMMER suite includes a utility called esl-reformat that may be useful for converting your sequences from the formats offered by Pfam (SELEX, Stockholm) into FASTA with all capital letters.

Here are the first two lines of Pfam's V-set alignment saved in SELEX format:

CLM2_MOUSE/19-124          TGPGS.VSGYVGGSLRVQC.....QYS...PS..Y..KGYMKYWCRGPHD.......TTCKTIVETD.....GSEKEKR.SGPVSIRD....HASNSTITVIMED.LSEDNAGSYWCK....I......QTSFIWDSWSRDPSVSVR
CLM5_MOUSE/23-126          TGPEE.VSGQEQGSLTVQC.....RYS...SY..W..KGYKKYWCRGVPQ.......RSCDILVETD.....KSEQLVK.KNRVSIRD....NQRDFIFTVTMED.LRMSDAGIYWCG...........ITKGGPD.PMFKVNVNID

After running this command:

esl-reformat -u -o v-set.fas fasta v-set.slx

the first two sequences in v-set.fas look like this:

>CLM2_MOUSE/19-124
TGPGSVSGYVGGSLRVQCQYSPSYKGYMKYWCRGPHDTTCKTIVETDGSEKEKRSGPVSI
RDHASNSTITVIMEDLSEDNAGSYWCKIQTSFIWDSWSRDPSVSVR
>CLM5_MOUSE/23-126
TGPEEVSGQEQGSLTVQCRYSSYWKGYKKYWCRGVPQRSCDILVETDKSEQLVKKNRVSI
RDNQRDFIFTVTMEDLRMSDAGIYWCGITKGGPDPMFKVNVNID

Searching against this set of ungapped FASTA sequences should produce the results you were expecting.
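Before rerunning the search, it can be worth confirming that no gap characters survived the conversion. The following is a minimal sketch, assuming a POSIX shell with awk; the function name count_gapped_lines is hypothetical and is not part of HMMER.

```shell
# Hypothetical helper (not part of HMMER): count the sequence lines in a
# FASTA file that still contain the gap characters "." or "-".
count_gapped_lines() {
  awk '
    # Header lines (starting with ">") are ignored; they may contain "-".
    !/^>/ && /[-.]/ { n++ }
    # n + 0 forces a numeric 0 when no gapped line was seen.
    END { print n + 0 }
  ' "$1"
}
```

Running count_gapped_lines v-set.fas should print 0 for a file that is ready to be used as a search database.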

Another similar tool is a Perl script reformat.pl in hh-suite.
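If neither tool is at hand, the same gap stripping can be approximated with plain awk. This is only a sketch, assuming the input is an aligned FASTA file in which "." and "-" mark gaps and lowercase letters are real residues (as in the Pfam snippets in the question); the function name strip_gaps is hypothetical.

```shell
# Hypothetical helper: read aligned FASTA on stdin, write ungapped,
# uppercase FASTA on stdout.
strip_gaps() {
  awk '
    # Header lines pass through unchanged.
    /^>/ { print; next }
    {
      # Remove the gap characters "-" and ".".
      gsub(/[-.]/, "")
      # Skip lines that consisted only of gaps; uppercase the rest.
      if (length($0) > 0)
        print toupper($0)
    }
  '
}
```

For example, strip_gaps < aligned.fasta > ungapped.fas (both file names are placeholders). The sequence lines come out with uneven lengths, which FASTA parsers generally accept.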