Sequence search of a single variable domain against Pfam V-set returns no results?
4.7 years ago
baverso • 0

Hello everyone,

I have the following variable domain, which I clipped from the heavy chain of a full immunoglobulin.

>4KQ3:H|PDBID|CHAIN|SEQUENCE
KKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGSIIPWFGTTNYAQKFQGRVTITADESTSTAY
MELSSLRSEDTAVYYCARDSEYYFDHWGQGTLVTV

I verified that this sequence matches Pfam's V-set family by running hmmscan, which reports an E-value of 6.3e-14 for the V-set domain match (results here).

Given the above, I searched for homologs of this sequence by running jackhmmer against UniRef100 on my local machine, which succeeded (535,422 targets above the default threshold).

My problem, and the reason I am writing, arises when I run the same search against Pfam's V-set family alignments: the search returns no results. An example of the V-set alignments is provided below so you can see what I'm searching against.

A snippet of the V-set full alignment in FASTA format (just to serve as an example):

>..........................................................sl
e---TA.VQI.K.PG.E.T.L..S.L.S.C..........RGS.................
GF.....-T....F......S.........S.Y........Q........V.......H.
.W..I.RQ.QTGK.......................PLE..WM...G..Y..VYTDg...
...........................sGDGY.A..A..S....fK....G..R..T...
.K....ITK.........D.N..S.I.S.......M...A....Y..L..K...L.SG..
..V.T...A.E..D..S..A.V..Y..Y.CA.......-............-------..
.---..-------rraq...........................................

My thinking was that HMMER fails to produce results because this target database has too many deletion/insertion states. I therefore searched against the Representative Proteomes (15%) alignment instead. Please see the random snippet below as an example of what this search space looks like:

>----V-VTVTAQEPAYL-HC---------RIP-----------EG--------SNHMV-A
--WTRASDQ--------------------A--L-L-TA----------------------
------------GQHSFTS---DPRFQVSR-------KSDTDW-I--LIL-RR--AD-L-
S-D-TG-CYLCE-----V-----------NTE----------------

This dataset has significantly fewer gaps, yet my search identified only 2 hits above the default threshold, which leads me to believe that neither the sparsity of the data nor the gaps/insertions/deletions are the issue.

Questions

Does my approach of searching for homologs of the variable-domain sequence against Pfam's V-set seem logical? hmmscan did confirm that this sequence belongs to it. Why did the search not return more hits?

While I have read the HMMER documentation and the papers on Pfam, I can't help but wonder: am I using these target sequence datasets incorrectly? How are the Pfam alignments used in research?

For further context, my research interest is in building an evolution-based set of homologs of variable domains in order to infer conserved portions of the sequences. Thank you for any help, guidance, or opinions!

pfam sequence alignment hmm • 882 views
4.7 years ago
Mensur Dlakic ★ 27k

Most (all?) sequence search tools expect a database of ungapped FASTA sequences in all capital letters.
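To check whether a target file already meets that expectation, something like the following Python sketch could be used. It is only an illustration: the function names are made up here, and it assumes that '.' and '-' are the only gap characters in the file, as in the snippets shown in the question.

```python
def fasta_records(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_needing_cleanup(text):
    """Return headers of records containing gap characters or lower-case letters.

    Assumption: '.' and '-' are the only gap symbols present.
    """
    bad = []
    for header, seq in fasta_records(text):
        has_gap = any(c in ".-" for c in seq)
        has_lower = any(c.islower() for c in seq)
        if has_gap or has_lower:
            bad.append(header)
    return bad
```

If records_needing_cleanup() returns a non-empty list for your target file, the file is still an alignment rather than a plain sequence database and should be reformatted before searching.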

See here for how to remove gaps from your alignments. The HMMER suite includes a utility called esl-reformat that may be useful for converting your sequences from the formats offered by Pfam (SELEX, Stockholm) into FASTA with all capital letters.

Here are the top 2 lines from Pfam's V-set alignment saved in SELEX format:

CLM2_MOUSE/19-124          TGPGS.VSGYVGGSLRVQC.....QYS...PS..Y..KGYMKYWCRGPHD.......TTCKTIVETD.....GSEKEKR.SGPVSIRD....HASNSTITVIMED.LSEDNAGSYWCK....I......QTSFIWDSWSRDPSVSVR
CLM5_MOUSE/23-126          TGPEE.VSGQEQGSLTVQC.....RYS...SY..W..KGYKKYWCRGVPQ.......RSCDILVETD.....KSEQLVK.KNRVSIRD....NQRDFIFTVTMED.LRMSDAGIYWCG...........ITKGGPD.PMFKVNVNID

After running this command:

esl-reformat -u -o v-set.fas fasta v-set.slx

the first two sequences in v-set.fas look like this:

>CLM2_MOUSE/19-124
TGPGSVSGYVGGSLRVQCQYSPSYKGYMKYWCRGPHDTTCKTIVETDGSEKEKRSGPVSI
RDHASNSTITVIMEDLSEDNAGSYWCKIQTSFIWDSWSRDPSVSVR
>CLM5_MOUSE/23-126
TGPEEVSGQEQGSLTVQCRYSSYWKGYKKYWCRGVPQRSCDILVETDKSEQLVKKNRVSI
RDNQRDFIFTVTMEDLRMSDAGIYWCGITKGGPDPMFKVNVNID

Searching against this set of ungapped FASTA sequences should produce the results you were expecting.

Another similar tool is a Perl script reformat.pl in hh-suite.
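If neither tool is at hand, the conversion itself is simple enough to sketch in Python. This is a rough illustration rather than a replacement for esl-reformat: the function names are invented for this example, it assumes each data line holds a name followed by the aligned sequence (as in the two SELEX lines above), and it treats only '.' and '-' as gaps.

```python
def strip_gaps(aligned):
    """Remove '.' and '-' gap characters and upper-case the residues."""
    return aligned.replace(".", "").replace("-", "").upper()


def selex_to_fasta(selex_text, width=60):
    """Convert simple SELEX-style lines into ungapped, upper-case FASTA.

    Assumption: every data line is "<name> <aligned sequence>"; comment
    lines starting with '#' or '%' and anything else are skipped.
    Blocks that repeat a name are concatenated in order.
    """
    seqs = {}
    for line in selex_text.splitlines():
        if line.startswith(("#", "%")):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue
        name, aligned = fields
        seqs[name] = seqs.get(name, "") + strip_gaps(aligned)

    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        # Wrap the sequence at `width` residues per line.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

For example, selex_to_fasta(open("v-set.slx").read()) would return the FASTA text as a string, which can then be written to a file and used as the search target.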
