Question: Aligning Copies Of A Protein Domain To A Pfam Model (HMM)
UnivStudent wrote:

I'm wondering if anyone out there has any experience aligning copies of a protein domain to its Pfam domain using the HMM. Currently I'm aligning large numbers (1 - 100k) of sequences with Clustal Omega and supplying the Pfam HMM via the --hmm-in parameter to guide the alignment. However, I'm unhappy with the resulting alignments, as they have a large number of gap positions.

There is no gap opening penalty to set explicitly in Clustal Omega, but from the Clustal Omega README it appears that the way to reduce gaps is to increase the number of alignment iterations. Currently I've been using the --iter parameter, but I'm wondering if anyone has had positive results setting the max HMM iterations (--max-hmm-iterations) and the max guide tree iterations (--max-guidetree-iterations) separately instead?
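For anyone who wants to compare the iteration settings side by side, here is a minimal sketch of how the command lines could be assembled. The flags are the ones named above plus -i, -o, --threads and --force from the Clustal Omega help; the file names and the helper function are made up for illustration. How the two max-* limits interact with --iter is my reading of the README, so treat it as an assumption and check it against your version.

```python
from typing import List, Optional


def build_clustalo_cmd(
    seqs: str,
    hmm: str,
    out: str,
    combined_iters: Optional[int] = None,
    guidetree_iters: Optional[int] = None,
    hmm_iters: Optional[int] = None,
    threads: int = 1,
) -> List[str]:
    """Build a clustalo argument list (hypothetical helper, not part of Clustal Omega)."""
    cmd = [
        "clustalo",
        "-i", seqs,                 # unaligned input sequences
        f"--hmm-in={hmm}",          # external Pfam HMM used to guide the alignment
        "-o", out,
        f"--threads={threads}",
        "--force",                  # overwrite an existing output file
    ]
    if combined_iters is not None:
        cmd.append(f"--iter={combined_iters}")
    if guidetree_iters is not None:
        cmd.append(f"--max-guidetree-iterations={guidetree_iters}")
    if hmm_iters is not None:
        cmd.append(f"--max-hmm-iterations={hmm_iters}")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; print the commands rather than running them.
    settings = [
        {"combined_iters": 3},
        {"combined_iters": 3, "guidetree_iters": 1},
        {"combined_iters": 3, "hmm_iters": 1},
    ]
    for i, kwargs in enumerate(settings):
        cmd = build_clustalo_cmd("domain_seqs.fa", "PF00001.hmm", f"aln_{i}.fa", **kwargs)
        print(" ".join(cmd))
```

Each printed command can be passed to subprocess.run (or pasted into a shell) so that the resulting alignments differ only in the iteration settings.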

If I had to summarize my question I'm basically asking:

  1. Are there any other programs out there to do external profile alignment?
  2. Does anyone know the optimal way to use iterations in Clustal Omega to reduce the number of gaps in an alignment? Is it the guide tree iterations that are causing the gaps to be introduced?
  3. How would you evaluate how much alignments improve with iterations other than visual inspection? Does anyone know more objective criteria to benchmark with?
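On question 3, one simple number to track besides visual inspection is how gappy each alignment is. The sketch below assumes the alignment is written as aligned FASTA with '-' or '.' as gap characters; the function names and the 50% column threshold are arbitrary choices for illustration, not an established benchmark. It only measures gappiness, not whether the alignment is correct.

```python
from typing import Dict, List

GAP_CHARS = "-."  # assumption: these two characters mark gaps


def parse_aligned_fasta(text: str) -> Dict[str, str]:
    """Parse aligned FASTA text into {name: aligned_sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gap_summary(alignment: Dict[str, str], column_threshold: float = 0.5) -> Dict[str, float]:
    """Return overall gap fraction and the fraction of columns that are mostly gaps."""
    seqs = list(alignment.values())
    if not seqs:
        raise ValueError("empty alignment")
    n_cols = len(seqs[0])
    if n_cols == 0 or any(len(s) != n_cols for s in seqs):
        raise ValueError("sequences must be non-empty and of equal aligned length")

    total_gaps = 0
    gappy_columns = 0
    for col in range(n_cols):
        gaps_here = sum(1 for s in seqs if s[col] in GAP_CHARS)
        total_gaps += gaps_here
        if gaps_here / len(seqs) > column_threshold:
            gappy_columns += 1

    return {
        "n_columns": n_cols,
        "gap_fraction": total_gaps / (n_cols * len(seqs)),
        "gappy_column_fraction": gappy_columns / n_cols,
    }


if __name__ == "__main__":
    toy = ">a\nAC-T\n>b\nA--T\n>c\nACGT\n"
    print(gap_summary(parse_aligned_fasta(toy)))
```

Running this on the output of each iteration setting gives a way to see whether more iterations actually reduce gaps on the same input.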

Thanks in advance Biostars!

written 3.7 years ago by UnivStudent

I don't have experience with this, but T-Coffee gives quite good alignments. Its manual suggests that profile alignment, like Clustal's, is possible; see the 'Aligning Profiles' section. It has a multitude of options, including an alignment evaluation option.

written 3.7 years ago by microfuge

It looks like instead of the HMM you'd have to input the seed alignment used to create the HMM, but it's worth a shot.

UPDATE: This method does not scale well, and can't be used with large numbers of sequences (~60k) even with multiple cores and lots of memory.

written 3.7 years ago by UnivStudent