Find Homologous Proteins That Contain A Domain Of Interest, And Build A Phylogeny
8.8 years ago

Hi biostars,

I want to study a particular domain of a protein (in my case, a fungal protein). Basically, what I need is:

  • A high-quality multiple sequence alignment (MSA) of this domain with as many homologs as possible, including ones that are as distant as possible.
  • The phylogenetic relationship between such homologs, also with decent quality.

Here is what I've done so far.

  1. I query the domain in a protein-search tool such as BLAST or HHBLITS.
  2. I fetch the full sequences of the hits.
  3. I build a phylogeny using a multiple sequence alignment of the full sequences: Phylogeny done.
  4. I use JackHMMER to iteratively build an HMM profile of my domain, then align the full sequences with this profile, and eventually chop the region of interest: Alignment done.
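For the last step ("chop the region of interest"), here is a minimal sketch of one way to cut the domain out of an alignment of full-length sequences. It assumes the alignment is available as aligned FASTA text and that the domain boundaries are known as residue positions in one reference sequence; the function names (`read_fasta`, `residue_to_column`, `chop_domain`) and the toy sequences are made up for illustration and are not part of any of the tools mentioned above.

```python
def read_fasta(text):
    """Parse (aligned) FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def residue_to_column(aligned_seq, residue_index):
    """Map a 1-based ungapped residue position to a 0-based alignment column."""
    seen = 0
    for col, char in enumerate(aligned_seq):
        if char not in "-.":
            seen += 1
            if seen == residue_index:
                return col
    raise ValueError("residue index beyond sequence length")


def chop_domain(alignment, ref_name, start, end):
    """Keep only the columns spanning reference residues start..end (1-based, inclusive)."""
    ref = alignment[ref_name]
    first = residue_to_column(ref, start)
    last = residue_to_column(ref, end)
    return {name: seq[first:last + 1] for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences, for illustration only).
example = """>query
MK-TAYIAKQR
>hit1
MKATAY-AKQR
>hit2
-KATAYIAK-R
"""

aln = read_fasta(example)
domain = chop_domain(aln, "query", 3, 6)
for name, seq in domain.items():
    print(name, seq)
```

Working in reference-residue coordinates rather than raw column numbers means the boundaries stay valid if the alignment is rebuilt with more sequences and the columns shift.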

I don't know if this makes sense, and I have to admit that, since this is not my area of expertise, I'm having a hard time establishing my working set of sequences. What are your standards? How would you address this problem?


phylogeny domain • 3.2k views

If you are looking for (full-length) homologous proteins (rather than just the conserved domain), _and_ you don't have a larg(ish) set of curated examples, you _may_ be better off using sensitive alignments instead of building HMMs. If you want to experiment with transalign for this, please drop me an email.

8.8 years ago
Josh Herr 5.7k

I see this question as an extension of the question you just asked and it sounds like you are on the right track.

One comment on your order of operations above: I would never compute a phylogenetic tree until I have an alignment and, in my opinion, a curated alignment. I also think that, during curation, every alignment should be edited by eye.

Alignment programs are far from perfect. People make mistakes when aligning by hand too; computers are a lot faster, but they are also error-prone. I see a lot of pipelines designed to go directly from BLAST to CLUSTAL W or MUSCLE, then to automatic trimming of sequences (what you are calling "chop the region of interest", correct?), and then straight into constructing a phylogeny without anyone looking at the alignment. I think this is a recipe for disaster.
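As an aid to inspecting an alignment by eye (not a replacement for it), a small script can point out which columns deserve a closer look. The sketch below computes the fraction of gaps in each column and lists the columns above a threshold; the function names, the toy sequences, and the 0.5 threshold are arbitrary assumptions for illustration, not a recommended standard.

```python
def gap_fractions(seqs):
    """Return the fraction of gap characters in each alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(s[i] in "-." for s in seqs) / len(seqs)
        for i in range(length)
    ]


def flag_columns(seqs, threshold=0.5):
    """Return 0-based indices of columns whose gap fraction exceeds the threshold."""
    return [i for i, frac in enumerate(gap_fractions(seqs)) if frac > threshold]


# Toy aligned sequences (hypothetical, for illustration only).
seqs = ["MK-TAY", "MKATAY", "-K-TA-", "MK-TAY"]

print(gap_fractions(seqs))
print(flag_columns(seqs))
```

The flagged columns are only candidates for review: a gappy column may be a genuine insertion in a few lineages rather than an alignment error, which is exactly the kind of call that needs a person looking at the alignment.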

With that said, I think you really need to grasp the diversity in your homologous domain in order to know how to inspect the alignment and choose which sequences show synteny within that alignment. Some of your BLAST hits to your domain of interest could be due to chance (have you decided on an E-value cutoff?) or to convergent evolution. Also, you study fungi, like I do, and we are becoming more and more aware of how prevalent horizontal gene transfer events are in fungi. The process of understanding homology is subtle and time-consuming. If you need to read up on homology and phylogenetic theory, I suggest the new book by Baum & Smith: Tree Thinking.
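To make the E-value cutoff concrete, here is a minimal sketch that filters BLAST tabular output (the standard 12-column `-outfmt 6` layout, where column 11 is the E-value and column 12 the bit score) and keeps the best-scoring hit per subject sequence. The `1e-5` cutoff, the function name `filter_hits`, and the example rows are assumptions for illustration; the right threshold depends on the database size and on how divergent the homologs you want are.

```python
def filter_hits(blast_tsv, max_evalue=1e-5):
    """Keep, per subject, the best-scoring hit with E-value <= max_evalue.

    Expects BLAST tabular output with the standard 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Returns {subject_id: (evalue, bitscore)}.
    """
    best = {}
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject = fields[1]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue  # likely a chance hit at this cutoff
        if subject not in best or bitscore > best[subject][1]:
            best[subject] = (evalue, bitscore)
    return best


# Hypothetical hits, for illustration only.
rows = [
    ("dom", "protA", "45.0", "120", "60", "2", "1", "120", "30", "149", "1e-30", "110"),
    ("dom", "protB", "28.0", "80", "55", "3", "10", "90", "200", "279", "0.5", "25"),
    ("dom", "protA", "40.0", "60", "35", "1", "1", "60", "300", "359", "1e-8", "50"),
    ("dom", "protC", "33.0", "110", "70", "4", "5", "115", "12", "121", "2e-6", "48"),
]
blast_tsv = "\n".join("\t".join(row) for row in rows)

hits = filter_hits(blast_tsv)
for subject in sorted(hits):
    print(subject, hits[subject])
```

Passing an E-value filter only makes chance similarity less likely; it says nothing about convergent evolution or horizontal transfer, so the surviving hits still need the homology checks described above.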

My standards are: Determine if the sequences are truly homologous. My answer here is long because this is not a trivial exercise.

My strategy to address the problem is to first take my domain of interest and BLAST it against my database of interest (are you just interested in the fungi or in all organisms? For me this may mean all sequenced fungal genomes, or all the sequences in NCBI, or a single genus). I then align the hits using one of the many available programs. I edit my alignment in a text editor to check for obvious homologies, misaligned regions, etc. I construct a phylogenetic tree in lots of different ways, using lots of different methods and programs (see Joe Felsenstein's list here and my recent opinion here), according to how much time I have and the stage of the project. I look at the tree, maybe do another BLAST, add sequences, remove sequences, look for pseudogenes, clades showing long branch attraction, etc., and continue the process. A phylogeny is never perfect; it's an estimate and a hypothesis, but some phylogenies are better than others.

8.8 years ago
cdsouthan ★ 1.8k

My strategy, for starters at least, is somewhat orthogonal to Josh's. If InterPro does not have your domain (what is it, BTW?), it might not be worth bothering with in the first place. Assuming it does, the (domain) homology relationships are pre-cooked for you by the InterProScan pipeline in both UniProt and Ensembl (but not formally synced), so you can just pull them out (even via SRS). Analogously, you can find pre-cooked family (full-length) relationships in Pfam, TreeFam and Ensembl (and other sources). Checking these first could save a lot of work, and you can then move on to hand-crafting alignments and bringing in newer members as Josh indicates. Crucial for you is to determine whether the domain has been "shuffled" between non-homologous families and/or whether the families include HGT events.

