Question: Finding novel domains in a group of proteins
2.6 years ago, crouch.k wrote:

I have been posed the following problem by a collaborator and I could use some advice on how to approach it:

The collaborator works with a non-model organism. He is interested in how a particular protein binds to other proteins. He has done a pulldown with his protein followed by Mass Spec, and has given me the gene ids from the mass spec hits.

In model organisms, there is a well-characterised domain for binding to his protein of interest, which has an entry in Pfam. I have not been able to find this domain in any of his hits, anywhere else in the genome of his organism, or indeed anywhere else in the wider clade. So the question is: is there some other domain that facilitates interactions with his protein of interest in this organism?

What I have are full-length amino acid sequences for around 100 proteins from the same organism. There is no indication of where a potential domain might be in each protein. It is probable that some (even many) of the proteins don't bind the protein of interest directly, as they may be part of a larger complex or just sticky.

I am used to looking for divergent members of a gene family across organisms or starting with a domain that I have some structure for, but none of the usual approaches are working here - I don't feel like I have enough information to start with.

I feel like I need some way to narrow down the search. I have tried several variations on all vs all local alignments (blastp, psiblast, jackhmmer) in an attempt to find recurrent hits but nothing useful has come out of this. Out of desperation I tried a very naive MSA approach too, which unsurprisingly didn't yield anything.

Does anyone have any bright ideas?

2.6 years ago, lieven.sterck (VIB, Ghent, Belgium) wrote:

One thing you might be able to try is the MEME software suite. It looks for 'conserved motifs' in unaligned sequences, both nucleotide and protein.
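
As a crude pre-screen before running a proper motif finder, one could count which exact k-mers recur across many of the proteins. To be clear, this is not what MEME does (MEME fits probabilistic motif models and tolerates substitutions); it is only a minimal sketch, and the function name, parameters and toy sequences below are made up for illustration.

```python
from collections import defaultdict


def recurrent_kmers(proteins, k=6, min_proteins=3):
    """Return {kmer: set of protein ids} for exact k-mers seen in >= min_proteins proteins.

    `proteins` maps a protein id to its amino acid sequence.
    """
    seen = defaultdict(set)
    for pid, seq in proteins.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(pid)
    return {kmer: ids for kmer, ids in seen.items() if len(ids) >= min_proteins}


# Toy sequences, invented for this example only.
toy_proteins = {
    "p1": "MKTAYIAKQRWLDEVGHH",
    "p2": "GGSWLDEVGPLLA",
    "p3": "AAWLDEVGTTTKQ",
    "p4": "MNNNPPPQQQ",
}

kmer_hits = recurrent_kmers(toy_proteins, k=6, min_proteins=3)
for kmer, ids in sorted(kmer_hits.items()):
    print(kmer, sorted(ids))
```

Exact matching will miss anything with even one substitution, so an empty result here says very little; a non-empty one just tells you where to look first.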

Just another small point: I feel the MSA approach is not that naive ;) It is how many of these conserved domains are identified and constructed.


Thanks! Yes I will have a play with MEME.

What I meant by the naivety of MSA is that at the moment I don't have any idea what a potential domain might look like or where it could be. I think I really need to try to refine domain boundaries first, even if not precisely, and then start looking at MSAs. In any event, the output I got with what I had was a mess!

Reply written 2.6 years ago by crouch.k
2.6 years ago, alastair.skeffington wrote:

I've recently been dealing with a similar problem: working with a non-model organism we have identified a set of proteins that bind a particular structure, and wanted to look for novel domains that might be responsible.

I agree with lieven.sterck that the MEME suite is a good place to start. Motif finding is a bit of a black art (you get rather different results with different programs), but another program I found very helpful was motifx, which is now also available as an R package.

All-vs-all dotplots are also a good way to quickly see whether there are regions of similarity in your protein set, and you can quickly change parameters such as the window size and the similarity matrix to check you're not missing something obvious.
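
For anyone unfamiliar with the idea, a dotplot for one pair of sequences can be sketched in a few lines. This is a minimal illustration, assuming plain identity scoring instead of a substitution matrix such as BLOSUM62; the function names and toy sequences are hypothetical, and a real tool would loop over all pairs and draw a proper plot.

```python
def dotplot(seq_a, seq_b, window=5, min_matches=4):
    """Return (i, j) window starts where seq_a[i:i+window] and seq_b[j:j+window]
    agree at >= min_matches positions (identity scoring only)."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dots as text: rows follow seq_a, columns follow seq_b."""
    points = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in points else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    )


# Toy sequences, invented for this example only.
seq_a = "AAAWLDEVGKKK"
seq_b = "TTWLDEVGSS"

exact_dots = dotplot(seq_a, seq_b, window=5, min_matches=5)
print(render(seq_a, seq_b, exact_dots))
```

A shared region shows up as a diagonal run of dots. Lowering `min_matches` or changing `window` is the quick parameter check described above.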

I've actually written my own solution to this problem, which I hope to make available on Cyverse by the end of the year. It involves splitting the protein sequences into short windows, generating a similarity matrix, clustering based on this matrix using a method which can pull clusters out of noise, and then refining the domain boundaries. It worked extremely well in our particular case, and I'd be happy to run your data through it if you want to get in touch.
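
To give a rough feel for the window-and-cluster idea (this is not the tool described above, whose details are not given here), a toy version might look like the following. It assumes identity as the similarity measure and uses simple threshold linking with a "must span several proteins" filter as a stand-in for a noise-robust clustering method; it does no boundary refinement, and all names and parameters are made up.

```python
from itertools import combinations


def sliding_windows(proteins, size=6, step=1):
    """Return a list of (protein_id, start, subsequence) for every window."""
    return [
        (pid, start, seq[start:start + size])
        for pid, seq in proteins.items()
        for start in range(0, len(seq) - size + 1, step)
    ]


def identity(a, b):
    """Fraction of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_windows(proteins, size=6, step=1, threshold=0.8, min_proteins=3):
    """Link windows from different proteins with identity >= threshold, then
    keep only the linked groups that span at least min_proteins proteins."""
    wins = sliding_windows(proteins, size, step)
    parent = list(range(len(wins)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-vs-all comparison: quadratic in the number of windows.
    for i, j in combinations(range(len(wins)), 2):
        if wins[i][0] != wins[j][0] and identity(wins[i][2], wins[j][2]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for idx, win in enumerate(wins):
        groups.setdefault(find(idx), []).append(win)
    return [g for g in groups.values() if len({pid for pid, _, _ in g}) >= min_proteins]


# Toy sequences, invented for this example only.
window_toy = {
    "p1": "MKTAYIAKQRWLDEVGHH",
    "p2": "GGSWLDEVGPLLA",
    "p3": "AAWLDEVGTTTKQ",
    "p4": "MNNNPPPQQQ",
}

clusters = cluster_windows(window_toy)
for group in clusters:
    print(sorted(group))
```

Overlapping windows from the same shared region come out as separate neighbouring clusters, which is where a boundary-refinement step would merge them. The all-vs-all loop will get slow for around 100 full-length proteins, so treat this as a way to understand the approach, not something to run on real data.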


Thanks for the useful suggestions! I will have a play with some of these.

Thanks so much for the offer to have a go with your software too. I'll drop you an email.

Reply written 2.6 years ago by crouch.k
2.6 years ago, Joe (United Kingdom) wrote:

It's a computationally expensive approach, but you could try a mix of threading and ab initio structure prediction. Structures diverge much more slowly than sequences, so with predicted structures in hand you may be able to try a 'brute force' in silico protein-protein docking approach.

You may simply not be able to 'see' this domain from the sequence alone. The Phyre webserver for threading allows batch submission, so you can do some 'cheap' computation that way. I-TASSER offers a download for the software so you can run it locally if you have the resources. If the sequences are short, the computation can be quite quick.

The threading approaches are still somewhat sequence constrained, but approaches that use ab initio steps might allow you to find a mystery structure that is energetically favourable but not well represented in the sequence databases you've been querying up to now.

One final quick and dirty thing you can do is run all the sequences through InterProScan, which would give you an idea of where any known domains fall in each sequence. You might be able to use that to guide your MSA gazing.
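
One way to use the InterProScan output for this is to list the stretches of each protein that no known signature covers, since a novel domain would have to sit there. A minimal sketch, assuming the standard tab-separated output with the protein id in column 1, sequence length in column 3, and match start/stop in columns 7 and 8 (check this against your version); the function name, `min_len` cutoff and example rows are made up.

```python
import csv
import io
from collections import defaultdict


def unannotated_regions(tsv_text, min_len=20):
    """Map protein id -> list of (start, end) regions, 1-based inclusive,
    of at least min_len residues that no reported match covers."""
    lengths = {}
    spans = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 8:
            continue  # skip blank or malformed lines
        pid = row[0]
        lengths[pid] = int(row[2])
        spans[pid].append((int(row[6]), int(row[7])))

    gaps = {}
    for pid, hits in spans.items():
        free, cursor = [], 1
        for start, stop in sorted(hits):
            if start - cursor >= min_len:
                free.append((cursor, start - 1))
            cursor = max(cursor, stop + 1)
        if lengths[pid] - cursor + 1 >= min_len:
            free.append((cursor, lengths[pid]))
        gaps[pid] = free
    return gaps


# Made-up rows in the assumed column layout; accessions are placeholders.
example_rows = [
    ["p1", "md5", "300", "Pfam", "SIG_A", "desc", "50", "120", "1e-5", "T", "01-01-2020"],
    ["p1", "md5", "300", "SMART", "SIG_B", "desc", "100", "150", "1e-3", "T", "01-01-2020"],
]
example_tsv = "\n".join("\t".join(r) for r in example_rows)

gaps = unannotated_regions(example_tsv)
print(gaps)
```

Proteins with no matches at all do not appear in the output file, so they are missing from the result; those are unannotated along their whole length.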
