Question

Finding novel domains in a group of proteins

1

7.4 years ago

crouch.k ▴ 30

I have been posed the following problem by a collaborator and I could use some advice on how to approach it:

The collaborator works with a non-model organism. He is interested in how a particular protein binds to other proteins. He has done a pulldown with his protein followed by Mass Spec, and has given me the gene ids from the mass spec hits.

In model organisms, there is a well-characterised domain for binding to his protein of interest, which has an entry in Pfam. I have not been able to find this domain in any of his hits, anywhere else in the genome of his organism, or indeed anywhere else in the wider clade. So the question is: is there some other domain that facilitates interactions with his protein of interest in this organism?

What I have are full-length amino acid sequences for around 100 proteins from the same organism. There is no indication of where a potential domain might be in each protein. It is probable that some (even many) of the proteins don't even bind the protein of interest directly, as they may be part of a larger complex or just sticky.

I am used to looking for divergent members of a gene family across organisms or starting with a domain that I have some structure for, but none of the usual approaches are working here - I don't feel like I have enough information to start with.

I feel like I need some way to narrow down the search. I have tried several variations on all vs all local alignments (blastp, psiblast, jackhmmer) in an attempt to find recurrent hits but nothing useful has come out of this. Out of desperation I tried a very naive MSA approach too, which unsurprisingly didn't yield anything.

Does anyone have any bright ideas?

genome gene alignment • 1.8k views

updated 7.4 years ago by Joe 22k • written 7.4 years ago by crouch.k ▴ 30

score 2 · Answer 1 · 2018-02-20

7.4 years ago

lieven.sterck 15k

One thing you might be able to try is the MEME software suite. It looks for 'conserved motifs' in unaligned sequences, both nucleotide and protein.
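As a rough illustration of the MEME suggestion, here is a minimal sketch of how a run on the set of hits could be set up from Python. It assumes the MEME suite is installed locally and that the sequences are in a FASTA file; the file name `hits.fasta`, the output directory `meme_out`, and the motif count and width limits are placeholders rather than recommendations. The `zoops` model (zero or one occurrence per sequence) is picked here only because some of the pulldown hits may not contain the motif at all.

```python
import shutil
import subprocess


def build_meme_command(fasta_path, out_dir, n_motifs=5, min_width=6, max_width=30):
    """Build a MEME command line for protein motif discovery.

    All default values are illustrative assumptions, not tuned settings.
    """
    return [
        "meme", fasta_path,
        "-protein",                 # treat input as amino acid sequences
        "-oc", out_dir,             # output directory (overwritten if present)
        "-mod", "zoops",            # zero or one motif occurrence per sequence
        "-nmotifs", str(n_motifs),  # how many motifs to report
        "-minw", str(min_width),    # minimum motif width
        "-maxw", str(max_width),    # maximum motif width
    ]


def run_meme(fasta_path, out_dir, **kwargs):
    """Run MEME if the executable is available on PATH."""
    if shutil.which("meme") is None:
        raise RuntimeError("meme executable not found on PATH")
    subprocess.run(build_meme_command(fasta_path, out_dir, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the command rather than running it.
    print(" ".join(build_meme_command("hits.fasta", "meme_out")))
```

Rerunning with a few different width ranges is cheap and gives a feel for how stable any reported motif is.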

Just another small point: the MSA approach is not that naive, I feel ;) It is how many of these conserved domains are identified/constructed.

0

Thanks! Yes I will have a play with MEME.

What I meant by the naivety of MSA is that at the moment I don't have any idea what a potential domain might look like or where it could be. I think I really need to try to refine domain boundaries first, even if not precisely, and then start looking at MSAs. In any event, the output I got with what I had was a mess!

written 7.4 years ago by crouch.k ▴ 30

score 1 · Answer 2 · 2018-02-21

I've recently been dealing with a similar problem: working with a non-model organism we have identified a set of proteins that bind a particular structure, and wanted to look for novel domains that might be responsible.

I agree with lieven.sterck that the MEME suite is a good place to start. Motif finding is a bit of a black art (you get rather different results with different programs), but another program I found very helpful was motifx (http://motif-x.med.harvard.edu/), now also available in an R package (https://github.com/omarwagih/rmotifx).

All-vs-all dotplots are also a good way to quickly see whether there are regions of similarity in your protein set, e.g. with Dotter (http://sonnhammer.sbc.su.se/Dotter.html), and you can quickly change parameters such as the window size and the similarity matrix to check you're not missing something obvious.
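To make the dotplot idea concrete, here is a minimal sketch of a windowed dotplot between two sequences. It is not how Dotter works internally: it scores windows by simple identity instead of a substitution matrix, and the function name and default thresholds are assumptions chosen for illustration.

```python
def dotplot(seq_a, seq_b, window=10, min_matches=6):
    """Return (i, j) start positions (0-based) of window pairs that share
    at least `min_matches` identical residues.

    Simplification: identity scoring only; a real tool would typically use
    a substitution matrix such as BLOSUM62.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences sharing one short stretch.
    a = "AAAAAWKDELLRGQAAAAA"
    b = "CCCWKDELLRGQCCC"
    for i, j in dotplot(a, b, window=8, min_matches=8):
        print(i, j, a[i:i + 8])
```

Runs of hits along a diagonal mark a shared region; loosening `min_matches` or changing `window` is the equivalent of the parameter changes mentioned above.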

I've actually written my own solution to this problem, which I hope to make available on Cyverse by the end of the year. It involves splitting the protein sequences into short windows, generating a similarity matrix, clustering based on this matrix using a method which can pull clusters out of noise, and then refining the domain boundaries. It worked extremely well in our particular case, and I'd be happy to run your data through it if you want to get in touch.
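The general shape of that pipeline (windows, similarity, clustering, boundary refinement) can be sketched in a few lines. To be clear, this is not the tool described above, whose details are not given here: the identity score, the single-linkage clustering via union-find, and every parameter value below are stand-in assumptions, and the all-pairs loop would be far too slow for 100 full-length proteins without some indexing.

```python
from collections import defaultdict
from itertools import combinations


def windows(seqs, size, step):
    """Yield (protein_id, start, subsequence) for each sliding window."""
    for pid, seq in seqs.items():
        for start in range(0, len(seq) - size + 1, step):
            yield pid, start, seq[start:start + size]


def identity(a, b):
    """Fraction of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def recurrent_regions(seqs, size=12, step=1, min_identity=0.6, min_proteins=3):
    """Group similar windows from different proteins and report groups that
    span at least `min_proteins` proteins as {protein_id: (start, end)}."""
    wins = list(windows(seqs, size, step))
    parent = list(range(len(wins)))  # union-find forest over windows

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link windows from different proteins that pass the similarity threshold.
    for i, j in combinations(range(len(wins)), 2):
        if wins[i][0] != wins[j][0] and identity(wins[i][2], wins[j][2]) >= min_identity:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, (pid, start, _) in enumerate(wins):
        clusters[find(i)].append((pid, start))

    regions = []
    for members in clusters.values():
        starts = defaultdict(list)
        for pid, start in members:
            starts[pid].append(start)
        if len(starts) >= min_proteins:
            # Crude boundary refinement: one span per protein covering all
            # of its clustered windows (0-based start, exclusive end).
            regions.append({p: (min(s), max(s) + size) for p, s in starts.items()})
    return regions
```

Single linkage tends to chain unrelated windows together once the threshold is loosened, which is presumably why a noise-tolerant clustering method matters in practice.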

score 1 · Answer 3 · 2018-02-21

It's a computationally expensive approach, but you could try a mix of threading and ab initio structure prediction. Sequences diverge much more quickly than structures do, so you may be able to try a 'brute force' in silico protein-protein docking approach.

You may simply not be able to 'see' this domain from the sequence alone. The Phyre webserver for threading allows batch submission, so you can do some 'cheap' computation that way. I-TASSER offers a download for the software so you can run it locally if you have the resources. If the sequences are short, the computation can be quite quick.

The threading approaches are still somewhat sequence constrained, but approaches that use ab initio steps might allow you to find a mystery structure that is energetically favourable but not well represented in the sequence databases you've been querying up to now.

One final quick and dirty thing you can do is run all the sequences through InterProScan, which would give you an idea of where any known domains appear in each sequence. You might be able to use that to guide your MSA gazing.
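One way to use the InterProScan output for this is to list the stretches of each protein that have no annotation at all, since a novel domain would have to sit in one of those. The sketch below assumes the tab-separated output format with the protein ID in column 1, sequence length in column 3, and match start and stop in columns 7 and 8; the function name and the `min_length` cutoff are illustrative. Proteins with no matches do not appear in the TSV, so they need to be added back by hand as fully unannotated.

```python
from collections import defaultdict


def unannotated_regions(tsv_lines, min_length=30):
    """Return {protein_id: [(start, end), ...]} for stretches of at least
    `min_length` residues not covered by any match (1-based, inclusive).

    Assumes InterProScan TSV columns: 1 = protein ID, 3 = sequence length,
    7 = match start, 8 = match stop.
    """
    lengths = {}
    hits = defaultdict(list)
    for line in tsv_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pid = fields[0]
        lengths[pid] = int(fields[2])
        hits[pid].append((int(fields[6]), int(fields[7])))

    gaps = {}
    for pid, length in lengths.items():
        free, cursor = [], 1  # cursor = first residue not yet covered
        for start, stop in sorted(hits[pid]):
            if start - cursor >= min_length:
                free.append((cursor, start - 1))
            cursor = max(cursor, stop + 1)
        if length - cursor + 1 >= min_length:
            free.append((cursor, length))
        gaps[pid] = free
    return gaps
```

Typical usage would be `unannotated_regions(open("results.tsv"))`, where `results.tsv` is a hypothetical output file name; the resulting intervals can then be cut out and fed to a motif finder or an MSA instead of the full-length sequences.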