Question: Cluster proteins according to their domain architecture
2
gravatar for dago
5.4 years ago by
dago2.6k
Germany
dago2.6k wrote:

I have a list of proteins that I want to divide into homogeneous (in terms of potential similar function) clusters.

As first approach I clustered them using h-cd-hit, with 3 reiteration at 90-80-70% similarity and allowing only 75 aa difference among the proteins. This parameters were chosen because they better resolve my data.

I obtained decent results for them, but when I look at the domain composition of the representative sequences of each clusters I can see that in same cases I have highly similar domain architecture.  I would say that similar domain architecture suggest similarity in function. Therefore I would like to perform a second clustering based on similarity of domain architecture.

For example:

1) ----domA-------domB---domC

2) ---domB---domC-------

3)---domA--domW---domE

 

In this case 1) and 2) will cluster together.

Is there any available tool for doing that?

ADD COMMENTlink modified 4.3 years ago by Biostar ♦♦ 20 • written 5.4 years ago by dago2.6k

Today I got to know about this  tool:

MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information

I do not really see the advantage of performing the comparison between proteins based on Pfam domains at the DNA level. Any comments?

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by dago2.6k
6
gravatar for a.zielezinski
5.4 years ago by
a.zielezinski9.2k
a.zielezinski9.2k wrote:

That's a very interesting question. I'm dying to hear other people's answers. I don't recall any programs that could do this kind of clustering automatically. However, this could be done in two separate steps:

  1. Calculate pairwise similarities between all versus all domain architectures. Here, you will need to assume some kind of similarity measure between domain architectures. Smart idea was proposed recently in the paper from Bioinformatics: the similarity between two proteins was determined by aligning the proteins’ domain arrangements to each other using a dynamic programming algorithm (Needleman-Wunsch or Smith-Waterman).
  2. Use the file with pairwise similarities as an input for clustering programs [e.g. Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix)]. These programs allow the efficient and rapid clustering of any arbitrary set of protein sequences, given a list of all pairwise similarities obtained by another method. I highly recommend using TransClust, published in Nature Methods (2010). A recent benchmarking paper says that this program clearly outperformed the other methods for most datasets.
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by a.zielezinski9.2k
1

By the way there is also a new paper that introduces the Multiple Domain Alignmetn using the RADS algorithm

http://www.biomedcentral.com/1471-2105/16/19/abstract

ADD REPLYlink written 5.4 years ago by dago2.6k

This is a really good suggestion, thanks! Although, I knew Transclust, the first tool you suggested it is new for me. I just finished to read the paper. Really nice. I will give it a try now!

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by dago2.6k
1
gravatar for 5heikki
5.4 years ago by
5heikki8.9k
Finland
5heikki8.9k wrote:

One way would be to generate dummy files from the sequence data, so instead of:

>seq1
KJSLKJALKD
>seq2
ADJSKJKJKDDSJK
>seq3
DSJDKJDKSJKJDS

There would a file with:

>seq1
DOMA DOMC
>seq2
DOMC DOMA DOMB
>seq3
DOMA DOMA DOMC

 

And then cluster these with some algorithm which allows you to define your own alphabet..

ADD COMMENTlink written 5.4 years ago by 5heikki8.9k

Thanks. This is indeed the same approach I was planning to use, combining it with the tools that @a.zielenziski suggested. However, I am having trouble to convert my files (gff, hmmscan, pfamsca output) into xdom format. Anyone ever worked with xdom format?

They look like that:

>ENSP00000376776  617
57  171 DOMON 2.0e-25
213 341 Cu2_monooxygen  7.5e-43
360 521 Cu2_monoox_C  2.3e-52
ADD REPLYlink written 5.4 years ago by dago2.6k

dago: If you show snippets (maybe a few lines) of your input files I will help you to write a script that creates the XDOM-formatted file.

ADD REPLYlink written 5.4 years ago by a.zielezinski9.2k

You are right.

I was thinking to convert an tsvobtained from an interpro scan

 

P51587  14086411a2cdf1c4cba63020e1622579        3418    Pfam    PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1        2670    2799    7.9E-43 T       15-03-2013
P51587  14086411a2cdf1c4cba63020e1622579        3418    ProSiteProfiles PS50138 BRCA2 repeat profile.   1002    1036    0.0     T       18-03-2013      IPR002093       BRCA2 repeat    GO:0005515|GO:0006302
P51587  14086411a2cdf1c4cba63020e1622579        3418    Gene3D  G3DSA:2.40.50.140               2966    3051    3.1E-52 T       15-03-2013
...

 

But I am struggling quite a bit with that

 

ADD REPLYlink written 5.4 years ago by dago2.6k

You should also have somewhere a file with protein sequences in FASTA format. We need information about lengths of the proteins. 

>ENSP00000376776  617 <-- length

The output from InterProScan doesn't include this information.

ADD REPLYlink written 5.4 years ago by a.zielezinski9.2k

I read on the google page of the interproscan that the third column is the seq length.

P51587  14086411a2cdf1c4cba63020e1622579        3418 <--
ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by dago2.6k

Aaaa - you're right. Are there spaces or tabs separating columns in the InterProScan output? 

ADD REPLYlink written 5.4 years ago by a.zielezinski9.2k

I think you can choose, tsv (the one above), html, xml, and gff. I jsut found a Perl module called Xdom. To create such files. But I cannot get it running

ADD REPLYlink written 5.4 years ago by dago2.6k
1

If the InterProScan file is tab-separated this script should do the trick:

from itertools import groupby
import sys

fh = open(sys.argv[1])
oh = open(sys.argv[1]+'.xdom', 'w')

EVALUE_CUTOFF = 1e-03

for k, g in groupby(fh, lambda x: x.split()[0]):
    l = []
    for line in g:
        sl = line.split('\t')
        print sl
        if sl[3] == 'Pfam' and float(sl[8])<=EVALUE_CUTOFF:
            l.append('{0}\t{1}\t{2}\t{3}\n'.format(sl[6], sl[7], sl[4], sl[8]))
    if l:
        oh.write('>{0}\t{1}\n'.format(k, sl[2]))
        oh.writelines(l)
oh.close()
fh.close()

Save this code to the interpro2xdom.py file. Run it as follows:

python interpro2xdom.py file_with_interpro_domains

The results will be saved in file_with_interpro_domains.xdom

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by a.zielezinski9.2k

This is really great. However, it prints only the protein name and the length. this part

"oh.write('{0}\t{1}\t{2}\t{3}\n'.format(sl[9], sl[10], sl[4], sl[11]))"


should refer to the filed to append after the > right?

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by dago2.6k

Hmm.. it works well on my sample file. Make sure, your file is tab-separated and contains domains identified by Pfam.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by a.zielezinski9.2k

Thanks. I tried with my data and also with the tsv example I reported above, but no results

ADD REPLYlink written 5.4 years ago by dago2.6k

Paste several lines of your InterProScan. I'll try to figure what's wrong.

ADD REPLYlink written 5.4 years ago by a.zielezinski9.2k

Well, thanks very much

https://www.dropbox.com/s/f750gs0jf3efw4a/tmp?dl=0

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by dago2.6k
1

Thanks, I edited my comment post. Please try the new code, it should work fine.

ADD REPLYlink written 5.4 years ago by a.zielezinski9.2k

There are some warnings and error, but it does the job quite well I must say. I really appreciate that.

ADD REPLYlink written 5.4 years ago by dago2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 681 users visited in the last hour