Question

Cluster proteins according to their domain architecture

2

Entering edit mode

10.3 years ago

dago ★ 2.8k

I have a list of proteins that I want to divide into homogeneous (in terms of potential similar function) clusters.

As first approach I clustered them using h-cd-hit, with 3 reiteration at 90-80-70% similarity and allowing only 75 aa difference among the proteins. This parameters were chosen because they better resolve my data.

I obtained decent results for them, but when I look at the domain composition of the representative sequences of each clusters I can see that in same cases I have highly similar domain architecture. I would say that similar domain architecture suggest similarity in function. Therefore I would like to perform a second clustering based on similarity of domain architecture.

For example:

----domA-------domB---domC
---domB---domC-------
---domA--domW---domE

In this case 1) and 2) will cluster together.

Is there any available tool for doing that?

PROTEIN domains clustering phylogeny alignment • 5.7k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

Today I got to know about this tool:

MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information

I do not really see the advantage of performing the comparison between proteins based on Pfam domains at the DNA level. Any comments?

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.3 years ago by dago ★ 2.8k

Ram · Answer 1 · 2015-04-01

That's a very interesting question. I'm dying to hear other people's answers. I don't recall any programs that could do this kind of clustering automatically. However, this could be done in two separate steps:

Calculate pairwise similarities between all versus all domain architectures. Here, you will need to assume some kind of similarity measure between domain architectures. Smart idea was proposed recently in the paper from Bioinformatics: the similarity between two proteins was determined by aligning the proteins' domain arrangements to each other using a dynamic programming algorithm (Needleman-Wunsch or Smith-Waterman).
Use the file with pairwise similarities as an input for clustering programs [e.g. Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix)]. These programs allow the efficient and rapid clustering of any arbitrary set of protein sequences, given a list of all pairwise similarities obtained by another method. I highly recommend using TransClust, published in Nature Methods (2010). A recent benchmarking paper says that this program clearly outperformed the other methods for most datasets.

Ram · Answer 2 · 2015-04-01

1

Entering edit mode

10.3 years ago

5heikki 11k

One way would be to generate dummy files from the sequence data, so instead of:

>seq1
KJSLKJALKD
>seq2
ADJSKJKJKDDSJK
>seq3
DSJDKJDKSJKJDS

There would a file with:

>seq1
DOMA DOMC
>seq2
DOMC DOMA DOMB
>seq3
DOMA DOMA DOMC

And then cluster these with some algorithm which allows you to define your own alphabet..

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by 5heikki 11k

0

Entering edit mode

Thanks. This is indeed the same approach I was planning to use, combining it with the tools that @a.zielenziski suggested. However, I am having trouble to convert my files (gff, hmmscan, pfamsca output) into xdom format. Anyone ever worked with xdom format?

They look like that:

>ENSP00000376776  617
57  171 DOMON 2.0e-25
213 341 Cu2_monooxygen  7.5e-43
360 521 Cu2_monoox_C  2.3e-52

ADD REPLY • link 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

dago: If you show snippets (maybe a few lines) of your input files I will help you to write a script that creates the XDOM-formatted file.

ADD REPLY • link 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

You are right.

I was thinking to convert a tsv obtained from an interpro scan

P51587  14086411a2cdf1c4cba63020e1622579        3418    Pfam    PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1        2670    2799    7.9E-43 T       15-03-2013
P51587  14086411a2cdf1c4cba63020e1622579        3418    ProSiteProfiles PS50138 BRCA2 repeat profile.   1002    1036    0.0     T       18-03-2013      IPR002093       BRCA2 repeat    GO:0005515|GO:0006302
P51587  14086411a2cdf1c4cba63020e1622579        3418    Gene3D  G3DSA:2.40.50.140               2966    3051    3.1E-52 T       15-03-2013
...

But I am struggling quite a bit with that

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

You should also have somewhere a file with protein sequences in FASTA format. We need information about lengths of the proteins.

>ENSP00000376776  617 <-- length

The output from InterProScan doesn't include this information.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

I read on the google page of the interproscan that the third column is the seq length.

P51587  14086411a2cdf1c4cba63020e1622579        3418

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

Aaaa - you're right. Are there spaces or tabs separating columns in the InterProScan output?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

I think you can choose, tsv (the one above), html, xml, and gff. I jsut found a Perl module called Xdom. To create such files. But I cannot get it running

ADD REPLY • link 10.3 years ago by dago ★ 2.8k

1

Entering edit mode

If the InterProScan file is tab-separated this script should do the trick:

from itertools import groupby
import sys

fh = open(sys.argv[1])
oh = open(sys.argv[1]+'.xdom', 'w')

EVALUE_CUTOFF = 1e-03

for k, g in groupby(fh, lambda x: x.split()[0]):
    l = []
    for line in g:
        sl = line.split('\t')
        print sl
        if sl[3] == 'Pfam' and float(sl[8])<=EVALUE_CUTOFF:
            l.append('{0}\t{1}\t{2}\t{3}\n'.format(sl[6], sl[7], sl[4], sl[8]))
    if l:
        oh.write('>{0}\t{1}\n'.format(k, sl[2]))
        oh.writelines(l)
oh.close()
fh.close()

Save this code to the interpro2xdom.py file. Run it as follows:

python interpro2xdom.py file_with_interpro_domains

The results will be saved in file_with_interpro_domains.xdom

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

This is really great. However, it prints only the protein name and the length. this part

"oh.write('{0}\t{1}\t{2}\t{3}\n'.format(sl[9], sl[10], sl[4], sl[11]))"

should refer to the filed to append after the > right?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

Hmm.. it works well on my sample file. Make sure, your file is tab-separated and contains domains identified by Pfam.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thanks. I tried with my data and also with the tsv example I reported above, but no results

ADD REPLY • link 10.3 years ago by dago ★ 2.8k

0

Entering edit mode

Paste several lines of your InterProScan. I'll try to figure what's wrong.

ADD REPLY • link 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Well, thanks very much

https://www.dropbox.com/s/f750gs0jf3efw4a/tmp?dl=0

ADD REPLY • link 10.3 years ago by dago ★ 2.8k

1

Entering edit mode

Thanks, I edited my comment post. Please try the new code, it should work fine.

ADD REPLY • link 10.3 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

There are some warnings and error, but it does the job quite well I must say. I really appreciate that.

ADD REPLY • link 10.3 years ago by dago ★ 2.8k