Question

How To Find Protein Clusters/Families?

4

13.8 years ago

Panos ★ 1.8k

I have a metagenomic dataset with several hundred thousand relatively small genes, and I want to find the protein families present in it (based on their amino acid sequences)...

I've tried cdhit with a 40% identity threshold, as well as mclblastline with inflation values ranging from 3.0 to 10.0. cdhit creates smaller (fewer members per family) but more homogeneous families than mcl, judging by the domains present and the KEGG orthology assignments.
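For readers unfamiliar with what the inflation value controls, here is a minimal toy sketch of the two alternating steps of Markov clustering (expansion and inflation) on a tiny similarity matrix. It is only an illustration under simplifying assumptions (dense matrix, no pruning, fixed iteration count), not the actual mcl implementation; the function names and the example graph are made up for this sketch.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def toy_mcl(similarity, inflation=2.0, iterations=50):
    """Toy Markov clustering; `similarity` is a symmetric list of lists."""
    n = len(similarity)
    # Self-loops keep every column non-empty.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[v ** inflation for v in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two disconnected triangles (hypothetical sequences 0-2 and 3-5).
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(toy_mcl(graph, inflation=3.0))
```

Raising each entry to the inflation power strengthens strong connections and weakens weak ones, which is why larger inflation values tend to give finer-grained clusters.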

I think I prefer cdhit's clusters; the only thing I don't like is that I can't lower the threshold below 40% (which I think may be necessary for such a task). On the other hand, mcl is designed for exactly this kind of task, right?
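As a rough sketch of how the threshold ties in with cd-hit's other settings: the word size (`-n`) has to match the identity threshold (`-c`), and the pairing below follows the ranges given in the CD-HIT user guide as I remember them. The helper names and file names are hypothetical, and the command is only built as a list here, not executed.

```python
def cdhit_word_size(identity):
    """Suggested -n word size for a protein identity threshold (-c)."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit protein thresholds are limited to 0.4-1.0")
    if identity >= 0.7:   # boundary values are assigned to the upper range
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cdhit_command(fasta_in, out_prefix, identity):
    """Build (but do not run) a cd-hit command line as an argument list."""
    return ["cd-hit", "-i", fasta_in, "-o", out_prefix,
            "-c", str(identity), "-n", str(cdhit_word_size(identity))]


# Hypothetical input and output names.
print(" ".join(cdhit_command("genes.faa", "clusters40", 0.4)))
```

The list can be passed to `subprocess.run` once cd-hit is installed; check the options against your installed version first.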

Has anyone done such an analysis and can give me some advice? I'm particularly interested in people's opinions about mcl.

clustering protein • 4.7k views

updated 13.8 years ago by Casbon ★ 3.3k • written 13.8 years ago by Panos ★ 1.8k

1

Are you trying to find a specific family or classify your whole set? Is it enough for you to classify your genes into known protein families, or do you want unknowns clustered too? I think there are quite a few possible approaches, depending on what you are trying to achieve.

13.8 years ago by Michael Schubert ★ 7.1k

0

Thanks for the comment Michael!

There are two things I want to do:

(a) find sub-families within families identified by HMMer3 (eg different groups of ABC transporters), and

(b) find families that don't exist in pfam.

13.8 years ago by Panos ★ 1.8k

score 4 · Answer 1 · 2010-10-21

13.8 years ago

Casbon ★ 3.3k

I wrote a paper on clustering protein families a while back: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1409676/ This figure compares TribeMCL to a spectral method; the target groups are SCOP structural families. There is now an implementation available (I haven't tried it myself): http://www.paccanarolab.org/software/scps/index.html

Nevertheless, I would treat this as a classification problem rather than a clustering problem. I would use hmmer3 with pfam models to classify the genes into known pfam families, unless you have reason to suspect that your data contains previously unseen families.

13.8 years ago by Casbon ★ 3.3k

0

Thanks for the reply!

I have already run HMMer predictions against the entire pfam database, but there are still lots of genes with no domains at all...
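One way to pull out those domain-less genes for separate clustering is to compare the full gene list against the HMMER tabular output. The sketch below assumes the table came from `hmmscan --tblout`, where the third whitespace-separated column is the query (gene) name; for `hmmsearch` output the gene would be in the first column instead. The function names and the example rows are made up for illustration.

```python
def genes_with_hits(tblout_lines, gene_column=2):
    """Collect gene IDs that appear in at least one tblout row."""
    hits = set()
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.add(fields[gene_column])
    return hits


def genes_without_domains(all_gene_ids, tblout_lines, gene_column=2):
    """Return the genes (in input order) that have no hit in the table."""
    annotated = genes_with_hits(tblout_lines, gene_column)
    return [gene for gene in all_gene_ids if gene not in annotated]


# Hypothetical example: one hit row in hmmscan --tblout layout.
example_rows = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "ABC_tran PF00005.27 gene_1 - 1.2e-30 105.3 0.0 1.5e-30 105.0 0.0 "
    "1.0 1 0 0 1 1 1 1 ABC transporter",
]
print(genes_without_domains(["gene_1", "gene_2", "gene_3"], example_rows))
```

The genes returned here would be the ones to feed into cdhit or mcl when looking for families that are not in pfam.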

13.8 years ago by Panos ★ 1.8k

score 1 · Answer 2 · 2010-10-19

Try Sarah Teichmann's "GENEFAMMER"; it's a protein family clustering package.

It's an old program and I could not find a direct download link, but you can write to the authors (Jong Park & Sarah Teichmann).

Other info: http://www.ncbi.nlm.nih.gov/proteinclusters

UCLUST: http://www.drive5.com/usearch/intro.html