Question

Clustering protein sequences

9.2 years ago

n00bgenome ▴ 40

Hi all,

What I'm trying to do is take a large number of amino acid sequences (~1000) and condense them into a few clusters (~10), based solely on sequence similarity. These amino acid sequences are all orthologs to begin with (same Pfam family). I have tried two approaches, and I am stuck on both.

Approach 1) MCL Clustering

In this approach, I take my ~1000 sequences and run CD-HIT, which reduces them to 164 sequences. I then run blastp all-against-all on these 164 sequences to get an E-value for each pairwise comparison. Following the approach outlined at http://micans.org/mcl/ , I run MCL, but it will not split the sequences into clusters regardless of the inflation parameter. My struggles with this approach are outlined here: Troubleshooting MCL - always returns 1 cluster no matter inflation value
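
For reference, the step between blastp and MCL can be sketched roughly as below. This is only an illustration, not the exact protocol from the MCL site: it assumes the BLAST output is in the standard 12-column tabular format (`-outfmt 6`), and the function name `blast_to_edges` and the cap of 200 are choices made here for the example. Each E-value becomes an edge weight of -log10(E), so stronger hits get larger weights. If nearly all pairs end up with similar weights, the graph has little contrast for a clustering method to work with.

```python
import math

def blast_to_edges(lines, cap=200.0):
    """Convert BLAST tabular lines (-outfmt 6) into (query, subject, weight) edges.

    weight = -log10(E-value), capped at `cap` (an assumed value).
    Self-hits and hits whose weight is not positive (E-value >= 1) are dropped.
    """
    edges = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular line
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue
        # An E-value of 0 would give an infinite weight, so use the cap instead.
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        if weight > 0:
            edges.append((query, subject, weight))
    return edges

# Made-up example hits: columns 11 and 12 are the E-value and bit score.
hits = [
    "seqA\tseqB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180",
    "seqA\tseqC\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-5\t45",
    "seqA\tseqA\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t210",
]
for query, subject, weight in blast_to_edges(hits):
    print(f"{query}\t{subject}\t{weight:.1f}")
```

The printed lines are in the three-column "label label weight" form that MCL can read with its `--abc` option.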

Approach 2) Clustering with PHYLIP

In this approach, I take my ~1000 sequences and run CD-HIT, which reduces them to 164 sequences. I then build a multiple sequence alignment with MUSCLE in MEGA and save it as a NEXUS file. To get protdist to accept the input file, I use http://www-bimas.cit.nih.gov/molbio/readseq/ to convert it to PHYLIP format, and then feed it into protdist. I am uncertain how to proceed from here. I have seen threads discuss the seqboot -> protdist -> neighbor -> consense pipeline, but I don't understand why each program is needed. The concept of bootstrapping is confusing, and my goal seems fairly straightforward. Any help would be most appreciated!
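
As background on that pipeline: seqboot resamples the alignment columns to make replicate data sets, protdist computes pairwise distances, neighbor builds a tree from the distances, and consense summarises the replicate trees with support values. If the goal is only to group sequences, the protdist distance matrix on its own may already be enough. A minimal sketch for reading such a matrix into Python follows. It assumes a square matrix and sequence names without spaces that are separated from the numbers by whitespace; the function name `read_phylip_distances` is made up for this example.

```python
def read_phylip_distances(text):
    """Parse a square PHYLIP distance matrix into (names, matrix).

    Assumes names contain no whitespace. Rows may wrap over several lines,
    so the file is read as one stream of whitespace-separated tokens.
    """
    tokens = text.split()
    n = int(tokens[0])  # first token is the number of sequences
    names, matrix = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        matrix.append([float(t) for t in tokens[pos + 1 : pos + 1 + n]])
        pos += 1 + n
    return names, matrix

# Made-up three-sequence example in the square layout.
example = """    3
seqA        0.000000  0.100000  0.900000
seqB        0.100000  0.000000  0.800000
seqC        0.900000  0.800000  0.000000
"""
names, matrix = read_phylip_distances(example)
print(names)
print(matrix[0])
```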

Clustering phylip MCL protdist seqboot • 3.6k views

updated 9.1 years ago by Biostar 20 • written 9.2 years ago by n00bgenome ▴ 40

Since you have 164 unique sequences (or so it appears), trying to cram them into 10 clusters is not meaningful.
But since you want to do it, you could look at the MUSCLE alignment and then cut the accompanying NJ tree into pieces such that you end up with 10 "clusters".
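
One way to make "cut the tree into k pieces" concrete is sketched below. Note that this swaps in average-linkage (UPGMA-style) agglomerative clustering on a distance matrix rather than cutting an actual NJ tree, so it is an approximation of the idea and not the same procedure. The function name `average_linkage_clusters` and the toy distances are invented for the example.

```python
def average_linkage_clusters(names, dist, k):
    """Merge the two closest clusters (by mean pairwise distance) until k remain."""
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                mean = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [[names[i] for i in members] for members in clusters]

# Toy distances: a/b are close, c/d are close, the two pairs are far apart.
names = ["a", "b", "c", "d"]
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
print(average_linkage_clusters(names, dist, 2))
```

This naive version recomputes every cluster pair on each pass, which should still be workable for a couple of hundred sequences but would be slow for much larger sets.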

9.2 years ago by GenoMax 152k

They are definitely not all unique; if anything, they are too similar. My BLAST results show the vast majority of E-values above E-10. I don't need exactly 10 clusters, but the sequences should be able to be grouped further. MCL just gives me 1 cluster, as if they are all too similar to be split up further.

Building an NJ tree in MEGA did get me closer to what I want, but it's like you said, it isn't "meaningful." Do you have any idea why MCL won't split them up further? If that worked, it seems to be what I'm looking for. Will the protdist workflow get me something similar?

9.2 years ago by n00bgenome ▴ 40

Here is what the NJ tree did:

http://postimg.org/image/ciq2ckjox/

So the problem is that I don't have any sense of how far apart those clusters in the tree are. It seems like protdist might be the way to get that information? Can you recommend how I'd go about doing that, in slightly more detail than seqboot -> protdist -> neighbor -> consense? Thank you very much for all your help.

9.2 years ago by n00bgenome ▴ 40

Since these sequences are artificial (you have joined two domains together, correct?), I am not sure you can derive a valid inference. You can bootstrap the tree to get bootstrap values that indicate a measure of confidence for the branches.
If you start at the top and descend the tree, you should be able to roughly cut it into 10-15 clusters.
I don't remember if we discussed this across these many threads, but what is your ultimate aim here?
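
To illustrate what bootstrapping does to an alignment (this is the idea behind seqboot, not its actual code): alignment columns are drawn at random with replacement to build a replicate of the same length, and a tree is built from each replicate. A branch that shows up in most replicate trees gets a high bootstrap value. The function name `bootstrap_alignment` and the toy alignment below are made up for this sketch.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # The same column indices are used for every sequence, so columns stay intact.
    picked = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in picked) for name in names}

# Toy alignment of three equal-length sequences.
alignment = {"seqA": "MKTAYIAK", "seqB": "MKSAYLAK", "seqC": "MRTAFIAK"}
rng = random.Random(42)  # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```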

9.2 years ago by GenoMax 152k

I also made a max likelihood tree and got much the same result.

I want to find a few sequences that give enough 'coverage' of the diversity of all the sequences. So if I chose 10 clusters and picked a representative sequence for each cluster, then every sequence in a cluster would be highly similar to its representative. However, the concept of the degree of similarity is where I am struggling. I was hoping the sequences would split into obvious clusters, but that was clearly silly.
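
A rough sketch of that idea, assuming clusters and a pairwise distance matrix are already in hand: pick the medoid of each cluster (the member with the smallest total distance to the others) as its representative, and report the largest distance from the representative to any member as a simple measure of how well it "covers" the cluster. The function name `pick_representatives` and the toy numbers are invented for the example.

```python
def pick_representatives(clusters, names, dist):
    """For each cluster, return (medoid name, largest distance to a member)."""
    index = {name: i for i, name in enumerate(names)}
    result = []
    for members in clusters:
        ids = [index[m] for m in members]
        # Medoid: the member with the smallest summed distance to all members.
        medoid = min(ids, key=lambda i: sum(dist[i][j] for j in ids))
        radius = max(dist[medoid][j] for j in ids)
        result.append((names[medoid], radius))
    return result

# Toy data: b sits between a and c, d is on its own.
names = ["a", "b", "c", "d"]
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 0.9],
    [0.2, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
for name, radius in pick_representatives([["a", "b", "c"], ["d"]], names, dist):
    print(name, radius)
```

If the radius of a cluster is larger than the divergence you are willing to tolerate, that cluster probably needs to be split further.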

Thanks for all your help genomemax2!

9.1 years ago by n00bgenome ▴ 40

Have you tried different inflation parameters with MCL? This would influence the number of clusters you get.
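
To show what inflation does, here is a tiny dense toy version of the MCL iteration (expansion by matrix squaring, then inflation by raising entries to a power and rescaling columns). It is a teaching sketch only, not the real `mcl` program, and the names `mcl_toy`, `two_groups` and `uniform` are made up. On a graph with two tight groups joined by a weak edge it should separate the groups, while on a graph where every pair has the same weight there is nothing for inflation to amplify, which might be what happens when almost all E-values are similarly strong.

```python
def normalise_columns(m):
    """Rescale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl_toy(weights, inflation=2.0, iterations=50, threshold=1e-6):
    """Toy MCL on a symmetric weight matrix; returns clusters as index lists."""
    n = len(weights)
    # Add self-loops, then turn the weights into column-stochastic flow.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (flow spreads along paths of length 2).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flow and weaken weak flow.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    # Clusters are the groups of nodes still linked by non-negligible flow.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two triangles (0,1,2) and (3,4,5) joined by one weak edge between 2 and 3.
two_groups = [[0.0] * 6 for _ in range(6)]
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                two_groups[i][j] = 1.0
two_groups[2][3] = two_groups[3][2] = 0.1

# Every pair equally similar: no structure to find.
uniform = [[0.0 if i == j else 1.0 for j in range(6)] for i in range(6)]

print(mcl_toy(two_groups, inflation=2.0))
for r in (1.4, 2.0, 6.0):
    print(r, mcl_toy(uniform, inflation=r))
```

With the real program, inflation is set with the `-I` option.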

9.1 years ago by Christian ★ 3.1k