I am trying to take a large number of amino acid sequences (~1000) and condense them into a few clusters (~10), based solely on sequence similarity. These sequences are all homologous to begin with (same Pfam family). I have tried two approaches, and I am stuck on both.
Approach 1) MCL Clustering
In this approach, I take my ~1000 sequences and run CD-HIT, which reduces them to 1 cluster of 164 sequences. I then run blastp all-against-all on these 164 sequences and get an E-value for each pairwise comparison. Following the approach outlined at http://micans.org/mcl/ , I run the MCL program, but it returns a single cluster regardless of the inflation parameter. My struggles with this approach are described in this thread: "Troubleshooting MCL - always returns 1 cluster no matter inflation value".
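For reference, here is a minimal sketch of how the all-against-all blastp table could be turned into MCL's label ("abc") input. It assumes the BLAST output is in tabular format (`-outfmt 6`, E-value in the 11th column) and converts each E-value to -log10(E), since MCL treats larger edge weights as stronger similarity and raw E-values would invert that. The function name `blast_to_abc` and the cap of 200 are illustrative choices, not something taken from my actual run.

```python
import math


def blast_to_abc(blast_lines, cap=200.0):
    """Turn BLAST tabular hits into 'abc' edges weighted by -log10(E-value).

    Assumes -outfmt 6 column order: qseqid, sseqid, ..., evalue (index 10), bitscore.
    The cap is an illustrative choice that keeps E-values of 0 finite.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue  # self-hits carry no clustering information
        weight = cap if evalue <= 0 else min(cap, -math.log10(evalue))
        if weight <= 0:
            continue  # E-value >= 1: not treated as evidence of similarity
        # Keep one undirected edge per pair, using the stronger of the two hits.
        key = tuple(sorted((query, subject)))
        best[key] = max(best.get(key, 0.0), weight)
    return [f"{a}\t{b}\t{w:.2f}" for (a, b), w in sorted(best.items())]


if __name__ == "__main__":
    # Hypothetical hits, made up purely to show the input shape.
    demo = [
        "s1\ts2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "s2\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180",
    ]
    print("\n".join(blast_to_abc(demo)))
```

The resulting lines can be written to a file and passed to MCL in label mode, for example `mcl seqs.abc --abc -I 2.0 -o out.mcl`, where `seqs.abc` and `out.mcl` are placeholder file names.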
Approach 2) Clustering with PHYLIP
In this approach, I take my ~1000 sequences and run CD-HIT, which reduces them to 164 sequences. I then build a multiple sequence alignment with MUSCLE in MEGA and save it as a NEXUS file. To get protdist to accept the input, I use http://www-bimas.cit.nih.gov/molbio/readseq/ to convert it to PHYLIP format, and then feed it into protdist. I am uncertain how to proceed from here. I have seen threads that discuss the seqboot -> protdist -> neighbor -> consense pipeline, but I don't understand why each program is needed. The concept of bootstrapping is confusing to me, and my goal seems fairly straightforward. Any help would be most appreciated!
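To make the goal concrete, here is a rough sketch of what I imagine doing with the protdist output: read the distance matrix and cut it into a fixed number of groups. It uses plain average-linkage (UPGMA-style) agglomerative clustering, which is a swapped-in technique and not what neighbor or consense do, and it skips bootstrapping entirely. It assumes protdist wrote a square matrix with names in the first 10 columns of each entry; the function names `read_phylip_square` and `average_linkage_clusters` are made up for illustration.

```python
def read_phylip_square(lines):
    """Parse a square PHYLIP distance matrix into (names, rows).

    Assumes the first line holds the number of taxa and each entry starts
    with a 10-character name, with distances possibly wrapped over lines.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    i = 1
    for _ in range(n):
        names.append(lines[i][:10].strip())
        values = [float(x) for x in lines[i][10:].split()]
        i += 1
        while len(values) < n:  # continuation lines hold only numbers
            values.extend(float(x) for x in lines[i].split())
            i += 1
        rows.append(values)
    return names, rows


def average_linkage_clusters(dist, k):
    """Merge the closest pair of clusters (by mean distance) until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Tiny made-up matrix, only to show the expected input layout.
    demo = [
        "    3",
        "seqA        0.000000  0.100000  0.900000",
        "seqB        0.100000  0.000000  0.800000",
        "seqC        0.900000  0.800000  0.000000",
    ]
    names, dist = read_phylip_square(demo)
    for group in average_linkage_clusters(dist, 2):
        print([names[i] for i in group])
```

For my data, `k` would be around 10 and the input would be the lines of protdist's `outfile`. Whether a hard cut like this is defensible without bootstrap support is part of what I am asking.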