Question: Meaningful clusters in phylogenetic trees
0
gravatar for n00bgenome
3.9 years ago by
n00bgenome20
n00bgenome20 wrote:

So I have generated a large fastA file where I took 2 protein domains from a protein sequence. To be clear, I had an initial protein sequence like AAAAAAABBBBBBCCCCCDDDDDDEEEEEE and I created a new sequence that was BBBBBBDDDDDD.

Now, with Mega, I used Muscle to align all my sequences, and then generated a Maximum Likelihood Tree. Many of sequences were redundant or very similar. What I would like to do is to group these sequences together into clusters based on how different they are in sequence. So if I have ~300 sequences, I would like to group them into ~30 clusters. How would I go about doing this? How can I get a measure of how different the sequences are in absolute terms, not just the binning process by these trees where it gives me value for branch length?

Thanks so much!

phylogeny • 2.5k views
ADD COMMENTlink modified 3.9 years ago by Naresh60 • written 3.9 years ago by n00bgenome20
2

You created artificial sequences and now want to further divide them into a random number of clusters? What valid scientific inference can you draw after these sorts of manipulations?

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by genomax80k

So we have two regions, A & B, in a protein that are responsible for binding to promoter for RNA polymerization. For primary sigma factors, there are 3000 available sequences, and I would like to know how closely related the sequences for promoter specificity is across these 3000 sequences. So, for example, if many of them are identical or very closely related, I would like to collapse that into a group and represent that as a branch on the tree. So I could gain insight in knowing that of the 3000 sequences sequenced, there are grouped into X number that are actually very distinct from one another.

What I am asking is, how is this broad binning done? Is there a value that gives the percentage of similarity across sequences that I could use?

ADD REPLYlink written 3.9 years ago by n00bgenome20
1

I am not a phylogenetics expert and someone else may have more enlightening advice later but for now here goes.
Since you have done a MSA of all the sequences did you build a simple NJ tree? You could use that as a guide to select sequences for representation of "clusters" that you are thinking of. This thread Reducing Number Of Sequences For Phylogentic Tree Construction suggests using 90% identity. Still by artificially joining those two domains you are likely ignoring important information (just a feeling).

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by genomax80k

Thanks for the link, I'll use that.

I have created a Max likelihood tree which does help a bit, but I'll need something more programmatic to get down to a smaller number of clusters.

In regards to your point on artificially joining these domains, we conversed on the other thread about this. I considered adding a series of dashes to make sure there are not any overlaps bridging the two sequences, but you made the point that since there is homology, it will align that way anyway (which is what I think did happen).

I need to use both protein domains instead of using them individually because the first domain specifies the -35 region and the second domain specifies the -10 region. Therefore, if I split them up, I won't be able to tell which sigma factors as a whole are distinct (e.g. sigma factor 1 has regions A & B and sigma factor 2 has regions A & C)

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by n00bgenome20

In regards to your point on artificially joining these domains, we conversed on the other thread about this. I considered adding a series of dashes to make sure there are not any overlaps bridging the two sequences, but you made the point that since there is homology, it will align that way anyway (which is what I think did happen).

I was referring to the intervening sequence between the two domains, leaving it intact. If the proteins are homologous that part should align as well.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by genomax80k

Hi, Even i have similar problem like you. Your research area matches with mine. Is that Ok if we discuss.

ADD REPLYlink written 3.9 years ago by Naresh60
1
gravatar for n00bgenome
3.9 years ago by
n00bgenome20
n00bgenome20 wrote:

I think I have a partial answer based on the following question:

Clustering Using Phylip

Fundamentally, my question is if one has hundreds or thousands of sequences that have been aligned, how does one group them together based on sequence similarity into a tree or clusters based on sequence similarity?

So if he has 320 motifs, Lars recommended that he use MCL to cluster them into different groups. He would load the 320 AA sequences and use Blast for the multiple sequence alignment. Then he would use the MCL and different inflation parameters to get varying levels of groupings (coarse to fine). Then, you could output that into a tree. That seems like it'll get me what I want.

However, if I choose that method, how does it output the name of the nestings? And this method appears to only use Blast? Is there a similar tool that use Muscle Alignments like I have done?

ADD COMMENTlink written 3.9 years ago by n00bgenome20
1
gravatar for 5heikki
3.9 years ago by
5heikki8.7k
Finland
5heikki8.7k wrote:

I would take unaligned sequences and do something like this:

for i in $(seq 60 5 95); do cd-hit -i seqs.fa -o seqs.$i.fa -c 0.$i -M 0 -T 0 -g 1; done

I would study output after each iteration to see if cluster count is what I wanted (if yes, then kill the loop).

Or if it really has to be some exact number of clusters, then I would make a distance matrix from the alignment file and do k-means clustering in R.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by 5heikki8.7k
0
gravatar for n00bgenome
3.9 years ago by
n00bgenome20
n00bgenome20 wrote:

Thanks so much! cd-hit gets rid of many of the redundant species.

I meant to ask a new question in a different thread now, but I've unintentionally "answered" here, so I thought it would be better to ask my next question than to post another thread.

So I'd like to take my file of sequences and do a blastall to get an all-against-all matrix that I may use for inputting into MCL.

However, I'm a bit unclear on how to use blastall. Looking at prior conversations, from what I understand, I should format my file of sequences using formatdb, and then use that as the input to query my original file of sequences with blastall. When I try to do this using command line, I got an error:

legacy_blast.pl formatdb -i inputfilei.fasta -p T --path mypath

BLAST Database creation error: FASTA-Reader: No residues given

Am I going about this the right way? Why did I get this error?

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by n00bgenome20
1

Use this guide for blast+ applications: http://www.ncbi.nlm.nih.gov/books/NBK279690/

makeblastdb is the new equivalent in blast+ package which is what you should have got. blastn or blastp would be the programs to use. Don't use blastall.

Take a look at this also: http://www.nature.com/protocolexchange/protocols/2740

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by genomax80k

I would still do an MSA on the set left over after CD-HIT and go from there.

ADD REPLYlink written 3.9 years ago by genomax80k

Is blastP inferior to Muscle?

I could either do k-means clustering or MCL, and I think MCL would give me what I want, but it only takes in blast files. After reading more, I have realized I am interested in phenetic differences, not phylogeny.

ADD REPLYlink written 3.9 years ago by n00bgenome20
1

Is blastP inferior to Muscle?

They are two different tools meant for different purposes. With blastp you are trying to identify homologs. Once that is done one would use a MSA program like muscle to align those putative homologs.
If you think MCL is going to work for you then go for it. You may come back and decide to try something else after you see the results. This is all part of learning.

ADD REPLYlink written 3.9 years ago by genomax80k

Oh, I think I understand now. I should do a Muscle MSA after the CD-HIT, and then use those sequences to do Blast. These sequences should first be made into a database with makeblastdb and then Blasted all against all using blastp (why not blastall -p blastp?). I can then submit this to MCL. Is that right? Thanks for the help!

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by n00bgenome20

If you have reduced your data down to a non-redundant set then do the blast (if that is needed for MCL). With blastp you probably want to select only stringent top hits because otherwise you would have a ton of results to weed through (as these are all homologs).

ADD REPLYlink written 3.9 years ago by genomax80k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour