Question: Use PCoA plot rather than phylogenetic tree?
5 weeks ago, prfsullivan wrote:


I have a large dataset (>10,000) of 16S rRNA sequences. Rather than build a phylogenetic tree, I'd rather visualize the analysis on a 2D PCoA-like plot.

I plan to use a maximum likelihood method for the analysis and ultimately want to portray the data in a PCoA-like plot. Tree topology is not important for this. Is it possible to run an ML-based analysis and obtain just the resulting distance matrix? I'd like to use the matrix to create the PCoA plot. Also, is the dataset too large? Any software suggestions?

The goal of this analysis is to evaluate the relatedness of select strains (~300) against a larger global population.

Really appreciate the help!

Thanks, Peter

gene • 150 views
modified 5 weeks ago by h.mon • written 5 weeks ago by prfsullivan

Can you edit your post to make it clearer, please? What data do you have, and what do you want to highlight with the PCoA?

written 5 weeks ago by antonioggsousa

Ok, updated the post.

written 5 weeks ago by prfsullivan

Either I don't understand what you're trying to do, or I'm not familiar with the analysis you have in mind. Can you provide a reference paper that shows an analysis similar to the one you want to do, please?

From your description, and based on my background in microbial ecology, this sounds like a beta-diversity analysis to me (sorry if I misunderstood). If so, you need at least two samples.

If the aim is to evaluate the relatedness of strains, I believe this can be done through a phylogenetic tree. Of course, you will probably need to collapse some branches to make it readable. Are the 10K seqs non-redundant?

You can build a phylogenetic tree and run PCoA analyses with the QIIME2 software.

QIIME2 has many plugins that wrap third-party software tools. For phylogenetic trees, for instance, it offers FastTree, among others.

It also supports ordination methods such as PCoA, but you need to ensure that the data generated at the different steps are compatible. QIIME2 has many tutorials, workshops, and docs.

I hope this helps,


written 5 weeks ago by antonioggsousa

Thanks for the reply. I haven't been able to find a reference paper doing something similar but will keep looking.

The 10k sequences are what remain after removing redundant sequences.

I have a global dataset of 10,000 sequences. Of the 10,000, there are 300 sequences (strains) that I am looking to highlight. So, I'm basically doing a standard phylogenetic analysis. However, I'd prefer to portray the data in a PCoA plot rather than in a tree format.

From what I understand, QIIME2 can create PCoA plots, but only as part of a beta-diversity analysis of different populations.

Does that make sense?

written 5 weeks ago by prfsullivan

I don't think that you can do that, but I may be wrong.

PCA and PCoA are multivariate methods, so you need a set of observations measured across several variables (usually anywhere from a few to thousands). In your case I don't see what the observations and the variables would be. That's why PCoA is used in beta-diversity analysis: you have a set of samples/sites/communities (observations) across some OTUs/ASVs/16S seqs (variables). In that case you can build an ML tree, apply a phylogenetic beta-diversity distance such as UniFrac, and display the distance matrix with an ordination method such as PCoA. But in your case, if the 16S seqs are the variables, what are the observations (or vice versa)? I don't think that what you want to do is possible, but I may be wrong.


modified 5 weeks ago • written 5 weeks ago by antonioggsousa

Yeah, I'm a little concerned that I can't do that. One suggestion I heard was to bin the sequences into OTUs and then run a beta-diversity analysis.

Is there no tool available to assess phylogenetic relationships in a 2D plot? I've read about multidimensional scaling, which might be what I'm looking for.

written 5 weeks ago by prfsullivan

5 weeks ago, h.mon wrote:

As long as you can calculate a distance / similarity matrix between the sequences, you can then use this matrix in a number of visualization methods, such as PCA, PCoA, etc.
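As a minimal sketch of that idea, the snippet below runs a classical PCoA (metric multidimensional scaling) directly on a square distance matrix with NumPy. The function name `pcoa` and the toy 3 x 3 matrix are made up for illustration; in practice the matrix would come from whatever distance calculation you choose.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical PCoA (metric MDS) of a square, symmetric distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    # Double-centre the squared distances (Gower's transformation).
    row_mean = d2.mean(axis=1, keepdims=True)
    col_mean = d2.mean(axis=0, keepdims=True)
    b = -0.5 * (d2 - row_mean - col_mean + d2.mean())
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (from non-Euclidean distances) are clipped to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals

# Toy example: three points forming a 3-4-5 right triangle.
toy = np.array([[0.0, 3.0, 4.0],
                [3.0, 0.0, 5.0],
                [4.0, 5.0, 0.0]])
coords, eigvals = pcoa(toy)
print(coords)  # one row of 2D coordinates per sequence, ready for a scatter plot
```

For roughly 10,000 sequences the dense matrix has 100 million entries (about 800 MB as 64-bit floats), so memory and the full eigendecomposition may become the bottleneck. The axis signs are arbitrary, so only the relative positions of the points are meaningful.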

The question is why you would do that. Relationships between sequences are best viewed and interpreted as a phylogenetic tree, because a tree captures the hierarchical relation between sequences in a way that a 2-D, non-hierarchical visualization cannot. In the past few years there has been a tremendous effort to create interactive and / or richly annotated trees that provide insights beyond the tree itself. I believe this would solve your problem, although so far it is not clear which problem you are trying to solve.

written 5 weeks ago by h.mon

Thanks for the reply.

Reasons I want a 2D non-hierarchical visualization:
1) With the 16S sequence (750 bp length), I've found that branches between two distant clades tend to have low bootstrap support. Thus, clear hierarchical relationships are often difficult to establish.
2) I'm only looking to see how each strain relates to all other strains. I don't really care about the evolutionary transitions that have occurred. I basically want a phylogenetic "map" of the dataset, if that makes sense.
3) With a very large dataset, the tree will be difficult to portray.

If I were to generate a distance matrix from a standard phylogenetic analysis (max likelihood, probably), could I then use that matrix to visualize the data in a PCA-like plot? Do you have any suggestions for software that can do this?
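For concreteness, here is a rough sketch of how a pairwise distance matrix could be computed straight from an alignment, without building a tree. It uses the Jukes-Cantor (JC69) correction as a simple stand-in for a fuller maximum likelihood model, which is an assumption and not something settled in this thread; the function names and the toy sequences are hypothetical.

```python
import math

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = mismatches = 0
    # Treat RNA 'U' as 'T'; skip gaps and ambiguity codes.
    pairs = zip(seq_a.upper().replace("U", "T"), seq_b.upper().replace("U", "T"))
    for a, b in pairs:
        if a in valid and b in valid:
            compared += 1
            mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared          # proportion of differing sites
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf                # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of JC69 distances for a list of aligned sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = jc69_distance(seqs[i], seqs[j])
    return dist

# Toy aligned sequences (made up for illustration).
aligned = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
matrix = distance_matrix(aligned)
```

With 10,000 sequences there are about 50 million pairs, so a pure-Python loop like this would be slow; it is only meant to show the shape of the calculation.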

Appreciate the help!

written 5 weeks ago by prfsullivan

To add a bit more perspective on the second reason: I work in a field where we isolate secondary metabolites to identify drug leads. Highly related organisms (species, more or less) tend to produce the same chemistry. More distantly related strains generally do not. Thus, I only really care whether strains form a "species cluster", and I would like to know and visualize the number of "species clusters" in my dataset.
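One simple way to count such clusters from a distance matrix is single-linkage grouping at a fixed cutoff, sketched below with a small union-find. The function name, the toy matrix, and the 0.03 threshold are placeholders chosen for illustration, not recommended values, and single linkage can chain loosely related strains together.

```python
def count_clusters(dist, threshold):
    """Single-linkage clusters: join any two items at distance <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    labels = [find(i) for i in range(n)]
    return len(set(labels)), labels

# Toy matrix: strains 0 and 1 are close, strains 2 and 3 are close.
toy = [
    [0.00, 0.01, 0.20, 0.21],
    [0.01, 0.00, 0.19, 0.20],
    [0.20, 0.19, 0.00, 0.02],
    [0.21, 0.20, 0.02, 0.00],
]
n_clusters, labels = count_clusters(toy, threshold=0.03)
print(n_clusters)  # number of "species clusters" at this cutoff
```

The labels can then be used to colour the points in a 2D ordination, so clusters and the plot come from the same distance matrix.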

written 5 weeks ago by prfsullivan