Question

Whole genome species clustering

1


4.2 years ago

robert.murphy ▴ 110

I am working with a poorly defined genus of fungi with very few publicly available genomes, and I want to cluster all the available ones plus my novel ones into putative species. I am unsure of a good workflow for this.

Would it be better to use current phylogenetic workflows (multiple sequence alignment and tree construction) and then cluster the genomes into species based on some phylogenetic distance threshold?

OR

Would it be better to use some form of multilocus sequence analysis (MLSA, with BUSCO datasets for example) or core genome analysis (with Roary), take the SNP differences between the MLSA sequences or core genomes, and then cluster the assemblies into putative species based on a SNP count threshold?

I apologise if this is a naive question, but any help would be appreciated.

phylogeny assembly phylogenetics • 2.2k views

ADD COMMENT • link updated 21 months ago by pmiller ▴ 10 • written 4.2 years ago by robert.murphy ▴ 110

1


What is your ultimate aim? Are you looking to see which of the established species your sample is closest to?

This appears to be a useful paper.

ADD REPLY • link 4.2 years ago by GenoMax 152k

0


Yes, that would be part of the goal. The genus is Termitomyces, which has very few clearly labelled species on NCBI.

We are mining the genomes for secondary metabolite production, so it would be nice to at least have species clusters, even without knowing the exact species name our assemblies belong to, so that we can identify any patterns of secondary metabolite production. I will check this paper out, thanks.

Along with that, I want to display annotation statistics and repeat region statistics on a per-species basis to see whether assemblies within a putative species show the same patterns.

I should note that my novel assemblies and, as far as I am aware, all other assemblies for the genus are draft genomes.

ADD REPLY • link 4.2 years ago by robert.murphy ▴ 110

0


Sorry for reviving this old thread, but do you know if trees constructed using the whole-proteome FFP method are still preferable to trees based on other distance metrics?

ADD REPLY • link 3.8 years ago by robert.murphy ▴ 110

score 6 · Accepted Answer · 2021-04-21

I don't think there is a "best" approach in your situation, given that:

1) the taxonomy of the group is probably still flaky;

2) being draft genomes, the assemblies are probably spotty, with sizeable gaps and maybe even contaminants. Repeat regions are often left out of such assemblies.

Actually, a good workflow would perform different types of analyses, in order to get a more complete and solid overall picture about the genus. Some suggestions:

ReferenceSeeker will give you the closest species in terms of k-mer (min-hash) distances and average nucleotide identity (ANI). With it you can easily find out whether there are any really close genomes.
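To make the k-mer distance idea concrete, here is a minimal sketch, not what ReferenceSeeker or Mash actually run. It computes an exact Jaccard index over k-mer sets and converts it with the Mash distance formula D = -(1/k) ln(2j / (1 + j)). Real tools estimate the Jaccard index from MinHash sketches and use canonical k-mers; the function names, the toy sequences and the small `k` are assumptions for illustration only.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets: shared k-mers / all distinct k-mers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); defined as 1.0 when j is 0."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Toy sequences (made up); real genomes would be read from FASTA files.
seq_a = "ACGTTGCAAGCTTGACCGTA"
seq_b = "ACGTTGCAAGGTTGACCGTA"  # one substitution relative to seq_a
k = 5  # far smaller than the k values used on real genomes

j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
print(f"Jaccard = {j:.3f}, Mash distance = {mash_distance(j, k):.4f}")
```

A single substitution removes up to k shared k-mers, which is why the distance responds to sequence divergence without needing any alignment.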

Mashtree uses the same kmer (min-hash) distances to group genomes in a dendrogram. The authors don't consider this dendrogram a phylogeny, but I think this dendrogram probably reflects the phylogenetic history of the genomes, anyway. Thus, it may complement a more traditional core genome phylogeny. Whole genome alignments probably aren't a good option, unless all genomes are really close (and, I would argue, also really high quality). Mashtree works as a substitute for whole genome alignments.
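If you do want to turn pairwise distances into putative species groups, one simple option is single-linkage clustering with a cutoff. The sketch below assumes you already have a pairwise distance table (for example from a min-hash tool); the assembly names, the distances and the `0.05` threshold are hypothetical placeholders, not a recommended species cutoff for fungi.

```python
def cluster_by_threshold(names, dist, threshold):
    """Single-linkage clustering with union-find.

    Genomes joined by any chain of pairwise distances <= threshold
    end up in the same cluster.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical assemblies and distances, for illustration only.
names = ["asm_A", "asm_B", "asm_C", "asm_D"]
dist = {
    ("asm_A", "asm_B"): 0.01,
    ("asm_A", "asm_C"): 0.12,
    ("asm_A", "asm_D"): 0.13,
    ("asm_B", "asm_C"): 0.11,
    ("asm_B", "asm_D"): 0.12,
    ("asm_C", "asm_D"): 0.02,
}

for cluster in cluster_by_threshold(names, dist, threshold=0.05):
    print(cluster)
```

Single linkage can chain loosely related genomes together, so it is worth trying a few thresholds and checking whether the groups agree with the dendrogram and the core genome phylogeny.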

A core genome phylogeny would complement the above analyses (but I don't think Roary works for fungal genomes). As there aren't many genomes available, you can get a SNP-based phylogeny, or even a maximum-likelihood or Bayesian phylogeny with all genes concatenated. Running the same version of BUSCO on all genomes would also give a good dataset for a phylogeny, in addition to informing on the overall quality of the genomes.
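As a rough illustration of the concatenation step, the sketch below keeps only the genes found in every genome, joins their alignments into one supermatrix, and counts pairwise differences. It assumes each gene has already been aligned separately with an aligner of your choice; the gene and assembly names and the toy sequences are made up, and the resulting supermatrix would normally go to a tree-building program rather than being used for raw SNP counts alone.

```python
def concatenate_shared_genes(alignments):
    """Build a supermatrix from per-gene alignments.

    alignments: {gene: {genome: aligned_sequence}}
    Only genes present in every genome are kept, in sorted gene order.
    Returns (supermatrix, shared_genes).
    """
    genomes = set()
    for per_genome in alignments.values():
        genomes.update(per_genome)
    shared = sorted(g for g, per_genome in alignments.items()
                    if set(per_genome) == genomes)
    supermatrix = {
        genome: "".join(alignments[g][genome] for g in shared)
        for genome in sorted(genomes)
    }
    return supermatrix, shared


def count_differences(a, b):
    """Count aligned columns where both sequences are resolved and differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = set("-NnXx?")  # gaps and ambiguous characters are ignored
    return sum(1 for x, y in zip(a, b)
               if x not in skip and y not in skip and x.upper() != y.upper())


# Toy per-gene alignments (hypothetical names and sequences).
alignments = {
    "gene1": {"asm_A": "ACGT-A", "asm_B": "ACGTTA", "asm_C": "ACCTTA"},
    "gene2": {"asm_A": "GGA", "asm_B": "GGA"},  # missing in asm_C, so dropped
    "gene3": {"asm_A": "TTN", "asm_B": "TAC", "asm_C": "TAC"},
}

supermatrix, shared = concatenate_shared_genes(alignments)
print(shared)
for name, seq in supermatrix.items():
    print(name, seq)
print(count_differences(supermatrix["asm_A"], supermatrix["asm_B"]))
```

Dropping genes that are missing from any genome is the strictest choice; with spotty draft assemblies you may prefer to tolerate some missing data and pad absent genes with gaps instead.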