Whole genome species clustering
15 days ago

I am working with a poorly defined genus of fungi with very few publicly available genomes, and I want to cluster all the available ones plus my novel ones into putative species. I am unsure of a good workflow for this.

Would it be better to use a current phylogenetic workflow of multiple sequence alignment and tree construction, and then cluster the genomes into species based on some phylogenetic distance threshold?

OR

Would some form of multilocus sequence analysis (MLSA, with BUSCO datasets for example) or core genome analysis (with Roary) be better, where we take SNP differences between the MLSA sequences or core genomes and then cluster the assemblies into putative species based on a SNP count threshold?

I apologise if this is a naive question, but any help would be appreciated.

phylogeny assembly phylogenetics • 210 views

What is your ultimate aim? Are you looking to see which of the established species your sample is closest to?

This appears to be a useful paper.


Yes, that would be part of the goal. The genus is Termitomyces, which has very few clearly labelled species on NCBI.

We are mining the genomes for secondary metabolite production, so it would be nice to at least have species clusters, even if we don't know the exact species name our assemblies belong to, so that we can identify any patterns of secondary metabolite production. I will check this paper out, thanks.

Along with that, I want to display annotation statistics and repeat region statistics on a per-species basis to see whether genomes within a putative species show the same patterns.

I should note that my novel assemblies and, as far as I am aware, all other assemblies for the genus are draft genomes.

15 days ago
h.mon 32k

I don't think there is a "best" approach in your situation, given

1) the taxonomy of the group is probably still flaky.

2) being draft genomes, they probably are spotty and have sizeable gaps, and maybe even contaminants. Repeat regions are often left out of such assemblies.

Actually, a good workflow would perform different types of analyses, in order to get a more complete and solid overall picture about the genus. Some suggestions:

ReferenceSeeker will give you the closest species, in terms of kmer (min-hash) distances and average nucleotide identity (ANI). You can easily find if there are some really close genomes with it.
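To make the min-hash idea a bit more concrete, here is a toy Python sketch of how a Jaccard estimate from two bottom-s sketches can be turned into a Mash-style distance, D = -(1/k) ln(2j / (1 + j)). It is only an illustration under my own assumptions: the function names, the SHA-1 hash, and the tiny k and sketch size are made up for the example, and it ignores details the real tools handle (reverse complements, much larger k and sketch sizes, faster hashing).

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq (toy version)."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(seq[i:i + k].encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def sketch(seq, k=7, size=50):
    """Bottom-s sketch: keep only the `size` smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sk_a, sk_b, size=50):
    """Estimate Jaccard similarity from the smallest hashes of the merged sketches."""
    union_bottom = sorted(set(sk_a) | set(sk_b))[:size]
    if not union_bottom:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance for k-mer size k."""
    if j <= 0:
        return 1.0  # nothing shared: report the maximum distance
    if j >= 1:
        return 0.0  # identical sketches
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    # Hypothetical toy sequences; real input would be whole assemblies.
    a = "ACGGCAAGCCGAACGGACCAGGCAGACCGGAAGCAGGCCA"
    b = a[:20] + "T" + a[21:]  # same sequence with one substitution
    j = jaccard_estimate(sketch(a), sketch(b))
    print("Jaccard estimate:", j)
    print("Mash-style distance:", mash_distance(j, 7))
```

Smaller distances mean more shared k-mers, so the genomes with the smallest distance to yours are the candidates worth a closer ANI comparison.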

Mashtree uses the same kmer (min-hash) distances to group genomes in a dendrogram. The authors don't consider this dendrogram a phylogeny, but I think this dendrogram probably reflects the phylogenetic history of the genomes, anyway. Thus, it may complement a more traditional core genome phylogeny. Whole genome alignments probably aren't a good option, unless all genomes are really close (and, I would argue, also really high quality). Mashtree works as a substitute for whole genome alignments.

A core genome phylogeny would complement the above analyses (but I don't think Roary works for fungal genomes). As there aren't many genomes available, you can get a SNP-based phylogeny, or even a maximum-likelihood or Bayesian phylogeny with all genes concatenated. Running the same version of BUSCO on all genomes would also give a good dataset for a phylogeny, in addition to informing on the overall quality of the genomes.
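Whichever distance you end up with (min-hash distances, 1 - ANI, or SNP counts from a core alignment), the clustering step asked about in the question can be sketched as single-linkage grouping under a cutoff. The snippet below is a minimal illustration, not a recommended species-delimitation method: the assembly names, the distances, and the 0.05 cutoff are all hypothetical placeholders, and a sensible threshold for this genus would have to come from your own data.

```python
def cluster_by_threshold(names, distances, threshold):
    """Single-linkage clustering: link any two genomes whose distance is <= threshold.

    names      -- list of genome identifiers
    distances  -- dict mapping (name_a, name_b) pairs to a pairwise distance
    threshold  -- cutoff below which two genomes are placed in the same cluster
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical assemblies and distances, for illustration only.
    names = ["asm_A", "asm_B", "asm_C", "asm_D"]
    distances = {
        ("asm_A", "asm_B"): 0.01,
        ("asm_A", "asm_C"): 0.12,
        ("asm_A", "asm_D"): 0.13,
        ("asm_B", "asm_C"): 0.11,
        ("asm_B", "asm_D"): 0.12,
        ("asm_C", "asm_D"): 0.02,
    }
    for members in cluster_by_threshold(names, distances, threshold=0.05):
        print(members)
```

Single linkage can chain distant genomes together through intermediates, so it is worth checking that the clusters agree with the dendrogram and the core genome phylogeny before treating them as putative species.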


Thank you very much for this response! I will try out these methods. Why could you not use maximum-likelihood or Bayesian phylogeny on core genome SNPs?


SNPs can be used for ML / Bayesian phylogeny estimation, but branch length estimation will be off due to ascertainment bias. I am not up to date with the latest versions, but I believe IQ-TREE and RAxML both implement a correction for such a case - be sure to read their manuals to find the correct options.
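To show where the bias comes from, here is a small Python sketch that splits an alignment into its variable columns (the "SNP alignment") and counts the constant columns that get thrown away. Those discarded constant sites are exactly what the model no longer sees, which is why branch lengths get inflated without a correction. The function name and the toy alignment are my own assumptions, and the sketch does not reproduce any particular program's correction.

```python
from collections import Counter


def split_variable_sites(alignment):
    """Return (snp_alignment, constant_counts) for a dict of aligned sequences.

    Columns where every sequence has the same unambiguous base are counted as
    constant; columns with two or more different bases are kept as SNPs.
    Columns mixing a single base with gaps or N are ignored in this toy version.
    """
    names = list(alignment)
    seqs = [alignment[n].upper() for n in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    variable_cols = []
    constant_counts = Counter()
    for i in range(length):
        column = {s[i] for s in seqs}
        bases = column & set("ACGT")
        if len(column) == 1 and len(bases) == 1:
            constant_counts[next(iter(bases))] += 1
        elif len(bases) > 1:
            variable_cols.append(i)

    snp_alignment = {
        n: "".join(s[i] for i in variable_cols) for n, s in zip(names, seqs)
    }
    return snp_alignment, dict(constant_counts)


if __name__ == "__main__":
    # Hypothetical three-genome alignment, for illustration only.
    toy = {"g1": "ACGTAC", "g2": "ACGTAT", "g3": "ACCTAC"}
    snps, constants = split_variable_sites(toy)
    print(snps)
    print(constants)
```

Keeping the per-base counts of constant sites around is useful, since some corrections need to know how many invariant sites were removed.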
