Question

Which alignment for whole bacterial genomes for phylogenetic tree?

1

Entering edit mode

6 months ago

Dom ▴ 10

I am supposed to build a phylogenetic tree out of around 200 bacterial genomes. They are not closely related.

The first step is as far as I know alignment. My first plan was to do multiple sequence alignment with the whole genomes. I have tried MEGA, MUSCLE, MAUVE and then some. Most often I get errors for some sequences being too long. So I thought about doing it with a 16S rRNA, but for many species they aren't available. The genomes I have I got from NCBI RefSeq DB via Accession Number and they are in FASTA format. I am getting pretty desperate now since it takes way longer than it was supposed to. I am thankful for any tipps.

phylogenetic-tree bacterial-genomes • 1.7k views

ADD COMMENT • link updated 6 months ago by laczkol • 0 • written 6 months ago by Dom ▴ 10

2

Entering edit mode

Hi,

I'd go for the 16s sequences. Even if the 16s are not in the database per se, you should be able to extract them from the genome in many cases, see here: https://bioinformatics.stackexchange.com/questions/11489/looking-for-a-tool-to-find-16s-rrna-in-hundreds-of-genomes for example. A second possible approach is to identify 1:1 orthologues as annotated AA sequences from the species using e.g. prokka and orthofinder, align them and concatenate the alignments to make a phylogeny. I wouldn't recommend aligning whole genomes of unrelated species on the DNA level using standard MSA tools. Even if they could handle the amount of data the sequences are too divergent and these tools do not deal with inversions.

ADD REPLY • link 6 months ago by Michael 54k

0

Entering edit mode

Thank you! I will report back once I have done this.

ADD REPLY • link 6 months ago by Dom ▴ 10

1

Entering edit mode

It's important to know why you want to generate a species tree. What question are you trying to answer?

ADD REPLY • link 6 months ago by Mark ★ 1.5k

0

Entering edit mode

It is to showcase the relatedness of those species and strains in a project paper. It is not necessarily to answer a question but to make it easier or faster to understand. My supervisors want them also.

ADD REPLY • link 6 months ago by Dom ▴ 10

score 1 · Answer 1 · 2023-10-16

Hi,

If they are not closely related, you can try wgMLST or cgMLST. With tools like PopPUNK or chewBBACA you can compute distances both reference-free or with a given reference.

PopPUNK works with variable length k-mers to construct a core- and accessory-genome which is then used to make phylogenetic trees.

chewBBACA uses allele information on a set of genes to compute the core-genome distances.

Check the documentation which tool fits you best.

Cheers,

Michael

score 1 · Answer 2 · 2023-10-16

The first step is: don't.

You can't align whole genomes like that with any appreciable accuracy beyond 1 or two pairwise alignments. The algorithms to do this simply don't exist. 200 genomes is impossible, even if they were very closely related.

As others have said, consider MLST, 16S or (IMO) better yet, kmer and sketch based distance trees with tools like mash/mashtree.

score 0 · Answer 3 · 2023-10-16

To add to Joe's answer, I'd recommend CompareSketch in BBTools, which does both whole-genome AND 16S comparisons.

For example, I put 4 bacteria in a directory (after renaming them with the taxid for convenience):

ls
tid_1121402_Desulfobulbus_elongatus.fna.gz  tid_869814_Desulfobulbus_alkaliphilus.fna.gz
tid_1391911_Micrococcus_aloeverae.fna.gz    tid_993416_Micrococcus_cohnii.fna.gz

Then I ran this:

comparesketch.sh alltoall *.fna.gz format=3

Indexed 38716 unique and 38921 total hashcodes.
Loaded 4 sketches in 0.840 seconds.
#Query  Ref     ANI     QSize   RefSize QBases  RBases  QTaxID  RTaxID  KID     WKID    SSU
tid_1121402_Desulfobulbus_elongatus.fna.gz      tid_869814_Desulfobulbus_alkaliphilus.fna.gz    80.488  3899064       4095408 3961953 4201268 1121402 869814  0.243   0.255   95.269
tid_1391911_Micrococcus_aloeverae.fna.gz        tid_993416_Micrococcus_cohnii.fna.gz    86.888  2416945 2275564       2419301 2275595 1391911 993416  1.979   2.103   98.106
tid_869814_Desulfobulbus_alkaliphilus.fna.gz    tid_1121402_Desulfobulbus_elongatus.fna.gz      80.492  4095408       3899064 4201268 3961953 869814  1121402 0.243   0.255   95.269
tid_993416_Micrococcus_cohnii.fna.gz    tid_1391911_Micrococcus_aloeverae.fna.gz        86.885  2275564 2416945       2275595 2419301 993416  1391911 1.979   2.103   98.106

Ran 8 comparisons in 0.055 seconds.
Total Time:     0.895 seconds.

The results show you the query, ref, and then other stuff. Most notable are the ANI column - which is the approximate ANI estimated from kmer instersection - and the SSU column - which is the exact identity of the 16S via alignment. The WKID is the percent identity in kmer space.

Unfortunately, I can't guarantee that it will always find the 16S, especially if it is only present in the assembly as a partial fragment, but it does a pretty good job.

score 0 · Answer 4 · 2023-10-17

popPUNK and chewBBACCA have already been mentioned as viable alternatives. Depending on your ultimate goal, another option might be to identify the core genes of your dataset first. To do this, you would need to standardize the annotations (i.e. use the same annotation tool with the same parameters, e.g. prokka), then you could run panaroo with the option -a core to get the concatenated core gene alignments. This core genome alignment could be used as input to any software designed to reconstruct phylogenies, IQtree with model testing turned on might be a good option. I should also note that this approach is much more resource hungry than the previously proposed k-mer-based approaches.