What does it "mean" when MrBayes chooses rtREV as substitution model?
2.5 years ago
Michael 54k

I am doing a phylogenetic analysis of a highly conserved protein's amino-acid sequences across a wide range of eukaryote taxa, using MrBayes' MCMC method. When run in mixed mode, MrBayes consistently chooses the rtREV model, while ML tools (ProtTest 3 and IQ-TREE) choose LG+G or WAG+G as the best-scoring model. The rtREV phylogenies generally look "better" under both Bayesian and ML inference: they group clades more in accordance with expectations and reconstruct more, and more sensible, bifurcations; the MrBayes phylogenies are in turn generally "better" than the ML ones. The choice of model was independent of the multiple sequence alignment software, of my editing of the alignments, and of other parameters (+I, +F, +G).

rtREV was initially developed for retroviral and other reverse-transcribed sequences, but my sequences are not known to be retroviral. My question is simply: does that choice "mean" anything? E.g.:

  • Are my sequences simply very diverse, like in retroviral sequences?
  • Or should I look into something else, like horizontal gene transfer via retroviruses or retrotransposons?
  • Something else? Or nothing?
mrbayes substitution phylogeny matrix
2.5 years ago
Mensur Dlakic ★ 27k

When you say that MrBayes consistently chooses the rtREV model, do you mean that sump at the end of the run shows only rtREV, with a posterior probability of 1.0?

The fact that an unusual substitution matrix works best would indicate that your protein's substitution pattern does not follow general trends. I don't think there is any other meaning to it, or anything you need to worry about, as long as other indicators show that this was a normal run. For example, do you get good acceptance rates for swaps between adjacent chains? Those are in the section at the end that looks like this (just above the empty diagonal):

        1     2    3    4
--------------------------
1 |        0.53 0.26 0.12
2 |  3347       0.55 0.31
3 |  3295  3337      0.53
4 |  3332  3396 3293

Does the average standard deviation of split frequencies (ASDSF) converge to a low number? Does the chain sampling look random (look for the part that says Overlay plot for both runs:)?

   +------------------------------------------------------------+ -18196.08
   |      2                                                     |
   |              2                                             |
   |         2                            2          2    1  22 |
   | 1          *   1                               2  1        |
   |  2  1        1         1 1 2       2          1  1   21    |
   |*          1   2 1  2           2  1   2      2        2    |
   |   22   21   2 1 2 21           1 1  2 11            2      |
   |   1      2     2 1    12      * 12     222 2            1 1|
   | 2     1  12      21 *     2 12  2    1          12 21  1   |
   |       2              2              1   1 * 2 21           |
   |    1                 1  12               1        2        |
   |                           1       2        11            12|
   |      1                2    1       1                   2   |
   |  1  2       1           2   2                1             |
   |        1                     1                     1       |
   +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ -18206.28
   ^                                                            ^
   100000                                                       1000000

If the two chains look more like a straight line and the 1s and 2s completely overlap, that would indicate something wrong with your run.

If all these indicators are in order, I think your result is reliable.


Hi Mensur, thank you for this helpful answer. Going through all of my runs, I noticed that I had the wrong impression: rtREV was in fact chosen often, but not always; sometimes WAG is the best model. I have done quite a few runs now using different tools and parameters:

clustalo-6.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
mafft-6.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
mafft.7.aln.fa.mixed-gamma8.IG.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
mafft.7.aln.fa.mixed.IGF.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
mafft.7.aln.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
mafft.7.endclipped.aln.fa.mixed-gamma8.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
mafft.7.endclipped.aln.fa.mixed.IGF.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
mafft.7.endclipped.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle-2.endclipped.faa.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
muscle-2.manuallyclipped.faa.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle-2.shortnames.faa-gb.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle-5.endclipped.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle-6.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle.clustalo-6.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
muscle.shortnames.fa-gb2.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
muscle.shortnames.fa-gb.mixed.IG.nex.out:      aamodel[Rtrev]            1.000          0.000          1.000          1.000
tcoffee-7.fa.mixed.IGF.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
tcoffee-7.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000          0.000          1.000          1.000
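Incidentally, a listing like the one above can be tallied quickly with standard text tools. A hypothetical sketch (in practice you would run `grep -h 'aamodel\[' *.out`; here three of the lines are inlined so the example is self-contained):

```shell
# Tally which amino-acid model sump reported as the winner per run.
# The printf stands in for:  grep -h 'aamodel\[' *.out
printf '%s\n' \
  'clustalo-6.aln.fa.mixed.IG.nex.out:      aamodel[Rtrev]            1.000' \
  'mafft.7.aln.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000' \
  'tcoffee-7.fa.mixed.IG.nex.out:      aamodel[Wag]              1.000' \
  | grep -o 'aamodel\[[A-Za-z]*\]' | sort | uniq -c | sort -rn
```

This prints one line per model with its count, most frequent first.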

The file names contain the aligner (muscle, mafft, clustalo, tcoffee); I think I will continue with mafft or tcoffee from here.

The numbers denote different sets of ortholog sequences (7 is the newest and contains the most sequences).

gb: clipped with Gblocks (I have totally abandoned Gblocks; the results are much worse than without it). endclipped: manually clipped the leading and trailing unaligned regions up to the first conserved column. manuallyclipped: clipped the alignment manually, removing larger low-quality regions (abandoned this as well; the result doesn't look good).

The number of generations was 1M, except for sequence set -2, where it was 2M, and -5, where it was 5M.

I am using 8 parallel chains at the moment. The model setting is either invariant+gamma (with 4 gamma categories, or 8 for the runs labeled gamma8) or invariant+gamma plus empirical frequencies (IGF).

I will post the other info you requested in another comment.


Hi Mensur, I am adding the output from one representative run, where rtREV is the best model.

Overall, the run doesn't seem to converge too well. After 1M generations I got:

  mafft.7.endclipped.aln.fa.mixed.IG.nex.out:
  Average standard deviation of split frequencies: 0.048896

The average standard deviations for runs that result in WAG are somewhat higher (~0.09).

Overlay plot for both runs (1 = Run number 1; 2 = Run number 2; * = Both runs)

  +------------------------------------------------------------+ -94808.29
  |               2    1                                       |
  |                             1                              |
  |           1       1              2                     1   |
  |          1     2    12         2         1       2   2     |
  |                     2 *      1      1           2  2   2   |
  |     2   1  2  1  12         2           1  1122       1  2 |
  |        1  2 1          2 2*     2  *2  1 2 2        1 2 2 1|
  |    212  2       *2 2   111   2 1     22   2 2  1         1 |
  |            1                      2   12     1   1   1     |
  |  2     2     * 1           2  2 1       2     1   21    1 2|
  |  11   1     2        1     1                    1 1 2      |
  |11     2                 2            1         2           |
  |   2      2                                1                |
  |      1                        1                            |
  |22  1                             11                        |
  +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ -94838.59
  ^                                                            ^
  250000                                                       1000000 

Chain swap information for run 1:

              1      2      3      4      5      6      7      8
       ----------------------------------------------------------
     1 |          0.28   0.04   0.01   0.00   0.00   0.00   0.00
     2 |  35450          0.27   0.05   0.01   0.00   0.00   0.00
     3 |  35755  35621          0.34   0.06   0.01   0.00   0.00
     4 |  35674  35790  35874          0.33   0.07   0.00   0.00
     5 |  35831  35727  35615  35499          0.31   0.01   0.00
     6 |  35807  35727  35674  35580  36079          0.09   0.00
     7 |  35560  35899  35867  35935  35834  35526          0.02
     8 |  35528  35861  35525  35671  35904  35481  35706

Chain swap information for run 2:

              1      2      3      4      5      6      7      8
       ----------------------------------------------------------
     1 |          0.31   0.05   0.00   0.00   0.00   0.00   0.00
     2 |  35738          0.33   0.06   0.00   0.00   0.00   0.00
     3 |  35996  35592          0.31   0.04   0.00   0.00   0.00
     4 |  35794  35808  35953          0.24   0.02   0.00   0.00
     5 |  35989  35703  35673  35607          0.13   0.00   0.00
     6 |  35690  35793  35721  35732  35264          0.26   0.08
     7 |  36008  35458  35909  35599  35453  35624          0.41
     8 |  35714  35744  35813  35679  35399  35967  35580

  Upper diagonal: Proportion of successful state exchanges between chains
  Lower diagonal: Number of attempted state exchanges between chains

Nothing seems too worrisome here other than ASDSF not converging well.

A couple of other things you may want to consider. Given that in all cases you end up with a posterior probability of 1.0 (pure models rather than mixtures), you may want to try using a fixed model - maybe whatever IQ-TREE's ModelFinder tells you is the best model combination. If I am not mistaken this is a single protein, so 8 chains may be overkill; I would consider nchains=4 instead. You may want to increase the number of chain swaps from the default of 1 to 2 (mcmcp nchains=4 nswaps=2), or even to 4 swaps if you stay with 8 chains. Lastly, even though the chain-swapping stats are not bad, you could increase the acceptance rate a bit by lowering the heating coefficient from its default of 0.1 to 0.05-0.075 (something like mcmcp nchains=4 nswaps=2 temp=0.075).
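As a minimal sketch, those suggestions could go into a mrbayes block roughly like this (fixed(lg) is only an example here; substitute whatever model ModelFinder actually picks for your data):

```
begin mrbayes;
  prset aamodelpr=fixed(lg);            [fixed model instead of aamodelpr=mixed]
  lset rates=invgamma;
  mcmcp nchains=4 nswaps=2 temp=0.075;  [fewer chains, more swaps, lower heating]
  mcmc ngen=1000000;
end;
```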

It would appear that trimming has a large effect on your end results. I sometimes use trimAl with -gt 0.5, which removes all columns with more than 50% missing residues; other than that I leave the alignment alone. A more objective, entropy-based method is BMGE, which automatically selects the best columns for you. You can weight it to suit your sequences by selecting low-number BLOSUM matrices if your alignment is divergent (it sounds like you need BLOSUM30), or higher BLOSUM numbers if the sequences are similar. I have had good experience with BMGE (there is an online form for it), and it is almost guaranteed to leave fewer columns than either manual or trimAl trimming. That should make your tree reconstructions faster and allow you to test a larger number of parameters if needed.


Thank you again. I have now experimented a little more with the parameters you proposed, including setting the model to mixed, and to fixed WAG and LG (IQ-TREE's choice), but I would rather stick with 8 chains. The main problem is still that I don't get much bifurcation for the deepest divergences in the Metazoa (the tree is effectively a comb there). I am now running the best alignment (tcoffee without any trimming) with your suggested parameters, but on 8 chains, to 5M generations, to see what happens.

If that run doesn't work either, I might have to break the alignment down into smaller, more similar groups; possibly there is not enough phylogenetic information to resolve the deepest branches?

Here are my parameters:

begin mrbayes;
prset aamodelpr=mixed;
lset rates=invgamma;
lset ngammacat=4;
mcmc nchains=8 nswaps=4 temp=0.075 ngen=5000000 printfreq=1000 starttree=random append=no;
set seed=1234567;
sumt burnin=5000;
sump burnin=5000;
end;

Here is my best tree so far, if you are interested:


I don't mean to completely reshuffle how you do these things, as we all have things that work for us. So take what I say below as suggestions that may work, at least based on the information you provided.

A general suggestion: try running your reconstructions for 1-2 million generations, as it should be obvious even at that point whether they are converging or not. If they are, you can always restart the run and go to 5 million; if not, you are not wasting time and can move on to the next set of parameters. I usually look at the ASDSF at about 250-300 thousand generations, and I will kill the run and start a new reconstruction if it is >0.1. If your alignment is small and the runs are quick, you may want to wait until 0.5 million generations. If at that point the ASDSF is >0.1, it is almost a guarantee that something is wrong with the parameters, and convergence to ASDSF < 0.01 either won't happen or will take forever.
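One way to sketch that early check from the command line, under the assumption that the last column of the .mcmc diagnostics file MrBayes writes alongside a two-run analysis is the ASDSF (true for recent 3.2.x versions; verify against your own file's header):

```shell
# On a real run this would be something like:
#   tail -n 1 myrun.nex.mcmc | awk '{print $NF}'
# Demonstrated here on a fabricated diagnostics line so the sketch
# is self-contained (columns: generation, lnL run1, lnL run2, ASDSF):
printf '500000\t-94812.31\t-94815.12\t0.112\n' | awk '{print $NF}'
```

A value above 0.1 this far into the run would argue for killing it and trying different parameters.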

I think the problem could be the very low burn-in fraction in your sumt and sump commands. You throw out only 0.1% of the data, which is not nearly enough. Early in the sampling the chains are all over the place and nowhere near convergence, which is why there is a burn-in in the first place. Without throwing out a good chunk of the early data (10-25%), it would be very difficult, if not impossible, to get good statistics.

In your mcmc command you don't specify the burn-in, which should default to relburnin=yes burninfrac=0.25. It wouldn't hurt to specify these values on the mcmc line, just to be sure. That means that during the MCMC run the statistics are calculated by throwing away the first quarter of all your generations, while at the summarization stage only 0.1% of the data is thrown out. So even if you had good statistics during the MCMC run, that would not necessarily be reflected in your run summary, or the final tree, because not enough of the early, faulty data has been thrown out.

I suggest you add relburnin=yes burninfrac=0.25 to your mcmc command. At the very least, I think the burn-in fraction should be 0.1 (10% of the data). Whatever number you use with burninfrac, the same number should be in the sumt and sump commands if you want a tree that matches your MCMC run summary:

sump relburnin=yes burninfrac=0.25;
sumt relburnin=yes burninfrac=0.25;
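Putting it all together, a consistent sketch (the same 25% relative burn-in at both the run and the summary stages, with your other settings left as they are) would look something like:

```
begin mrbayes;
  mcmc ngen=5000000 nchains=8 nswaps=4 temp=0.075
       relburnin=yes burninfrac=0.25;
  sump relburnin=yes burninfrac=0.25;
  sumt relburnin=yes burninfrac=0.25;
end;
```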

I forgot to mention that for some of your earlier runs it may be worth simply reloading the run files and doing sumt and sump again with a 0.25 burn-in fraction. It would not surprise me if that gives you better statistics and more acceptable trees.


Thank you for this hint; I shall try re-running with the updated burn-in parameters.


Hi, I just wanted to let you know that I am now on the 10th iteration of orthologue sets and am using your parameters with mafft and tcoffee alignments. I have also tried multiple trimming options, but decided not to use any, because the resulting trees are always worse. RAxML trees seem to give near-perfect topology, but the same nodes that have no bifurcation in MrBayes have very low bootstrap support (1-9) in RAxML, despite earlier and later nodes having better support. IQ-TREE, however, gives much better branch support at these nodes (~30-88). So, may we assume that there is really not enough phylogenetic information to resolve the deepest metazoan diversification using my gene of interest? Possibly there was some purifying selection on this gene during that time (~600 Mya), stronger than before or after?

At least, the trick seems to be to use tcoffee or mafft and to not trim the alignment at all.


It is tough to make phylogenetic inferences about distantly related organisms using a single gene, especially for a large number of organisms. I would consider concatenating several genes.

I don't know exactly what you have or the point you are trying to make, but it sounds like the outcome is heavily dependent on the starting alignment, and especially on whether the alignment was trimmed or not. I would be hesitant to make strong claims when the end result is not always reproducible and is heavily influenced by initial parameters. It is normal that not all parts of the tree will have strong support, and sometimes it is good enough to let the tree speak for itself.


The focus here is more on the specific protein and its distribution across the tree of life, which we will follow up with some experimental verification. I just wanted to include the phylogeny to show that the protein behaves "properly" for most clades (which I think it does; I would rather say that for a single gene the result is now quite OK), and where it is perhaps the result of contamination or multiple organisms in the samples. I will have to write some part of the Discussion about the phylogeny, so possibly I need to speculate a little about the deeper metazoan evolution.
