I am doing phylogenetic analysis of a highly conserved protein's AA sequences over a wide range of Eukaryote taxa using MrBayes' mcmc method. MrBayes, when run in mixed mode, consistently chooses the rtREV model, while ML methods (prottest3 and iqtree choose LG(+G) or WAG+(G) as best scoring model). The rtREV phylogenies look generally "better" in both Bayesian and ML, they group clades more in accordance with expectations and manage to reconstruct more - and more sensible, bifurcations - whereas the MrBayes phylogenies are generally "better" than ML. Choice of model was independent of the multiple sequence alignment software, me editing alignments, or other parameters (+I +F +G).
rtREV was initially developed for retroviral or reverse transcribed sequences, but my sequences are not known to be retroviral. My question is simply, does that choice "mean" anything, e.g.
- Are my sequences simply very diverse, like in retroviral sequences?
- Or should I look into something else, like horizontal gene transfer via retro-viruses or retro-transposons
- Something else? Or nothing?
Hi Mensur, thank you for this helpful answer. Going through all of my runs, I noticed that I was having a wrong impression. In fact, rtREV was chosen often, but not always, sometimes also WAG is the best model. I have done quite a few runs now using different tools and parameters:
The file-names contain the aligner (muscle, mafft, clustalo, tcoffee), think I will continue with mafft or tcoffee from here.
The numbers denote different sets of ortholog sequences (7 is newest and contains most sequences).
gb: clipped with Gblocks, endclipped: manually clipped leading and trailing unaligned regions until the first conserved column.(totally abandoned gblocks, results are much worse than without) Manuallyclipped: clipped the alignment manually, removing larger low-quality regions. (abandoned this, result doesn't look good)
Number of generations was 1M, except for sequence set -2, where it was 2M, and -5 where it was 5M.
I am using 8 parallel chains at the moment. Model setting is either Invariant+Gamma (4 gamma categories or 8 (gamma8)) or IG+empirical Frequencies.
I will post the other info you requested in another comment.
Hi Mensur, I am adding the output from one representative run, where rtrev is the best model.
Overall, the run doesn't seem to converge too well. After 1M generations I got:
The average standard deviation for runs which result in WAG are somewhat higher (~0.09)
Overlay plot for both runs (1 = Run number 1; 2 = Run number 2; * = Both runs)
Chain swap information for run 1:
Nothing seems too worrisome here other than ASDSF not converging well.
Couple of other things you may want to consider. Given that in all cases you end up with posterior probability of 1.0 (pure models rather than mixtures), you may want to try using a fixed model - maybe whatever IQ-TREE's model finder tells you is the best model combination. If I am not mistaken this is a single protein, so 8 chains may be an overkill. I would consider
nchains=4
instead. You may want to increase the number of chain swaps from a default of 1 to 2 (mcmcp nchains=4 nswaps=2
), and even to 4 swaps if you stay with 8 chains. Lastly, even though the chain-swapping stats are not bad, you could increase the acceptance rate a bit by lowering the lambda heating coefficient from its default of0.1
to0.05-0.075
(something likemcmcp nchains=4 nswaps=2 temp=0.075
).It would appear that trimming has a large effect on your end results. I sometimes use trimAl with
-gt 0.5
, which will remove all the columns that have more than 50% missing residues - other than that I leave the alignment alone. A more objective entropy-based method is BMGE which will automatically select the best columns for you. You can weigh it to suit your sequences by selecting low-number BLOSUM matrices if your alignment is divergent (sounds like you need BLOSUM30), or higher BLOSUM numbers if the sequences are similar. I have had good experience with BMGE (there is an online form for it here), and it is almost a guarantee that it will leave fewer columns than either manual or trimAl trimming. That should make your tree reconstructions faster and allow you to test a larger number of parameters if needed.Thank you again, I have now experimented a little more with the parameters you proposed including setting model to mixed, and fixed wag and lg (iqtree's choice), but I rather stick with 8 chains. The main problem is still that I don't get much bifurcation (the tree is a comb effectively) for the deepest divergences in the metazoa. I am now running the best alignment (tcoffee without any trimming) with your suggested parameters, but on 8 chains, to 5M generations and see what happens.
If that run doesn't work either, I might have to break down the alignment into smaller more similar groups, possibly there is not enough phylogenetic information to resolve the deepest branches?
Here are my parameters:
Here is my best tree so far, if you are interested:
I don't mean to completely re-shuffle how you do these things, as we all have things that work for us. So take what I will say below as suggestions that may work, at least based on the information you provided.
A general suggestion: try running your reconstructions for 1-2 million generations, as it should be obvious even at that point whether they are converging or not. If they are, you can always restart the run and go to 5 million. If not, you are not wasting time and can move on to the next set of parameters. I usually look at ASDSF at about 250-300 thousand generations, and I will kill the run and start a new reconstruction if it is >0.1. If your alignment is small and runs are quick, you may want to wait until 0.5 million generations. If at that point
ASDSF >0.1
, it is almost a guarantee that something is wrong with the parameters and the convergence toASDSF < 0.01
either won't happen or it will take forever.I think the problem could be with your very low burn-in fraction is
sumt
andsump
commands. You have only 0.1% of data thrown out, which is not nearly enough. Early on during the sampling the chains are all over the place and nowhere near convergence, which is why there is a burn-in in the first place. Without throwing out a good chunk of early data (10-25%), it would very difficult, if not impossible, to get good statistics.In your
mcmc
command you don't specify the burn-in, which should default torelburnin=yes burninfrac=0.25
. It wouldn't actually hurt to specify these values on themcmc
line, just to be sure. That means that during the MCMC run the statistics are calculated by throwing away the first quarter of all your generations, while at the summarization stage only 0.1% of data is thrown out. So even if you had good statistics during the MCMC run, that would not necessarily be reflected in your run summary, or the final tree, because not enough of early faulty data has been thrown out.I suggest you add
relburnin=yes burninfrac=0.25
to yourmcmc
command. At the very least I think the burn-in fraction should be 0.1 (10% data). Whatever number you use withburninfrac
, the same number should be insumt
andsump
commands if you want to get a tree that matches your MCMC run summary.I forgot to mention that for some of your earlier runs it may be worth simply reloading the run files and doing
sumt
andsump
again with a 0.25 burn-in fraction. It would not surprise me if that gives you better statistics and more acceptable trees.Thank you for this hint, I shall try to re-run with the updated burnin parameters.
Hi, I just wanted to let you that I am now at the 10th iteration of orthologue sets and am using your parameters with mafft and tcoffee alignments. I have tried multiple trimming options also, but decided not to use any because the resulting trees are always worse. RaxML trees seem to give near perfect topology, but the same nodes that have no bifurcation in MrBayes have very low bootstrap support (1-9) in RaxML, despite earlier and later nodes having better support. iqtree however has much better branch support also in these nodes (~30-88). So, may we assume that there is really not enough phylogenetic information to resolve the deepest metazoan diversification using my gene of interest? Possibly, there was some, stronger than before and after, purifying selection on this gene during that time (~600 Mya)?
At least, the trick seems to be to use tcoffee or mafft and to not trim the alignment at all.
It is tough to make phylogenetic inference about distantly related organisms using a single gene, especially for a large number of organisms. I would consider concatenating several genes.
I don't know exactly what you have or the point you are trying to make, but it sound like the outcome is heavily dependent upon the starting alignment, and especially whether the alignment was trimmed or not. I would be hesitant to make strong claims when the end result is not always reproducible and is heavily influenced by initial parameters. It is normal that not all parts of the tree will have a strong support, and sometimes it is good enough to let the tree speak for itself.
The focus here is more about that specific protein and its extension across the tree of life which we will follow up with some experimental verification. I just wanted to include the phylogeny and show that the protein behaves "properly" for most clades (which I think it does, I would rather say that for a single gene the result is now quite ok) and where it is maybe the result of contamination/multiple organisms in the samples. I will have to write some part in the Discussion about the phylogeny, so possibly I need to speculate a little about the deeper metazoan evolution.