Simulation of dna sequences through substitution rates
2
0
Entering edit mode
6 months ago

Hi!, I'm looking for a little bit of guidance.

My question is regarding the simulation of DNA sequences with a fix substitution rate.

The majority of the programs for simulating sequences use Continuous Markov Chain Models, with different instantaneous rate matrices that can be modified according to the free parameters of each model (JC69, HKY, GTR, etc..). For example, in HKY the free parameters are the transition/transversion (kappa) rate and nucleotide frequencies (pi).

In these programs (for example INDELible, Seq-Gen, Pyvolve), the substitution rate is expected to be represented by the branch length of the phylogenetic tree used to create the simulations.

INDELible paper.

"..Rate matrices are rescaled by INDELible such that the branch lengths represent the expected number of substitutions per site (or the average expected number of substitutions per site under a heterogeneous-sites model)." (P2 - Simulation of substitutions)

INDELible: a flexible simulator of biological sequence evolution

"...Each branch length is assumed to denote the mean number of nucleotide substitutions per site that will be simulated along that branch..." (P2 - Algorithm)

Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees

Let's say I want to simulate sequences with a mean substitution rate of 2.5 x 10^-7 per year. How would I prepare a tree that can accurately represent that substitution rate and be used in the simulation? In other words, how can the length be represented as a substitution rate accurately?

Any insight is appreciated!. Thanks in advance!

simulation rate phylogeny substitution • 296 views
0
Entering edit mode
4 months ago
Klaus S ▴ 140

In phylogentic analysis or simulations of sequences the edge lengths of the tree is often given in expected number of substitution per site. However what we are interested in is the (mean) substitution rate (e.g. 2.5 x 10^-7 per year) and the divergence time (in years). The expected number of substitutions is the product of rate and time. Unfortunately time and rate are not (easily) identifiable, but one often knows the divergence time through the fossil record and can therefore derive the substitution rate.

So for example if the distance from a species / tip to the root is one (expected number of substitution per site), this could mean that you have a (mean) substitution rate of 2.5 x 10^-7 per year and time difference of 4 million years (from tip to root), but it could also be a substitution rate of 1 and time difference of 1.

0
Entering edit mode
4 months ago

You only need to multiply branch lengths (in years) by the right factor (the rate):

the programs you cite (INDELible, etc) take a branch of length _x_, and simulate exactly _x_ substitutions per site. There is no time information in absolute metric in that is ever used by the program.

So if you say the root is at 4My, and you want to simulate a rate of 2.5e-7 substitutions/site/year, take the input tree with branch lengths in years, and multiply each branch length by the rate. The result is a tree with branch lengths in expected number of substitutions per site, that is all you need for the sequence simulation.

#### Simulation using a molecular clock model

The explanation above is a very simplified model of evolutionary rate: a strict clock (=constant rate) with no stochastic variation. Maybe you are interested in modelling more realistic clock models, for example incorporating stochasticity and rate variation. To do so you can generate your trees with the R package simclock.