Question: What parameters should be used for Seq-Gen?
gravatar for rzmn
4.1 years ago by
rzmn10 wrote:

I am CS major with little biology background. I am doing some experiments, and at some point, I need to simulate DNA alignments along a phylogenetic tree (which is produced under Yule process using r8s program). I was able to produce "some" sequence using Seq-Gen, Seq-Gen manual, with the following parameters from that tree:

seq-gen -o p -m GTR -i 0.01 -f 0.3 0.2 0.2 0.3 -s 0.5 -z <some_seed> -l 2000


  • -o"output_file_format"
  • -m"MODEL"

The problem is that when I use RAxML to obtain a maximum likelihood tree from this sequence, the resulting tree is very different from the one from which the sequence data was generated, namely their RF distance is almost maximum possible. I suspect there is something wrong with my parameters of Seq-Gen. Probably they are not biologically sensible, or I need to use more parameters like -r option to specify substitution rates? (If that matters, I used GTRGAMMA model for RAxML)


Somehow I realized when I change branch lengths of the tree, the problem is mitigated. So my #0 question is how to set branch lengths of a (say ultrametric) phylogeny such that it biologically makes sense? Is there a standard way of doing this? Any software is available for this?

seq-gen phylogenetic tree • 1.4k views
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by rzmn10

Just as a sanity check, could you generate a pairwise distance matrix for the sequences that were generated? Plus remember that seq-gen is probabilistic and a stochastic process may generate sequences that are different from the coalescent tree by pure chance.

ADD REPLYlink written 4.1 years ago by Gabriel R.2.7k
gravatar for Brice Sarver
4.1 years ago by
Brice Sarver3.5k
United States
Brice Sarver3.5k wrote:

Something to be aware of here that's unclear from your question: you give seq-gen a tree (with branches in units of substitutions/site), and then you can simulate sequences along that tree. You're doing this, right? As your update suggests, you are using the -s option to scale your branch lengths before substitution. You probably don't want to do this if you are expecting the -exact- same tree to come out the other end.

When I've used seq-gen for simulation studies, I often find a paper for something interesting and directly specify the rate matrix itself, including any gamma-distributed rate heterogeneity (the -r/a/g options). This ensures that my values are biologically realistic. Make sure you know what data you're getting your parameters from; mitochondrial sequences, for example, evolve more quickly than nuclear sequences.

I've simulated trees under Yule and Birth-Death processes and simulated sequences on these trees, too, and you might want to consider using Tanja Stadler's TreeSim (in R) to make sure you condition appropriately (taxa and/or age) for your speciation/extinction rates, as well.

Also, you need to remove the space between -o and the argument, and the call is something like:

seq-gen [arguments] < [tree.phy] > [sequences.nex]
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Brice Sarver3.5k

I am using 'simulate' command in r8s to construct a tree under Yule process. This generates a topology with all branch lengths the same. Then I used seq-gen to simulate sequences along this tree. Then the tree constructed from this sequence was not similar to the initial topology. What I did to lessen the problem was multiply branch lengths by random numbers between 0 and 1 'manually', and it helped. so now I have two (perhaps stupid) questions: 1- What is the standard practice to modify branch lengths when they are the same? 2- Is specifying -a and -r options in seq-gen eliminates the need to modify branch lengths in my case?

ADD REPLYlink written 4.1 years ago by rzmn10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1177 users visited in the last hour