What parameters should be used for Seq-Gen?
1
1
Entering edit mode
7.8 years ago
rzmn ▴ 10

I am CS major with little biology background. I am doing some experiments, and at some point, I need to simulate DNA alignments along a phylogenetic tree (which is produced under Yule process using r8s program). I was able to produce "some" sequence using Seq-Gen, Seq-Gen manual, with the following parameters from that tree:

seq-gen -o p -m GTR -i 0.01 -f 0.3 0.2 0.2 0.3 -s 0.5 -z <some_seed> -l 2000

where

  • -o"output_file_format"
  • -m"MODEL"
  • -i"PROPORTION_INVARIABLE"
  • -f"STATE_FREQUENCIES"
  • -s"BRANCH_LENGTH_SCALE"
  • -z"RANDOM_NUMBER_SEED"
  • -l"SEQUENCE_LENGTH"

The problem is that when I use RAxML to obtain a maximum likelihood tree from this sequence, the resulting tree is very different from the one from which the sequence data was generated, namely their RF distance is almost maximum possible. I suspect there is something wrong with my parameters of Seq-Gen. Probably they are not biologically sensible, or I need to use more parameters like -r option to specify substitution rates? (If that matters, I used GTRGAMMA model for RAxML)

Update:

Somehow I realized when I change branch lengths of the tree, the problem is mitigated. So my #0 question is how to set branch lengths of a (say ultrametric) phylogeny such that it biologically makes sense? Is there a standard way of doing this? Any software is available for this?

phylogenetic tree seq-gen • 2.2k views
ADD COMMENT
0
Entering edit mode

Just as a sanity check, could you generate a pairwise distance matrix for the sequences that were generated? Plus remember that seq-gen is probabilistic and a stochastic process may generate sequences that are different from the coalescent tree by pure chance.

ADD REPLY
1
Entering edit mode
7.8 years ago
Brice Sarver ★ 3.8k

Something to be aware of here that's unclear from your question: you give seq-gen a tree (with branches in units of substitutions/site), and then you can simulate sequences along that tree. You're doing this, right? As your update suggests, you are using the -s option to scale your branch lengths before substitution. You probably don't want to do this if you are expecting the -exact- same tree to come out the other end.

When I've used seq-gen for simulation studies, I often find a paper for something interesting and directly specify the rate matrix itself, including any gamma-distributed rate heterogeneity (the -r/a/g options). This ensures that my values are biologically realistic. Make sure you know what data you're getting your parameters from; mitochondrial sequences, for example, evolve more quickly than nuclear sequences.

I've simulated trees under Yule and Birth-Death processes and simulated sequences on these trees, too, and you might want to consider using Tanja Stadler's TreeSim (in R) to make sure you condition appropriately (taxa and/or age) for your speciation/extinction rates, as well.

Also, you need to remove the space between -o and the argument, and the call is something like:

seq-gen [arguments] < [tree.phy] > [sequences.nex]
ADD COMMENT
0
Entering edit mode

I am using 'simulate' command in r8s to construct a tree under Yule process. This generates a topology with all branch lengths the same. Then I used seq-gen to simulate sequences along this tree. Then the tree constructed from this sequence was not similar to the initial topology. What I did to lessen the problem was multiply branch lengths by random numbers between 0 and 1 'manually', and it helped. so now I have two (perhaps stupid) questions: 1- What is the standard practice to modify branch lengths when they are the same? 2- Is specifying -a and -r options in seq-gen eliminates the need to modify branch lengths in my case?

ADD REPLY

Login before adding your answer.

Traffic: 2069 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6