
Question

Interpreting jModelTest2 results using transposon sequences

6.9 years ago

mcsimenc ▴ 20

Dear Biostars community,

I want to estimate the divergence between the two long terminal repeats of a single LTR retrotransposon, which are assumed to be identical when a new element is inserted. I have estimated substitutions per site using baseml, selecting a model of sequence evolution more or less at random, but I want to justify which model of sequence evolution to use. My approach has been:

1. Infer a tree from a set of presumably related LTR retrotransposons using protein coding domains (e.g. reverse transcriptase, integrase)
2. Select a monophyletic group of LTR retrotransposons and make an alignment of the long terminal repeats
3. Graft the long terminal repeats onto the terminal taxa in the tree inferred in step 1
4. Run jModelTest2 using the alignment from step 2 and the tree from step 3
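Before running model selection with a user-supplied tree, it can help to confirm that every tip in the grafted tree has a matching sequence in the LTR alignment. The following is a minimal sketch, assuming a Newick tree with unquoted labels and a FASTA alignment; the function names and the element names in the usage lines are hypothetical and are not part of jModelTest2 or PAML.

```python
import re


def newick_tip_labels(newick):
    """Return the set of tip labels in a Newick string (unquoted labels only).

    A tip label is taken to be any name that directly follows '(' or ','.
    Internal node labels follow ')' and are therefore ignored.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def fasta_names(fasta_text):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def compare_names(newick, fasta_text):
    """Return (tips missing from the alignment, sequences missing from the tree)."""
    tips = newick_tip_labels(newick)
    seqs = fasta_names(fasta_text)
    return sorted(tips - seqs), sorted(seqs - tips)


# Hypothetical tree and alignment, for illustration only.
tree = "((elem1_LTR5:0.1,elem1_LTR3:0.1):0.2,elem2_LTR5:0.3);"
alignment = ">elem1_LTR5\nACGT\n>elem1_LTR3\nACGA\n>elem2_LTR3\nACGG\n"
missing_from_alignment, missing_from_tree = compare_names(tree, alignment)
print(missing_from_alignment, missing_from_tree)
```

Any name reported in either list would need to be fixed in the tree or the alignment before the two files are used together.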

I did this and got a best-supported model, but when I ran jModelTest2 again using the tree from step 1 and an alignment of the protein coding domain used to infer that tree, I got different best-supported models. My thought is that the model estimated as best from the tree and alignment of LTRs is the one I should use, but I am unsure if there is something I'm missing. Maybe I'm going about this the wrong way. Any insights or comments are appreciated, thank you!
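Model selection programs typically rank candidate models with information criteria such as AIC and BIC. These are computed from the maximized log-likelihood on one particular alignment, so scores are only comparable between models fitted to the same data. A best model for the protein coding domains therefore need not match the best model for the LTRs. Below is a minimal sketch of the ranking arithmetic; the log-likelihoods and parameter counts are made-up values for illustration, not output from jModelTest2.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def rank_by_aic(models):
    """Rank models given as {name: (lnL, k)}.

    Returns a list of (name, AIC, delta AIC) sorted from best to worst.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    ranked = [(name, score, score - best) for name, score in scores.items()]
    return sorted(ranked, key=lambda row: row[1])


# Hypothetical fits to ONE alignment (lnL, number of free parameters).
candidates = {
    "JC": (-1700.0, 0),
    "HKY": (-1690.0, 4),
    "TN93+G": (-1685.0, 6),
}
for name, score, delta in rank_by_aic(candidates):
    print(f"{name:8s} AIC={score:.1f} dAIC={delta:.1f}")
```

In practice the parameter count should include every free parameter of the fit (branch lengths too); they are left out of this toy example because they add the same constant to every model on a fixed tree.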

jmodeltest2 transposon long terminal repeats model • 1.3k views

updated 3.8 years ago by liangna19911314 • 0 • written 6.9 years ago by mcsimenc ▴ 20

score 0 · Answer 1 · 2020-07-08

Hi, I am new to baseml and am using it to calculate substitutions per site, but I have run into some problems when analysing the results. Could you give me some help? My control file (baseml.ctl) is:

  seqfile = seq.fas-gb
 treefile = enhancertree.txt

  outfile = mlb       * main result file
    noisy = 3   * 0,1,2,3: how much rubbish on the screen
  verbose = 0   * 1: detailed output, 0: concise output
  runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                * 3: StepwiseAddition; (4,5):PerturbationNNI 

   model = 6   * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
                * 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu

    Mgene = 0   * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff

    ndata = 1
    clock = 2   * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis; a rooted tree should be used under clock = 1, 2, 3
fix_kappa = 0   * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
    kappa = 2.5   * initial or fixed kappa

fix_alpha = 0   * 0: estimate alpha; 1: fix alpha at value below
    alpha = 0.5   * initial or fixed alpha, 0:infinity (constant rate)
   Malpha = 0   * 1: different alpha's for genes, 0: one alpha
    ncatG = 8   * # of categories in the dG, AdG, or nparK models of rates
    nparK = 0   * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK 

    nhomo = 1   * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2
    getSE = 0   * 0: don't want them, 1: want S.E.s of estimates

RateAncestor = 1 * (0,1,2): rates (alpha>0) or ancestral states

Small_Diff = 7e-6
cleandata = 1   * remove sites with ambiguity data (1:yes, 0:no)?
*   icode = 0   * (with RateAncestor=1. try "GC" in data,model=4,Mgene=4)
* fix_blength = 0   * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
   method = 0   * Optimization method 0: simultaneous; 1: one branch a time
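A control file in this `key = value * comment` layout is easy to check programmatically before a run. The sketch below is a hypothetical helper, not part of PAML: it reads the settings into a dictionary and flags the clock settings for which, according to the comment in the control file above, a rooted tree should be used.

```python
def parse_ctl(text):
    """Parse 'key = value * comment' lines into a dict of strings.

    Everything after the first '*' on a line is treated as a comment,
    so lines that start with '*' (commented-out settings) are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("*", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def needs_rooted_tree(settings):
    """True when clock is 1, 2 or 3 (the clock models named in the file)."""
    return settings.get("clock") in {"1", "2", "3"}


# A short excerpt in the same layout as the control file above.
example = """
    model = 6   * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
    clock = 2   * 0:no clock, 1:clock; 2:local clock
*   icode = 0   * commented out, so it is ignored
"""
settings = parse_ctl(example)
print(settings)
print("rooted tree needed:", needs_rooted_tree(settings))
```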

My results are as follows:

(1) Homogeneity statistic: X2 = 0.18360 G = 0.18650

Average 0.30496 0.19828 0.26761 0.22915

constant sites: 105 (27.63%)

ln Lmax (unconstrained) = -1678.510665

Distances: TN93 (kappa) (alpha set at 0.50)
This matrix is not used in later m.l. analysis.

(2) Detailed output identifying parameters:

rates for branches: 1 0.04432
rate (kappa or abcde) under TN93: 3.10512 3.05480
Base frequencies: 0.24163 0.24431 0.24245 0.27161
alpha (gamma, K=8) = 3.31196
rate: 0.30768 0.51640 0.67190 0.82302 0.98635 1.18100 1.44935 2.06429
freq: 0.12500 0.12500 0.12500 0.12500 0.12500 0.12500 0.12500 0.12500
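A few quick arithmetic checks can show whether output like this is internally consistent. The sketch below uses only the numbers quoted above. It relies on one assumption worth stating: in the discrete-gamma model the category rates are scaled so that their frequency-weighted mean is 1.

```python
# Values copied from the baseml output quoted above.
gamma_rates = [0.30768, 0.51640, 0.67190, 0.82302,
               0.98635, 1.18100, 1.44935, 2.06429]
gamma_freqs = [0.125] * 8
base_freqs = [0.24163, 0.24431, 0.24245, 0.27161]
constant_sites = 105
constant_fraction = 0.2763  # reported as 27.63%

# Frequency-weighted mean of the gamma category rates (expected to be ~1).
mean_rate = sum(r * f for r, f in zip(gamma_rates, gamma_freqs))

# Base frequencies should sum to ~1.
base_freq_total = sum(base_freqs)

# Alignment length implied by the constant-site count and percentage.
approx_length = round(constant_sites / constant_fraction)

print(f"mean gamma rate: {mean_rate:.5f}")
print(f"base frequency total: {base_freq_total:.5f}")
print(f"implied alignment length: {approx_length} sites")
```

These checks only test that the reported numbers agree with each other; they do not show that the chosen model or clock setting is appropriate for the data.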

Is my ctl file correct, and is there anything wrong with my output file? If the results are fine, how can I calculate the substitution rate from this information?

Thanks in advance.
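For the two LTRs of a single element, a common way to relate divergence, rate and time is T = K / (2r), where K is the number of substitutions per site between the two LTRs and r is the substitution rate per site per year. The same relation gives r = K / (2T) when the insertion time is known. The sketch below computes K under the Kimura two-parameter (K80) model for one aligned pair and applies that relation. It assumes a strict clock with both LTRs evolving at the same rate, and the sequences and the rate are made-up illustrative values.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (compared sites, transitions, transversions) for an aligned pair."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def k80_distance(seq1, seq2):
    """K80 distance: -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    sites, ts, tv = count_differences(seq1, seq2)
    p, q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time(k, rate_per_site_per_year):
    """T = K / (2r): both LTRs accumulate substitutions since insertion."""
    return k / (2 * rate_per_site_per_year)


def substitution_rate(k, time_years):
    """r = K / (2T): the same relation solved for the rate."""
    return k / (2 * time_years)


# Hypothetical aligned LTR pair and a hypothetical rate, for illustration.
ltr5 = "ACGTACGTAC"
ltr3 = "ACGTACGTAT"
k = k80_distance(ltr5, ltr3)
print(f"K = {k:.6f} substitutions per site")
print(f"T = {insertion_time(k, 1e-8):.0f} years at r = 1e-8")
```

A model-based estimate from baseml can be substituted for the K80 value of K; the conversion step stays the same.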