Question

Running PAML without Tree file?

0

5.3 years ago

sunnykevin97 ▴ 980

Hi

After reading a little about PAML, I understand that calculating dN/dS with codeml requires two input files, both specified in the codeml.ctl control file:

1) MSA in phylip format 2) Tree file.

My question is: I already know the phylogeny of my genomes, and I'm only interested in calculating the dN/dS rate between each of the genomes. How do I do it?

How do I write a tree file manually? Is it possible to write one by looking at the existing phylogeny? Otherwise, is it possible to run codeml without a tree file?

Suggestions please!

sequence alignment • 2.1k views

Updated 5.3 years ago by Joe 21k • written 5.3 years ago by sunnykevin97 ▴ 980

score 1 · Answer 1 · 2019-01-16

1

5.3 years ago

Joe 21k

You need to calculate the tree empirically because things like branch lengths may be important. It would be possible to hand-write a tree with appropriate topology, but without distances it may give you false results.

Furthermore, in an ideal world, the tree you use to describe your data should be derived directly from the accompanying MSA.
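As an illustration of the hand-written option mentioned above, here is a minimal Python sketch that turns a known topology into a Newick string and wraps it in the tree-file layout PAML reads (a first line with the number of species and the number of trees, then the tree). The species names `sp1` to `sp5` and the helper names are hypothetical placeholders; the names in a real tree must match the sequence names in the alignment exactly. No branch lengths are written here, so this sketch only captures the topology.

```python
def to_newick(node):
    """Recursively convert a nested tuple topology into a Newick subtree (no branch lengths)."""
    if isinstance(node, str):
        return node  # a leaf: just the species name
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def count_leaves(node):
    """Count the species (leaves) in the nested topology."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node)


def paml_tree_text(topology):
    """Return tree-file text: '<n_species> <n_trees>' header followed by the Newick tree."""
    return f"{count_leaves(topology)} 1\n{to_newick(topology)};\n"


# Hypothetical unrooted topology (three-way split at the base), species names are placeholders.
topology = (("sp1", "sp2"), ("sp3", "sp4"), "sp5")
tree_text = paml_tree_text(topology)
print(tree_text)
```

The text can then be saved to whatever file the `treefile` line of codeml.ctl points at. Whether codeml uses, ignores, or re-estimates any branch lengths in the tree file is controlled in the control file (the `fix_blength` option), so it is worth checking the PAML manual before relying on a topology-only tree.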

0

Thanks. In total I have 18 genome data sets for which to estimate dN/dS using PAML. First, I'll align all the genomes using CLUSTAL and generate an MSA file in PHYLIP format (are there any good tools that handle big data sets?). But how do I generate a tree file to run PAML? Are there any tools? Is the large number of data sets a problem?

Reply • 5.3 years ago by sunnykevin97 ▴ 980

3

CLUSTAL absolutely will not be able to handle full genome-scale alignments. There are few tools that really can. LASTZ is one of the few tools I've seen that deals with large sequences but even then, 18 is probably too many, and I don't know how big your genomes are (in my experience its alignments are kinda crappy too).

For dN/dS, the CDSs are the only thing that matters anyway, so I think a better approach would be to retrieve all CDSs for each genome, cluster the orthologs together (e.g. via CD-HIT or similar) to generate an alignment and tree, and then calculate a dN/dS for each gene (someone with more experience can absolutely correct me).
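A per-gene run like the one described above needs one control file per ortholog alignment. The following Python sketch generates the text of a basic codeml.ctl for each gene. The gene IDs, file names, and the helper `make_codeml_ctl` are hypothetical, and the option values (one-ratio model, F3x4 codon frequencies, standard genetic code) are only an assumed starting point, not a recommendation for any particular data set. The `pairwise` switch sets `runmode = -2`, which asks codeml for pairwise ML estimates of dN and dS; as far as I know the tree is not used in that mode.

```python
def make_codeml_ctl(seqfile, treefile, outfile, pairwise=False):
    """Build the text of a simple codeml control file (assumed one-ratio setup)."""
    options = {
        "seqfile": seqfile,      # codon alignment in PHYLIP format
        "treefile": treefile,    # Newick tree; names must match the alignment
        "outfile": outfile,      # main results file
        "noisy": 0,
        "verbose": 0,
        "runmode": -2 if pairwise else 0,  # 0: user tree, -2: pairwise comparisons
        "seqtype": 1,            # 1: codon sequences
        "CodonFreq": 2,          # 2: F3x4 codon frequency model
        "model": 0,              # 0: one omega for all branches
        "NSsites": 0,            # 0: one omega for all sites
        "icode": 0,              # 0: universal genetic code
        "fix_kappa": 0,          # estimate kappa
        "kappa": 2,              # initial kappa
        "fix_omega": 0,          # estimate omega (dN/dS)
        "omega": 0.4,            # initial omega
        "cleandata": 0,          # keep sites with gaps/ambiguities
    }
    return "\n".join(f"{key} = {value}" for key, value in options.items()) + "\n"


# Hypothetical gene IDs; each gene gets its own alignment, output, and control text.
genes = ["gene0001", "gene0002"]
ctl_files = {
    gene: make_codeml_ctl(f"{gene}.phy", "species.tree", f"{gene}.out")
    for gene in genes
}
print(ctl_files["gene0001"])
```

Each control text would be written into its own working directory before running codeml there, since codeml also writes fixed-name auxiliary files that would otherwise overwrite each other when many genes are processed in parallel.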

Once you have that you could work out an average value across the genome, or maybe even plot the dN/dS across the sequence to see which regions are more 'evolutionarily active'.
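The genome-wide summary could look something like the sketch below. It assumes the per-gene dN/dS (omega) values have already been parsed out of the codeml output files into a dictionary; the gene names and numbers are made-up placeholders, not real results. The cutoff `max_omega` is also an assumption: extreme values such as 999 typically show up when dS is close to zero, and they would dominate a plain mean if left in.

```python
from statistics import mean, median


def summarize_omega(per_gene_omega, max_omega=10.0):
    """Summarize per-gene dN/dS values, setting aside implausibly large estimates."""
    kept = {gene: w for gene, w in per_gene_omega.items() if 0 <= w <= max_omega}
    values = list(kept.values())
    return {
        "n_genes": len(values),
        "mean": mean(values),
        "median": median(values),
        # Genes excluded by the cutoff, reported so they can be inspected by hand.
        "dropped": sorted(set(per_gene_omega) - set(kept)),
    }


# Placeholder values for illustration only.
example = {"geneA": 0.10, "geneB": 0.30, "geneC": 0.20, "geneD": 999.0}
summary = summarize_omega(example)
print(summary)
```

Reporting the median alongside the mean is a cheap guard against a handful of outlier genes skewing the genome-wide figure.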

It's a few more steps, and will require some heavy-duty parallel processing of all the genes, but it's the only way I can think you'd do it.

Reply • 5.3 years ago by Joe 21k

0

Well, LASTZ is for pairwise comparison, but I'm interested in multiple genome alignment. My genomes are too big.

Suggestions please. Thanks!

Reply • 5.3 years ago by sunnykevin97 ▴ 980

0

Yeah, there's essentially no such thing as a genome-scale multiple alignment tool. Your approach simply isn't possible. Under other circumstances (estimating distance, for instance) I'd suggest you could get by with multiple pairwise alignments, but for dN/dS that isn't the case.

You could select a subset of genes of interest to base your analysis on, but whatever genes you choose will lead to an under- or overestimation of the evolutionary rate, which is why I suggest doing all of them, or as many as possible.