Too Many Genes Under Positive Selection
5
1
Entering edit mode
9.9 years ago
Plantae ▴ 390

Our lab sequenced 7 genomes from the same organism, the divergence time between these genomes is less than 0.5 Mya.
~ 42000 ortholog gene families were constructed from these genomes,
to find positive selected genes,
I use codeml M2a vs M1a to test (df=2) for positive selection,

using a FDR cutoff of 0.01, I got ~16000 (38%) genes that are under positive selction,
does codeml suitable for our dataset (genes come from different strains of the same species)?

the control file i specified is:

seqfile = input.seq
treefile = input.tree
outfile = mlc

noisy = 3
verbose = 1
runmode = 0

seqtype = 1
CodonFreq = 2
clock = 0
model = 1 2

NSsites = 2
icode = 0
fix_omega = 0
omega = .9
fix_kappa = 0
kappa = .3
cleandata = 1

paml codeml selection • 4.2k views
0
Entering edit mode

How did you obtain individual CDS from 14000 orthologs after assembling the genomes? I'm trying to do something similar using codeml, but I only know how to obtain single genes at a time. Recently I obtained genome data so I would like to obtain all CDS and apply codeml to all of them - would you mind sharing a bit of how you process such a vast number? I use bwa mem for my assembly.

4
Entering edit mode
9.9 years ago

Have a look at the alignments. Most of the times, weird dN/dS scores are caused by errors in the alignment.

2
Entering edit mode
9.9 years ago
Bioch'Ti ★ 1.1k

Hi, I agree with Giovanni, but you should also check for paralogs... which will inflate the ratio. Good luck

2
Entering edit mode
9.9 years ago

This article by Will and Ziheng I think does a good job at exploring and highlighting common (alignment-based) sources of error for analyses of this kind. It would probably be useful for you to look through it.

1
Entering edit mode
9.9 years ago

Hi,

agree with both previous replies. Consider also to filter de alignment. Try Gblocks for example, but there is more tool such as Gblocks available

Good luck

1
Entering edit mode
9.9 years ago
Rahul Sharma ▴ 650

HI,

Finding positively selected genes is very tricky. As you have mentioned the site model (M2 and M1) comparison, have you tried other models M8 and M7? I would also try the Branch-site model (Test2) of codeml and later statistics with LRT, BC and FDR. Then would use only those genes, which are having at least one positively selected site with BEB confidence >95% or >99%. In my analysis, I first used the prank-codon alignments for MSA of the orthologs. Branch site model --> LRT, BC and FDR(Picked genes with 1% FDR) ---> Check the genes having atleast one site with >99% BEB site --> Finally got 6%, 4%, 3% and 2% of positively selected genes in four genomes. This paper I found very interesting: http://petrov.stanford.edu/pdfs/77.pdf

Regards, Rahul

0
Entering edit mode

Sorry, but what is BC?

  Sincerely,
Kang

0
Entering edit mode

I asked users above to but how do you obtain all the protein-coding sequences after assembling the genome? I want to try something similar to you all with codeml, but so far I only know how to assemble my genome data using a CDS reference sequence...