Too Many Genes Under Positive Selection
5
1
Entering edit mode
11.6 years ago
Plantae ▴ 390

Our lab sequenced 7 genomes from the same organism, the divergence time between these genomes is less than 0.5 Mya.
~ 42000 ortholog gene families were constructed from these genomes,
to find positive selected genes,
I use codeml M2a vs M1a to test (df=2) for positive selection,

using a FDR cutoff of 0.01, I got ~16000 (38%) genes that are under positive selction,
does codeml suitable for our dataset (genes come from different strains of the same species)?

the control file i specified is:

seqfile = input.seq
treefile = input.tree
outfile = mlc

noisy = 3
verbose = 1
runmode = 0

seqtype = 1
CodonFreq = 2
clock = 0
model = 1 2

NSsites = 2
icode = 0
fix_omega = 0
omega = .9
fix_kappa = 0
kappa = .3
cleandata = 1``
paml codeml selection • 4.9k views
ADD COMMENT
0
Entering edit mode

How did you obtain individual CDS from 14000 orthologs after assembling the genomes? I'm trying to do something similar using codeml, but I only know how to obtain single genes at a time. Recently I obtained genome data so I would like to obtain all CDS and apply codeml to all of them - would you mind sharing a bit of how you process such a vast number? I use bwa mem for my assembly.

ADD REPLY
4
Entering edit mode
11.6 years ago

Have a look at the alignments. Most of the times, weird dN/dS scores are caused by errors in the alignment.

ADD COMMENT
2
Entering edit mode
11.6 years ago
Bioch'Ti ★ 1.1k

Hi, I agree with Giovanni, but you should also check for paralogs... which will inflate the ratio. Good luck

ADD COMMENT
2
Entering edit mode
11.6 years ago

This article by Will and Ziheng I think does a good job at exploring and highlighting common (alignment-based) sources of error for analyses of this kind. It would probably be useful for you to look through it.

ADD COMMENT
1
Entering edit mode
11.6 years ago
jprmachado ▴ 80

Hi,

agree with both previous replies. Consider also to filter de alignment. Try Gblocks for example, but there is more tool such as Gblocks available

Good luck

ADD COMMENT
1
Entering edit mode
11.6 years ago
Rahul Sharma ▴ 660

HI,

Finding positively selected genes is very tricky. As you have mentioned the site model (M2 and M1) comparison, have you tried other models M8 and M7? I would also try the Branch-site model (Test2) of codeml and later statistics with LRT, BC and FDR. Then would use only those genes, which are having at least one positively selected site with BEB confidence >95% or >99%. In my analysis, I first used the prank-codon alignments for MSA of the orthologs. Branch site model --> LRT, BC and FDR(Picked genes with 1% FDR) ---> Check the genes having atleast one site with >99% BEB site --> Finally got 6%, 4%, 3% and 2% of positively selected genes in four genomes. This paper I found very interesting: http://petrov.stanford.edu/pdfs/77.pdf

Regards, Rahul

ADD COMMENT
0
Entering edit mode

Sorry, but what is BC?

  Sincerely,
         Kang
ADD REPLY
0
Entering edit mode

I asked users above to but how do you obtain all the protein-coding sequences after assembling the genome? I want to try something similar to you all with codeml, but so far I only know how to assemble my genome data using a CDS reference sequence...

ADD REPLY

Login before adding your answer.

Traffic: 920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6