I have about 200 genes from one species, and I want to study the evolutionary forces acting on them (possibly positive selection) using their orthologs in 20 sequenced genomes.
As a first step, I ran an all-vs-all BLAST across these 20 species and tried to classify the genes into gene families, from which I can extract the orthologs and paralogs of my genes. I used two different programs for this. (1) OrthoMCL, which is based on graph-based heuristics and gives nicely categorized output in the form of orthologs, inparalogs and co-orthologs. I first ran OrthoMCL with the default parameters (e-value 1e-5 and percent identity 50%). It seems to work fine for genes in large families but not for genes in small families. I then varied the OrthoMCL parameters in both directions, but this had little effect: members of the small families were still split into different families. Please note that when I talk about small and large gene families and family size, I am referring to known, well-characterized gene families in my genome.
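To make the thresholding step concrete, here is a minimal sketch of how I understand the two cutoffs being applied to the all-vs-all hits. It assumes the BLAST results are in the standard 12-column tabular format (-outfmt 6); the gene names and hit values are made up for illustration, and `filter_hits` is my own hypothetical helper, not part of OrthoMCL.

```python
import csv

# Column positions in BLAST tabular output (-outfmt 6).
QUERY, SUBJECT, PIDENT, EVALUE = 0, 1, 2, 10

def filter_hits(lines, max_evalue=1e-5, min_identity=50.0):
    """Return (query, subject) pairs passing both cutoffs, skipping self-hits."""
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if row[QUERY] == row[SUBJECT]:
            continue  # a gene always hits itself; not informative
        if float(row[EVALUE]) <= max_evalue and float(row[PIDENT]) >= min_identity:
            kept.append((row[QUERY], row[SUBJECT]))
    return kept

# Hypothetical hits, invented for this example only.
example = [
    "geneA\tgeneA\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "geneA\tgeneB\t72.5\t290\t80\t2\t5\t294\t3\t292\t1e-120\t410",
    "geneA\tgeneC\t38.0\t150\t93\t4\t10\t159\t20\t169\t2e-12\t70.1",
    "geneA\tgeneD\t55.0\t40\t18\t0\t1\t40\t1\t40\t0.5\t25.0",
]
kept = filter_hits(example)
print(kept)
```

In this toy input only the geneA/geneB hit survives: geneC fails the identity cutoff and geneD fails the e-value cutoff. A divergent but genuine family member like geneC would be dropped by a single global identity cutoff, which may be related to what I see for the small families.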
So I decided to try another tool: (2) SiLiX, which simply clusters the genes into families and is based on finding similarity across a linked network. But this does not seem to work well for the large families, because it keeps pulling in more and more sequences through shared domains. For example, for a family that actually has 74 genes, SiLiX reports a family of 5000 genes. OrthoMCL gave correct results in these cases.
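My understanding of why the large families blow up is sketched below, assuming that "similarity across a linked network" behaves like single-linkage clustering: any chain of pairwise hits puts all genes on the chain into one family. This is not SiLiX's actual code. The gene names, the coverage values and the 0.8 coverage cutoff are all hypothetical, chosen only to show the chaining effect of a multi-domain gene.

```python
from collections import defaultdict

def single_linkage(edges):
    """Group genes into families: any path of hits joins genes together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    return sorted(sorted(members) for members in families.values())

# Hypothetical hits: (gene1, gene2, fraction of the sequences covered).
# M is a multi-domain gene sharing one domain with the A family
# and a different domain with the B family.
hits = [
    ("A1", "A2", 0.95),
    ("A2", "M", 0.30),
    ("M", "B1", 0.35),
    ("B1", "B2", 0.90),
]

# No coverage requirement: M bridges the two families into one.
merged = single_linkage((a, b) for a, b, cov in hits)

# Assumed coverage cutoff of 0.8: the domain-only hits to M are dropped.
# M then has no remaining edges, so it does not appear in the output.
split = single_linkage((a, b) for a, b, cov in hits if cov >= 0.8)

print(merged)
print(split)
```

With no coverage requirement the two families collapse into a single family of five genes, while the coverage filter keeps them apart. If this picture is right, a stricter overlap requirement might limit the growth of the large families, but I do not know whether one setting would suit all families.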
Here is one paper that shows that different gene families require different program parameters for correct resolution, and that gives a strategy for classifying gene families in newly sequenced species using information from known gene families in model species. But using their approach for my work has two drawbacks. First, it is not feasible to do this for 200 genes. Second, it does not address the case where a gene family has not been well classified before, and not all gene families are well characterized.
Can someone help me by suggesting the best approach to classify my genes into gene families? Since different gene families require different parameters, will it work if I apply a single stringency criterion to all the genes?
I hope my question is clear. Kindly let me know if I have misunderstood something about selection studies.