Question: OrthoFinder software question
7 days ago
Dear all,

I have three species (A, B, and C) and used OrthoFinder to find the genes that are present in A and B but absent in C. According to the results, about 19691 orthogroups (gene families) are present in A and B but absent in C. That seems like too many, but I am not sure what went wrong.

I then ran OrthoFinder on A and C only and looked for genes present in A but absent in C. This gave only 23 orthogroups (running B and C alone also gave only 32), far fewer than when A, B, and C are run together.

So I am quite confused about why this happens and not sure whether the results are reliable. Any suggestions would be much appreciated!


I can certainly explain that there will be differences, but this difference is huge. Are you sure you did not make a technical error somewhere in the processing or the run?

How exactly did you count the absent genes?

Hi, thanks for your reply! No error was reported. Here is the script I used:

module load orthofinder
module load mafft
module load fasttree

orthofinder -f prot_input1 -S diamond -M msa -T fasttree

But I may have made a mistake in interpreting the results. I used the files "Orthogroups.GeneCount.csv" and "Orthogroups.csv". For example, from "Orthogroups.GeneCount.csv" I extracted the orthogroups with a gene count of 0 in species C but > 0 in species A and B, and took these to be the genes present in A and B but absent in C.
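To make the counting step concrete, here is a minimal sketch of the filter described above. It assumes the gene count table is tab-separated, has a header row, holds the orthogroup ID in the first column, and has one column per species. The function name `absent_in_c` and the column names A, B, and C are placeholders for illustration, not anything OrthoFinder defines; the real column names depend on your input FASTA file names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print orthogroup IDs with count > 0 in two species
# and count == 0 in a third.
# Usage: absent_in_c <gene_count_table> <species1> <species2> <absent_species>
# Assumption: the table is tab-separated with a header row naming the species.
absent_in_c() {
  awk -F'\t' -v a="$2" -v b="$3" -v c="$4" '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!(a in col) || !(b in col) || !(c in col)) {
        print "species column not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    # Keep rows present in both species a and b but absent in species c.
    $(col[a]) > 0 && $(col[b]) > 0 && $(col[c]) == 0 { print $1 }
  ' "$1"
}

# Example call (file and species names are assumptions):
# absent_in_c Orthogroups.GeneCount.csv A B C | wc -l
```

Piping the output to `wc -l` gives the number of orthogroups passing the filter, which can be compared directly with the count obtained by hand.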

Then I found a paper that said: "The orthologs and orthogroups between bottlenose catfish and channel catfish were generated using OrthoFinder (42). To obtain the species-specific genes, Protein BLAST (BLASTP) was performed in which genes not included in the orthogroups were queried against the genes in the orthogroups within the same species, with a maximal E-value of 1e-10. A reciprocal BLASTP with a maximal E-value of 1e-5 was used to query genes with no hits from previous steps (9). The genes with no hits to any orthologs were considered as species-specific genes."

So maybe my interpretation is wrong?

7 days ago
Illumina RNAseq is not a good source for determining gene presence / absence. First, one has to sequence really deep to reach saturation. Second, one has to sequence a number of different tissues to find genes with tissue-specific expression. Third, a transcriptome assembly from Illumina short reads is really noisy, with the number of assembled transcripts generally being an order of magnitude higher than the number of "true" genes.

All these factors complicate the analysis you intend to perform, and one should be really cautious, especially about genes being called absent on the basis of RNAseq alone.

Hi, thanks for the reply! I have translated the sequences into protein sequences, and I think we may need to verify the results using BLASTP.

