Question: How to group genes in a single genome into gene families?
liorglic wrote (12 months ago):

Hello,
I'd like to group the genes of a specific genome (tomato, in my case) into gene families. I am using sequence similarity rather than domain content, specifically with OrthoFinder.
The input is the whole proteome of the species. Interestingly, only ~18% of genes end up grouped into gene families; the rest are singletons. This contrasts with most studies and databases I have seen, where most genes (sometimes ~80%) belong to gene families. In most cases, though, people use multiple genomes from various species rather than a single genome. I suspect this explains the low extent of clustering, since adding more genomes can create new graph edges.
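To illustrate the suspicion about graph edges, here is a minimal sketch in which similarity hits are treated as edges and families as connected components. This is a simplification and an assumption: OrthoFinder clusters with MCL on normalized scores, not with plain connected components, and the gene IDs and hits below are made up.

```python
from collections import defaultdict

def cluster(genes, edges):
    """Group genes into connected components with a small union-find."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in edges:
        # Ignore hits that involve a gene outside the current input set.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    families = defaultdict(set)
    for g in genes:
        families[find(g)].add(g)
    return list(families.values())

def fraction_in_families(families, focal_genes):
    """Fraction of focal genes that share a cluster with another focal gene."""
    focal = set(focal_genes)
    grouped = sum(len(f & focal) for f in families if len(f & focal) >= 2)
    return grouped / len(focal)

# Hypothetical IDs: four tomato genes and one potato gene.
tomato = ["Sl1", "Sl2", "Sl3", "Sl4"]
potato = ["St1"]
# Sl3 and Sl4 have no direct hit to each other, but both hit St1.
hits = [("Sl1", "Sl2"), ("Sl3", "St1"), ("Sl4", "St1")]

single = cluster(tomato, hits)
multi = cluster(tomato + potato, hits)
print(fraction_in_families(single, tomato))  # 0.5
print(fraction_in_families(multi, tomato))   # 1.0
```

In this toy case the potato gene acts as a bridge between two tomato genes that are otherwise singletons, which is the effect described above.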
My question is: what is the right way to achieve my goal, which is simply to cluster the genes of a single genome? Should I:
1) Stick with what I'm doing, since this is correct from the perspective of a single genome?
2) Include proteins from other species in the analysis? This seems a bit strange, and it would make my results depend on which species I choose to include.
3) Use a strategy that assigns genes to pre-defined gene families rather than clustering from scratch?
4) Something else?

Thanks!

Renesh wrote (12 months ago):

You need to use a domain-based database (Pfam) to identify gene families. Highly conserved domains define protein function and can be used to classify protein-coding genes into gene families. Conserved signature domains can detect divergent or distantly related homologs that sequence-similarity tools such as BLAST would miss. A domain-based search would therefore identify more genes belonging to gene families than a BLAST-based homology search.

Read this manuscript: https://www.biorxiv.org/content/early/2019/08/28/272187.full.pdf

Web Tool: http://mandadilab.webfactional.com/home/
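As a rough sketch of the domain-based grouping described in this answer, the snippet below collects proteins by domain from per-domain hit lines. It assumes the input follows the whitespace-delimited layout of HMMER's `hmmscan --domtblout` output (domain name in column 1, protein name in column 4, independent E-value in column 13). The domain names, gene IDs, and the `1e-5` cutoff are illustrative assumptions, not values from the answer or the linked tool.

```python
from collections import defaultdict

def families_from_domtblout(lines, max_evalue=1e-5):
    """Map each domain name to the set of proteins with a hit at or below max_evalue."""
    families = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        domain, protein = fields[0], fields[3]
        i_evalue = float(fields[12])  # independent E-value of this domain hit
        if i_evalue <= max_evalue:
            families[domain].add(protein)
    return dict(families)

# Made-up hit lines in the assumed column layout.
example = [
    "# target name accession tlen query name ...",
    "DomA PF_A 200 geneA - 400 1e-50 170.0 0.0 1 1 1e-53 2e-50 169.0 0.0 1 200 50 250 50 250 0.95 Hypothetical domain A",
    "DomA PF_A 200 geneB - 380 1e-40 140.0 0.0 1 1 1e-43 3e-40 139.0 0.0 1 200 30 230 30 230 0.93 Hypothetical domain A",
    "DomA PF_A 200 geneC - 500 0.2 10.0 0.0 1 1 0.1 0.5 9.0 0.0 1 60 10 70 10 70 0.60 Hypothetical domain A",
    "DomB PF_B 150 geneC - 500 1e-30 110.0 0.0 1 1 1e-33 1e-30 109.0 0.0 1 150 100 250 100 250 0.97 Hypothetical domain B",
]

families = families_from_domtblout(example)
print(families)
```

Note that a multi-domain protein would appear under several domains with this scheme, so how to turn domain membership into non-overlapping families is a separate decision that this sketch does not make.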


Thanks for the quick answer. I'm aware of the domain-based approach and will certainly try it as well. My research focuses on gene duplication and family size dynamics, so I wonder whether a similarity-based analysis might be more informative in my case. What do you think? What advice (if any) would you give for applying the similarity approach?

Reply written 12 months ago (modified 12 months ago) by liorglic

As I mentioned in my answer, you can use the similarity approach (BLAST) to identify gene families, but you will miss many genes that could be grouped into families. The similarity approach is not able to identify divergent or distantly related homologs effectively. The domain-based approach is the best choice. To see this for yourself, you can try both approaches and compare the results. You will group more related genes into gene families with the domain-based approach than with the similarity approach.
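One way to carry out the suggested comparison is sketched below. It assumes each approach has been reduced to a gene-to-family mapping with exactly one family per gene (a simplification for multi-domain proteins), and all gene and family labels are hypothetical.

```python
from collections import Counter

def genes_in_families(assignment, min_size=2):
    """Return the genes whose assigned family has at least min_size members."""
    sizes = Counter(assignment.values())
    return {gene for gene, fam in assignment.items() if sizes[fam] >= min_size}

def compare(similarity, domain):
    """Split grouped genes by which approach placed them in a family."""
    sim = genes_in_families(similarity)
    dom = genes_in_families(domain)
    return {
        "both": sim & dom,
        "similarity_only": sim - dom,
        "domain_only": dom - sim,
    }

# Hypothetical assignments for four genes.
similarity = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG3"}
domain = {"g1": "DomA", "g2": "DomA", "g3": "DomA", "g4": "DomB"}

result = compare(similarity, domain)
for label, genes in result.items():
    print(label, sorted(genes))
```

Genes listed under `domain_only` are the ones a similarity-only clustering leaves as singletons, so inspecting that set on real data would show how much the two approaches actually differ for a given genome.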

Reply written 12 months ago by Renesh