Question: Many paralogous genes in GO enrichment results
7 weeks ago, tianshenbio wrote:

I performed a GO enrichment analysis on a list of DE genes and ended up with ~20 enriched GO terms. When I check the DE genes under each GO term, almost all of them have the same name or are isoforms of one another, so it is no surprise that they share the same 'GO labels'. What I expected from a GO enrichment test was a spectrum of different genes from a common category clustering together, but that is not what I see.

I am totally lost at this point. Does it mean my result is not informative at all since it only reflects the DE pattern of a very limited number of genes? Is there a way to improve this?

Tags: go, rna-seq, enrichment, gene

Based on what you describe, it would seem that the outcome is biased by redundancy, i.e. the same gene is represented by multiple IDs. This could be the case if you're working with transcript IDs. Try working with gene IDs instead.

written 7 weeks ago by Jean-Karim Heriche

That's true. The problem is that I am working with a non-model organism. In the genome GFF file I created, isoforms are annotated as distinct genes with distinct IDs, so I don't know if I could pool them under the same gene. That would be hard to do.

written 7 weeks ago by tianshenbio

"So I don't know if I could pool them under the same gene"

Yes, you can. You could try clustering sequences/isoforms into "genes".

written 7 weeks ago by Jean-Karim Heriche

I am not sure how to do that. They have been annotated as distinct 'genes' in the GFF file, with no isoform tags or anything else I can use to distinguish them. The only thing I know is that they usually have adjacent ID numbers and are annotated with the same or similar gene names. I can't cluster them on names alone, since that would also merge distantly located paralogs.

written 7 weeks ago by tianshenbio
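
The two clues mentioned above (adjacent ID numbers plus identical names) could be combined into a rough grouping heuristic, which would keep distant paralogs apart because their ID numbers are far from each other. The following is only a minimal sketch under assumptions that are not in the thread: the IDs are assumed to end in a number (e.g. a hypothetical "g0001"), names are matched exactly after lower-casing (so "similar" names are not merged), and the function name and the max_gap parameter are made up for illustration.

```python
import re


def group_adjacent_same_name(records, max_gap=1):
    """Group features that share a name and have (nearly) consecutive ID numbers.

    records: iterable of (feature_id, gene_name) pairs.
    Returns a dict mapping each feature_id to a group label
    (the ID of the first member of its group).
    Assumption: the numeric suffix of the ID reflects genomic order.
    """
    def id_number(fid):
        match = re.search(r"(\d+)$", fid)
        return int(match.group(1)) if match else None

    group_of = {}
    numbered = []
    for fid, name in records:
        num = id_number(fid)
        if num is None:
            group_of[fid] = fid  # no numeric suffix: keep as its own group
        else:
            numbered.append((num, fid, name.strip().lower()))

    prev_num = prev_name = label = None
    for num, fid, name in sorted(numbered):
        # Start a new group when the name changes or the ID numbers are far apart.
        if label is None or name != prev_name or num - prev_num > max_gap:
            label = fid
        group_of[fid] = label
        prev_num, prev_name = num, name
    return group_of


# Hypothetical records: two adjacent "hsp70" entries and one distant paralog.
example = [("g0001", "hsp70"), ("g0002", "hsp70"),
           ("g0003", "actin"), ("g0450", "hsp70")]
groups = group_adjacent_same_name(example)
```

In this toy input the two adjacent "hsp70" entries end up in one group, while the distant "hsp70" entry stays separate. Whether adjacent ID numbers really track genomic neighbourhood depends on how the annotation was produced, so the groups would need checking against the coordinates.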

Even if you don't have the sequences, the GFF file should contain start and end positions so you could cluster things that overlap. Hard to tell what's possible or not without knowing what kind of data you have access to.

written 7 weeks ago by Jean-Karim Heriche
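
The overlap-based clustering suggested in the last reply could look roughly like the sketch below. It is an illustration, not a tested pipeline: it assumes a tab-separated GFF3 file whose rows of type "gene" carry an ID attribute, treats features on the same sequence and strand as one locus when their coordinates overlap, and uses made-up function names and locus labels. The example rows are invented.

```python
from collections import defaultdict


def parse_gff_genes(lines, feature_type="gene"):
    """Yield (seqid, strand, start, end, feature_id) for matching GFF3 rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield cols[0], cols[6], int(cols[3]), int(cols[4]), attrs.get("ID", "")


def cluster_overlapping(features):
    """Map each feature ID to a locus label shared by all overlapping features."""
    by_group = defaultdict(list)
    for seqid, strand, start, end, fid in features:
        by_group[(seqid, strand)].append((start, end, fid))

    locus_of = {}
    n = 0
    for key in sorted(by_group):
        current_end = None
        for start, end, fid in sorted(by_group[key]):
            # GFF coordinates are 1-based and inclusive, so start == end overlaps.
            if current_end is None or start > current_end:
                n += 1  # no overlap with the running interval: new locus
                current_end = end
            else:
                current_end = max(current_end, end)
            locus_of[fid] = f"locus{n}"
    return locus_of


# Invented example rows: g1 and g2 overlap, g3 is a distant same-name paralog,
# g4 overlaps g1/g2 in position but lies on the opposite strand.
example_gff = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1;Name=abc",
    "chr1\tsrc\tgene\t300\t800\t.\t+\t.\tID=g2;Name=abc",
    "chr1\tsrc\tgene\t2000\t2500\t.\t+\t.\tID=g3;Name=abc",
    "chr1\tsrc\tgene\t400\t600\t.\t-\t.\tID=g4;Name=xyz",
]
locus_of = cluster_overlapping(parse_gff_genes(example_gff))

# Collapse a DE list (hypothetical IDs) to unique loci before enrichment.
de_ids = ["g1", "g2", "g3"]
de_loci = {locus_of[g] for g in de_ids}
```

For the enrichment itself, the background gene set would need to be collapsed to loci in the same way as the DE list, otherwise the redundancy only disappears from one side of the test. Requiring the same strand is a design choice; dropping the strand from the grouping key would also merge overlapping features on opposite strands.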