Question

Length bias in resequenced DNA GO analysis

0

12 months ago

dthorbur ★ 1.9k

I received comments from reviewers on a manuscript. One comment asks how gene length was accounted for in the GO enrichment analyses.

I understand that differential expression analysis of RNA-seq data can be affected by gene length bias, which can carry over into GO enrichment analyses. However, I conducted a genome-wide scan for selection using DNA resequencing data. We identified outlier genes putatively under selection from multiple population genomic metrics (Tajima's D, NCD, Pi, etc.) computed in sliding genomic windows within each discrete population.

Do I need to account for gene length in my GO analyses? I can't find any literature on gene length bias in GO analyses on DNA data where we are not dealing with read counts.

I don't think so. My logic is that it doesn't matter whether we find multiple signals in longer genes, since whether a gene is an outlier is a boolean value. Are we more likely to find a signal of selection in longer genes due to drift? Maybe, but we have neutral simulations showing that our thresholds didn't identify any neutral signals in silico.

ontology GO bias DNA • 737 views

1

12 months ago

Istvan Albert 100k

In my opinion, GO analyses in general are beset by so many other problems that length correction should be one of your least concerns.

It is much better to explore multiple approaches and reconcile those rather than trying to be really really "accurate" with one.

score 3 · Accepted Answer · 2023-04-04

First, I agree with Istvan Albert on the general limitations of ontology over-representation and enrichment analyses. The reviewer is being a bit nit-picky, IMO, with this requirement. The simplest solution is to emphasize that ontology enrichment statistics are being used to qualitatively describe potential actions of selection, not to make assertions about selected traits.

However, it's probably worth thinking about what the reviewer's logic might be. I suspect the idea is this: selection operates on a haplotype of a certain size, and if you drop a large gene at a random position in the genome, it is more likely than a small gene to land on the selected haplotype, since a large gene may span multiple haplotype blocks. Therefore, you would expect gene sets with a preponderance of large genes to tend to show more enrichment.
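To make that intuition concrete, here is a minimal Python sketch. It assumes a single linear genome, one selected haplotype block, and a gene dropped uniformly at random; the function name and all sizes are hypothetical and chosen only for illustration, not taken from any real dataset.

```python
def overlap_probability(genome_len, gene_len, block_start, block_len):
    """Probability that a gene placed uniformly at random overlaps the block.

    Coordinates are 0-based. The gene occupies [x, x + gene_len) and the
    selected block occupies [block_start, block_start + block_len).
    """
    n_starts = genome_len - gene_len + 1           # all valid gene start positions
    first = max(0, block_start - gene_len + 1)     # leftmost start that overlaps
    last = min(genome_len - gene_len, block_start + block_len - 1)  # rightmost
    return max(0, last - first + 1) / n_starts


# Hypothetical numbers: 10 Mb genome, 50 kb selected block starting at 4 Mb.
for gene_len in (1_000, 10_000, 100_000):
    p = overlap_probability(10_000_000, gene_len, 4_000_000, 50_000)
    print(f"{gene_len:>7} bp gene: P(overlap) = {p:.5f}")
```

Under these assumptions the overlap probability grows roughly in proportion to gene length plus block length, which is the kind of length dependence the reviewer may have in mind.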

A straightforward way to address this would be to switch from a hypergeometric test (gene set over-representation style approach) to a logistic regression. There is an equivalence (in the limit) between the hypergeometric test and the model:

is.selected ~ is.in.geneset + 1

which effectively tests whether the presence of the particular gene-set increases the rate of observed selection. Gene covariates can be added to this model to control for various effects, for instance:

is.selected ~ is.in.geneset + log(gene.length) + 1

would control, to an arguably greater or lesser degree, for a positive relationship between gene length and the rate of observed selection.
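As a rough illustration of the two models above, the following Python sketch fits both by Newton-Raphson using only the standard library. The helper names (`solve_linear`, `fit_logistic`) and the toy gene table are hypothetical and made up for this example; in practice you would more likely use an existing routine such as R's `glm(..., family = binomial)`.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(rows, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression; each row starts with 1.0 (intercept)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve_linear(hess, grad)   # Newton step: H^-1 * gradient
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical toy data: (is_selected, is_in_geneset, gene_length_bp)
genes = [
    (1, 1, 52000), (1, 1, 8000), (0, 1, 30000), (1, 1, 41000),
    (0, 1, 2500), (1, 0, 60000), (0, 0, 1200), (0, 0, 3000),
    (0, 0, 900), (1, 0, 15000), (0, 0, 45000), (0, 0, 5000),
    (1, 0, 2000), (0, 0, 7000), (0, 1, 12000), (0, 0, 20000),
]
y = [g[0] for g in genes]

# is.selected ~ is.in.geneset + 1
naive = fit_logistic([[1.0, g[1]] for g in genes], y)

# is.selected ~ is.in.geneset + log(gene.length) + 1
adjusted = fit_logistic([[1.0, g[1], math.log(g[2])] for g in genes], y)

print("gene-set coefficient, unadjusted:     ", naive[1])
print("gene-set coefficient, length-adjusted:", adjusted[1])
```

In the unadjusted model the gene-set coefficient is the log odds ratio of the 2x2 selected-by-membership table, which is what ties it to the over-representation test. Comparing it with the length-adjusted coefficient shows how much of the apparent enrichment could be attributed to gene length.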