Question

Gene length bias for ontology analysis.

8.0 years ago

michealsmith ▴ 790

I need to study the genomic distribution of certain transposable elements. I first retrieved the coordinates of the elements from RepeatMasker in BED format (chr:start-end), then intersected them with an hg19 gene BED file. My goal now is to find out whether genes containing at least one such transposon are enriched for certain categories, for example GO terms.

For instance:

GeneA: chr1:20000-50000
containing two transposonD:
chr1:25000-26000
chr1:31000-32000

GeneB: chr3:40000-80000
containing one transposonD:
chr3:60000-62000
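
To make the counting step concrete, here is a minimal Python sketch of the intersection on the toy coordinates above. The function name `count_overlaps` and the dictionary layout are only illustrative, and intervals are treated as half-open (BED-style), which is an assumption about the input.

```python
# Toy coordinates copied from the example; the data layout is illustrative only.
genes = {
    "GeneA": ("chr1", 20000, 50000),
    "GeneB": ("chr3", 40000, 80000),
}
transposons = [
    ("chr1", 25000, 26000),
    ("chr1", 31000, 32000),
    ("chr3", 60000, 62000),
]

def count_overlaps(genes, transposons):
    """Return {gene: number of transposons overlapping it} (half-open intervals)."""
    counts = {name: 0 for name in genes}
    for name, (g_chr, g_start, g_end) in genes.items():
        for t_chr, t_start, t_end in transposons:
            # Two intervals on the same chromosome overlap if each starts before the other ends.
            if t_chr == g_chr and t_start < g_end and t_end > g_start:
                counts[name] += 1
    return counts

counts = count_overlaps(genes, transposons)
print(counts)  # {'GeneA': 2, 'GeneB': 1}
```

The nested loop is fine for a toy case; for genome-wide data a dedicated interval tool would be the more realistic choice.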

My question is: should gene length bias be taken into account? A very long gene is naturally more likely to contain transposable elements. Or does GO analysis already account for this?

I searched the literature and found discussion of length bias for RNA-seq data, but nothing that covers my problem here. Thanks.

gene ontology • 1.7k views

updated 7.9 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.5k • written 8.0 years ago by michealsmith ▴ 790

score 4 · Answer 1 · 2016-05-10

To my knowledge, GO terms are curated based on the function of genes (and the protein's cellular location), so gene length plays no role in how the terms themselves are assigned.

However, you do need to control for long genes (many of which are neuronally expressed), because they are more likely to contain transposable elements by chance alone.

Sounds like a permutation test is in order. If you randomly place the transposon set on the genome X times and intersect it with the genes each time, do you see the same enrichment of GO terms under the null?
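
A rough sketch of that permutation idea in Python, under heavy simplifying assumptions: a single linear "genome", transposons reduced to point positions, and a uniform random placement as the null. The names `genes_hit` and `permutation_pvalue` and the toy data are hypothetical. With real data you would shuffle intervals across the actual chromosomes instead (bedtools shuffle is one tool that does this).

```python
import random

def genes_hit(genes, positions):
    """Return the set of gene names containing at least one position."""
    return {name for name, (start, end) in genes.items()
            if any(start <= p < end for p in positions)}

def permutation_pvalue(genes, term_genes, observed_positions,
                       genome_length, n_perm=1000, seed=0):
    """Empirical p-value for the number of term genes hit by transposons.

    Null model (an assumption): every transposon lands at a uniformly random
    position, so long genes are hit more often in the null as well.
    """
    rng = random.Random(seed)
    observed = len(genes_hit(genes, observed_positions) & term_genes)
    n = len(observed_positions)
    at_least = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(genome_length) for _ in range(n)]
        if len(genes_hit(genes, shuffled) & term_genes) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical toy data: one long gene annotated with the term, three short genes without it.
genes = {"long1": (0, 50000), "s1": (60000, 61000),
         "s2": (70000, 71000), "s3": (80000, 81000)}
term_genes = {"long1"}
observed_positions = [1000, 20000, 60500]

observed, p = permutation_pvalue(genes, term_genes, observed_positions,
                                 genome_length=100000)
print(observed, p)
```

In this toy case the long gene covers half the genome, so random placements hit it most of the time and the p-value comes out large: the apparent "enrichment" is explained by length alone, which is exactly what the null is meant to expose.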