I would like to test whether genes containing at least one binding site for a transcription factor (say, MEF2A) are enriched for certain categories.
I could easily build a list of TF-containing genes by intersecting the TF binding site BED file with a gene annotation BED file, and then submit that list for enrichment analysis.
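To make the intersection step concrete, here is a minimal sketch in plain Python. It assumes both files have already been parsed into tuples with BED-style coordinates (0-based, half-open); the function name `genes_with_sites` and the toy coordinates are made up for illustration, and in practice a tool such as bedtools would do this on the real files.

```python
from collections import defaultdict


def genes_with_sites(genes, sites):
    """Return names of genes that overlap at least one binding site.

    genes: iterable of (chrom, start, end, name)
    sites: iterable of (chrom, start, end)
    Coordinates are assumed to be BED-style: 0-based, half-open.
    """
    # Group sites by chromosome so each gene is only compared to sites on its own chromosome.
    sites_by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        sites_by_chrom[chrom].append((start, end))

    hits = []
    for chrom, g_start, g_end, name in genes:
        # Two half-open intervals overlap when each one starts before the other ends.
        if any(s_start < g_end and g_start < s_end
               for s_start, s_end in sites_by_chrom[chrom]):
            hits.append(name)
    return hits


# Toy data (hypothetical coordinates, not real annotations).
genes = [
    ("chr1", 100, 1000, "geneA"),
    ("chr1", 2000, 2500, "geneB"),
    ("chr2", 100, 1000, "geneC"),
]
sites = [("chr1", 950, 1010), ("chr2", 1000, 1020)]

print(genes_with_sites(genes, sites))
```

Note that the site on chr2 starts exactly where geneC ends, so under half-open coordinates it does not count as an overlap.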
But the question is: a large gene is naturally more likely to contain TF binding sites. So should I first control for gene size?
Should I normalize by assigning each gene a score of (overlap size)/(gene size), and then sort by that score and select, say, the top 200 or 500 genes?
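The normalization I have in mind could be sketched roughly as follows. This is only an illustration of the (overlap size)/(gene size) idea, not a claim that it is the right correction: it assumes the binding sites have been merged so they do not overlap each other, that every gene has nonzero length, and that coordinates are BED-style (0-based, half-open). The function name `rank_by_overlap_fraction` and the toy data are hypothetical.

```python
def rank_by_overlap_fraction(genes, sites, top_n=200):
    """Score each gene by (overlapping bp) / (gene length) and return the top_n.

    genes: list of (chrom, start, end, name)
    sites: list of (chrom, start, end), assumed non-overlapping (merged)
    Returns a list of (name, score) sorted by score, highest first.
    """
    scored = []
    for chrom, g_start, g_end, name in genes:
        overlap_bp = 0
        for s_chrom, s_start, s_end in sites:
            if s_chrom == chrom:
                # Length of the intersection of the two intervals, or 0 if they are disjoint.
                overlap_bp += max(0, min(g_end, s_end) - max(g_start, s_start))
        scored.append((name, overlap_bp / (g_end - g_start)))

    # Highest overlap fraction first; sorted() is stable, so ties keep input order.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy data (hypothetical coordinates, not real annotations).
example_genes = [
    ("chr1", 100, 1100, "geneA"),   # 1000 bp, 100 bp covered by a site
    ("chr1", 2000, 2100, "geneB"),  # 100 bp, 30 bp covered by a site
    ("chr2", 0, 500, "geneC"),      # no sites
]
example_sites = [("chr1", 200, 300), ("chr1", 2050, 2080)]

for name, score in rank_by_overlap_fraction(example_genes, example_sites, top_n=2):
    print(name, score)
```

In this toy case the short gene ranks first even though the long gene has more covered bases, which is exactly the behavior I am unsure about.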