Hi there, I want to perform a transcription factor (TF) binding analysis. My goal is to find TFs that are over-represented or under-represented in one gene set relative to another gene set.
I have two gene sets (setA and setB). setA contains 1000 genes and setB contains 5000 genes.
I extracted the promoter region of each gene in these two gene sets.
For each gene set, I record the number of genes whose promoter region is bound by TF A (for example, 800 for setA and 300 for setB). I also record the number of genes whose promoter region is not bound by TF A (for example, 200 for setA and 4700 for setB). I downloaded the full set of TF binding profiles from the JASPAR database. There are over 500 TF binding profiles; I am just using TF A as an example here.
Now I have four numbers, so I perform a chi-squared test using the chisq.test function in R and get the P value.
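To make this concrete, here is roughly what the test looks like for TF A with the example counts above. It is a minimal sketch: the object names (`counts`, `chi`, `odds_ratio`) are only illustrative, and the odds ratio is an extra step I added because the P value alone does not say whether TF A is over- or under-represented.

```r
# 2x2 contingency table built from the example counts for TF A
counts <- matrix(
  c(800,  200,
    300, 4700),
  nrow = 2, byrow = TRUE,
  dimnames = list(gene_set = c("setA", "setB"),
                  TF_A     = c("bound", "not_bound"))
)

# Chi-squared test of independence; for 2x2 tables chisq.test applies
# Yates' continuity correction by default
chi <- chisq.test(counts)
chi$p.value

# Sample odds ratio for the direction of the effect:
# > 1 means TF A is bound more often in setA than in setB, < 1 the opposite
odds_ratio <- (counts["setA", "bound"] * counts["setB", "not_bound"]) /
  (counts["setA", "not_bound"] * counts["setB", "bound"])
odds_ratio
```

In the real analysis this would be repeated for each of the 500+ TF profiles, producing one P value and one odds ratio per TF.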
The first question is whether the above approach is OK or not.
For some reason, the promoter regions of the genes in setA and setB are not guaranteed to have the same length, although the average promoter length in the two gene sets is quite similar. I think I should adjust for this, because a longer promoter region should have a higher chance of being bound. The second question is: how should I adjust for it?
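One adjustment I have been considering, without being sure it is appropriate, is a per-gene logistic regression with promoter length as a covariate instead of the 2x2 table. The sketch below uses simulated data only: the data frame `genes`, its columns (`gene_set`, `promoter_len`, `bound`), and all simulation parameters are made up for illustration and are not my real data.

```r
set.seed(1)

# Simulated per-gene table, for illustration only
n_a <- 1000
n_b <- 5000
genes <- data.frame(
  gene_set     = factor(rep(c("B", "A"), c(n_b, n_a)), levels = c("B", "A")),
  promoter_len = round(runif(n_a + n_b, min = 500, max = 2000))
)

# Assumed truth for the simulation: binding probability depends on both
# the gene set and the (log) promoter length
lin_pred <- -6 + 2 * (genes$gene_set == "A") + 0.6 * log(genes$promoter_len)
genes$bound <- rbinom(nrow(genes), size = 1, prob = plogis(lin_pred))

# Logistic regression: effect of gene set on binding, adjusted for length
fit <- glm(bound ~ gene_set + log(promoter_len),
           data = genes, family = binomial)
coefs <- summary(fit)$coefficients

# Log odds ratio (setA vs setB) and its P value after length adjustment
coefs["gene_setA", c("Estimate", "Pr(>|z|)")]
```

Here the `gene_setA` coefficient would play the role of the chi-squared result: a positive estimate suggests over-representation in setA and a negative one under-representation, with promoter length held fixed.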