There are certainly competing views on how to answer these questions, so I'll present my thoughts, but take them with a grain of salt.
- Why not just use the entire genome as a control, since that is the entire population? What is the use of introducing sampling error at this point?
Because this may not be appropriate. For example, in microarray experiments not every gene in the genome is measured, so the background should be all genes on the chip (everything that you could have measured). With modern chips this is less of an issue since most chips cover the whole genome, but when dealing with historical data (or data from custom chips, i.e. chips which only measure "drug-able targets") it's very important to know what the background is.
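To make that concrete, here is a minimal sketch (with made-up counts, using scipy's hypergeometric tail) of how the same overlap can look dramatically more or less enriched depending on which background you condition on:

```python
from scipy.stats import hypergeom

# Hypothetical counts: 40 of our 200 hit genes carry some annotation.
hits, annotated_hits = 200, 40

def enrichment_p(background_size, annotated_in_background):
    # P(X >= annotated_hits) when drawing `hits` genes from a background
    # containing `annotated_in_background` genes with the annotation.
    return hypergeom.sf(annotated_hits - 1, background_size,
                        annotated_in_background, hits)

# Whole genome as background (~20,000 genes, 1,500 of them annotated).
print(enrichment_p(20000, 1500))
# Only the genes actually on the chip (~8,000 genes, 1,200 annotated).
print(enrichment_p(8000, 1200))
```

The overlap is identical in both calls, but against the chip-only background the enrichment is far less surprising, because annotated genes were over-represented on the chip to begin with.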
- The size of a control set - should it be 10X the size of the experimental set? What are some heuristics for choosing a size?
I assume by "control set" you mean the "normal" samples. In order to calculate this you need some idea of the effect size you're actually trying to measure: obviously, the smaller the effect, the more samples you'll need. However, I have yet to see a genomics experiment that actually sets forth its reasoning for choosing the number of experimental-to-normal samples; most are simply chosen for budgetary reasons.
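For what it's worth, a standard power calculation is the usual way to put a number on this. A minimal sketch, assuming a two-sample t-test framing via statsmodels (the Cohen's d of 0.8 and the 2:1 control ratio are arbitrary illustration values):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the number of experimental samples needed to detect a
# standardized effect (Cohen's d) of 0.8 at alpha = 0.05 with 80% power,
# taking two controls per experimental sample (ratio = nobs2 / nobs1).
analysis = TTestIndPower()
n_experimental = analysis.solve_power(effect_size=0.8, alpha=0.05,
                                      power=0.8, ratio=2.0)
print(round(n_experimental))
```

On the 10X question specifically: the `ratio` argument is the knob to turn, and if you do, you'll find the power gain from extra controls plateaus quickly; beyond roughly 3-4 controls per experimental sample there is little left to buy.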
- The appropriate statistic for comparing discrete annotational counts (Fisher's Exact Test, chi-square test, or glm)
I personally prefer the hypergeometric test or Fisher's Exact (depending on size constraints); however, nowadays I don't actually calculate my own enrichment values. Keeping annotations up-to-date is a full-time job, so I use the DAVID tool, which reports a hypergeometric p-value, a Fisher's exact p-value, and an EASE score (a conservative variant of Fisher's exact), along with multiple-testing corrections.
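If you do want to roll your own, a minimal sketch with hypothetical counts: the one-sided Fisher's exact p-value on the 2x2 table is exactly the hypergeometric upper tail, and statsmodels provides Benjamini-Hochberg correction across terms:

```python
from scipy.stats import fisher_exact, hypergeom
from statsmodels.stats.multitest import multipletests

# Hypothetical counts for one annotation term.
background, annotated = 8000, 1200  # genes on chip / genes with the term
hits, annotated_hits = 200, 40      # our gene list / list members with term

# 2x2 table: rows = in/out of the gene list, cols = with/without the term.
table = [[annotated_hits, hits - annotated_hits],
         [annotated - annotated_hits,
          background - annotated - hits + annotated_hits]]
_, p_fisher = fisher_exact(table, alternative="greater")

# Same value from the hypergeometric upper tail, P(X >= annotated_hits).
p_hyper = hypergeom.sf(annotated_hits - 1, background, annotated, hits)
assert abs(p_fisher - p_hyper) < 1e-9

# Benjamini-Hochberg across all tested terms (dummy p-values for the rest).
reject, p_adj, _, _ = multipletests([p_fisher, 0.03, 0.2, 0.8],
                                    method="fdr_bh")
print(p_fisher, p_adj)
```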
Hope that helps.
8.0 years ago by
Will ♦ 4.5k