
I have two genomic intervals, let's call them *Early* and *Late*, and their activity is measured as raw counts in quantitative sequencing experiments; let's call these numbers *E* and *L*. In a biomarker analysis, I am interested in knowing when *E > L* and when *E > 3L*. Obviously, the total number of counts (*E + L*) is a function of how much budget I spend on sequencing.

I am looking for a simple way to decide the minimum number of sequences needed for statements such as "*E > L*" to make sense. For instance, if *E = 2* and *L = 1*, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords that are relevant to my question (*binomial distribution*, *confidence interval*, *Poisson noise*, ...) but I am stuck. Could somebody suggest a method to determine the least amount of sequencing needed to conclude confidently that *E > nL*?

written 13 months ago by Charles Plessy


Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?

We are using CAGE (Cap Analysis Gene Expression) libraries of virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichment is also planned, but that is a different story.)

Assuming the counts are Poisson-distributed with rate r, the Poisson distribution can be approximated by a Gaussian distribution with mean r and variance r once r is sufficiently large (above roughly 20; the approximation is already quite good below that and only improves as r increases). You could also view this as testing the ratio of the rates of two Poisson distributions; for this, have a look at the R package rateratio.test.
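To make the rate-ratio view concrete, here is a minimal Python sketch of the usual conditional argument: if *E* and *L* are independent Poisson counts, then given the total *E + L*, *E* is binomial with success probability n / (1 + n) at the boundary of the null hypothesis "rate of Early ≤ n × rate of Late". This is only an illustration under that Poisson assumption, not a reimplementation of rateratio.test. The function names and the `alpha`, `power` and `max_total` defaults are my own choices, and `true_ratio` is a value you have to guess for the real ratio.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Probability of exactly k successes for Binomial(n, p), via log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_pmf)


def p_value(e, l, ratio=1.0):
    """One-sided p-value for H0: rate_E <= ratio * rate_L.

    Conditional on the total e + l, E ~ Binomial(e + l, p0) at the null
    boundary, where p0 = ratio / (1 + ratio).
    """
    n = e + l
    p0 = ratio / (1.0 + ratio)
    return sum(binom_pmf(k, n, p0) for k in range(e, n + 1))


def min_total_counts(true_ratio, ratio=1.0, alpha=0.05, power=0.8,
                     max_total=2000):
    """Smallest total E + L at which the test above reaches the given power.

    true_ratio is an ASSUMED real value of rate_E / rate_L (must exceed
    ratio). Returns None if max_total counts are not enough.
    """
    p0 = ratio / (1.0 + ratio)
    p1 = true_ratio / (1.0 + true_ratio)
    for n in range(1, max_total + 1):
        null_tail = 0.0  # P(E >= k) under the null boundary
        alt_tail = 0.0   # probability of the rejection region under true_ratio
        # Grow the rejection region downwards from k = n while its
        # null probability stays within alpha.
        for k in range(n, -1, -1):
            null_tail += binom_pmf(k, n, p0)
            if null_tail > alpha:
                break
            alt_tail += binom_pmf(k, n, p1)
        if alt_tail >= power:
            return n
    return None


if __name__ == "__main__":
    print(p_value(2, 1))              # E = 2, L = 1, testing E > L
    print(p_value(20, 10))            # same 2:1 ratio with ten times the counts
    print(p_value(40, 10, ratio=3))   # testing E > 3L
    print(min_total_counts(true_ratio=3.0))
```

With *E = 2* and *L = 1* the p-value is 0.5, which matches the intuition that three counts say nothing; with *E = 20* and *L = 10* it drops to about 0.049. `min_total_counts` then answers the budget question for a guessed true ratio. Because the binomial is discrete, the power does not rise smoothly with the total, so treat the returned number as a lower bound and add some margin.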
