Question: How to estimate the minimal amount of sequencing required for a biomarker analysis ?
0
4 months ago by
Charles Plessy2.4k
Japan
Charles Plessy2.4k wrote:

I have two genomic intervals, let's call them Early and Late, and their activity is measured in raw counts in quantitative sequencing experiments; let's call these numbers E and L. In some kind of biomarker analysis, am interested to know when E > L and when E > 3L. Obviously, the total number of counts (E + L) is a function of how much budget I spend on sequencing.

I am looking for a simple way to decide what is the minimum amount of sequences for statements such as "E > L" to make sense. For instance, if E = 2 and L = 1, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords that are relevant to answer my question (binomial distribution, confidence interval, Poisson noise, ...) but I am stuck. Could somebody suggest me a method to determine what is the least amount of sequencing needed to determine confidently when E > nL ?

statistics biomarker • 269 views
modified 4 months ago by Erik Arner20 • written 4 months ago by Charles Plessy2.4k

Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?

We are using CAGE (Cap Analysis Gene Expression) libraries of virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichement is also planned, but that is a different story.)

Assuming the counts are Poisson-distributed with rate r, for r sufficiently large (> ~20, but the approximation is already quite good before this, it only improves as r increases), the Poisson distribution could be approximated by a Gaussian distribution with mean r and variance r. You could also view this as testing the ratio of the rates of two Poisson distributions, for this have a look at the R package rateratio.test.

2
4 months ago by
Erik Arner20
Erik Arner20 wrote:

How about doing a binomial test, where E (or L) is the number of successes, E + L is the number of trials, and p = 0.5? In R your example with E = 2 and L = 1 would then be:

``````binom.test(2, 3, p=0.5)
``````

which would not be significantly different, whereas e.g. E = 200 and L = 100 would be.

Thanks a lot Erik, so said differently, it looks like I would (for instance) need at least 100 counts if I want to be at least 95% sure that a E / L ratio of 0.58 really indicates that E > L.

``````> sapply(1:10 * 10, function(n) binom.test(c(n/2, n/2), p=0.5, alternative = "l")\$conf.int) %>% t %>% set_rownames(1:10 * 10)
[,1]      [,2]
10     0 0.7775589
20     0 0.6980461
30     0 0.6611073
40     0 0.6389083
50     0 0.6237541
60     0 0.6125890
70     0 0.6039339
80     0 0.5969763
90     0 0.5912285
100    0 0.5863783
``````