Question: How to estimate the minimal amount of sequencing required for a biomarker analysis ?
0
13 months ago by
Charles Plessy2.6k
Japan
Charles Plessy2.6k wrote:

I have two genomic intervals, let's call them Early and Late, and their activity is measured in raw counts in quantitative sequencing experiments; let's call these numbers E and L. In some kind of biomarker analysis, am interested to know when E > L and when E > 3L. Obviously, the total number of counts (E + L) is a function of how much budget I spend on sequencing.

I am looking for a simple way to decide what is the minimum amount of sequences for statements such as "E > L" to make sense. For instance, if E = 2 and L = 1, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords that are relevant to answer my question (binomial distribution, confidence interval, Poisson noise, ...) but I am stuck. Could somebody suggest me a method to determine what is the least amount of sequencing needed to determine confidently when E > nL ?

statistics biomarker • 442 views
modified 13 months ago by Erik Arner20 • written 13 months ago by Charles Plessy2.6k

Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?

We are using CAGE (Cap Analysis Gene Expression) libraries of virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichement is also planned, but that is a different story.)

Assuming the counts are Poisson-distributed with rate r, for r sufficiently large (> ~20, but the approximation is already quite good before this, it only improves as r increases), the Poisson distribution could be approximated by a Gaussian distribution with mean r and variance r. You could also view this as testing the ratio of the rates of two Poisson distributions, for this have a look at the R package rateratio.test.

2
13 months ago by
Erik Arner20
Erik Arner20 wrote:

How about doing a binomial test, where E (or L) is the number of successes, E + L is the number of trials, and p = 0.5? In R your example with E = 2 and L = 1 would then be:

binom.test(2, 3, p=0.5)

which would not be significantly different, whereas e.g. E = 200 and L = 100 would be.

Thanks a lot Erik, so said differently, it looks like I would (for instance) need at least 100 counts if I want to be at least 95% sure that a E / L ratio of 0.58 really indicates that E > L.

> sapply(1:10 * 10, function(n) binom.test(c(n/2, n/2), p=0.5, alternative = "l")\$conf.int) %>% t %>% set_rownames(1:10 * 10)
[,1]      [,2]
10     0 0.7775589
20     0 0.6980461
30     0 0.6611073
40     0 0.6389083
50     0 0.6237541
60     0 0.6125890
70     0 0.6039339
80     0 0.5969763
90     0 0.5912285
100    0 0.5863783

Yes, but keep in mind that if you're doing multiple samples you may (will) have a multiple testing issue so you'll have to take that into account when choosing your required counts.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.