Question

How to estimate the minimal amount of sequencing required for a biomarker analysis?

6.3 years ago

Charles Plessy ★ 2.9k

I have two genomic intervals, let's call them Early and Late, and their activity is measured in raw counts in quantitative sequencing experiments; let's call these numbers E and L. In some kind of biomarker analysis, I am interested in knowing when E > L and when E > 3L. Obviously, the total number of counts (E + L) is a function of how much budget I spend on sequencing.

I am looking for a simple way to decide the minimum number of sequences needed for statements such as "E > L" to make sense. For instance, if E = 2 and L = 1, my experience in the field tells me that the total number of counts is too low to draw a conclusion. I have a rough intuition about some keywords that are relevant to my question (binomial distribution, confidence interval, Poisson noise, ...) but I am stuck. Could somebody suggest a method to determine the least amount of sequencing needed to confidently determine when E > nL?

statistics biomarker • 1.5k views

ADD COMMENT • link updated 6.3 years ago by Erik Arner ▴ 20 • written 6.3 years ago by Charles Plessy ★ 2.9k

Are you going to conduct RNA-seq on those two intervals? Or is it some targeted sequencing that you are looking for?

ADD REPLY • link 6.3 years ago by Shab86 ▴ 310

We are using CAGE (Cap Analysis Gene Expression) libraries of virus-infected cells, and the genomic intervals are viral promoters. (And yes, targeted enrichment is also planned, but that is a different story.)

ADD REPLY • link 6.3 years ago by Charles Plessy ★ 2.9k

Assuming the counts are Poisson-distributed with rate r, the Poisson distribution can be approximated by a Gaussian distribution with mean r and variance r when r is sufficiently large (above roughly 20; the approximation is already quite good below that and only improves as r increases). You could also view this as testing the ratio of the rates of two Poisson distributions; for this, have a look at the R package rateratio.test.
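A minimal sketch of both ideas, in case it helps. It uses base R only: `poisson.test` from the stats package stands in for the two-rate comparison (it is not the rateratio.test package named above), and the rates 5, 20 and 100 as well as the counts E = 200, L = 100 are illustrative values, not data from this thread.

```r
# Largest gap between the exact Poisson CDF and its Gaussian approximation
# (mean r, variance r, with a continuity correction of 0.5).
approx_error <- function(r) {
  k <- 0:(4 * r)
  exact <- ppois(k, lambda = r)
  gauss <- pnorm(k + 0.5, mean = r, sd = sqrt(r))
  max(abs(exact - gauss))
}

rates <- c(5, 20, 100)  # illustrative rates
errors <- sapply(rates, approx_error)
names(errors) <- rates
print(errors)  # the gap shrinks as r grows

# Two-sample comparison of Poisson rates with equal exposure (T = 1 for both).
# r is the rate ratio under the null hypothesis: r = 1 tests E > L, r = 3 tests E > 3L.
E <- 200; L <- 100  # illustrative counts
res_1x <- poisson.test(c(E, L), T = c(1, 1), r = 1, alternative = "greater")
res_3x <- poisson.test(c(E, L), T = c(1, 1), r = 3, alternative = "greater")
print(res_1x$p.value)  # evidence for E > L
print(res_3x$p.value)  # evidence for E > 3L
```

With these counts the data support E > L but not E > 3L, since E is only twice L.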

ADD REPLY • link 6.3 years ago by Jean-Karim Heriche 27k

score 2 · Accepted Answer · 2017-12-14

6.3 years ago

Erik Arner ▴ 20

How about doing a binomial test, where E (or L) is the number of successes, E + L is the number of trials, and p = 0.5? In R your example with E = 2 and L = 1 would then be:

binom.test(2, 3, p=0.5)

which would not be significantly different from p = 0.5, whereas e.g. E = 200 and L = 100 would be.
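The same test can be extended to the E > nL case from the question. The sketch below rests on one assumption: at the boundary E = nL, each count falls in Early with probability n / (n + 1), so that value replaces p = 0.5. The helper name `ratio_pvalue` is hypothetical, and it uses a one-sided alternative (unlike the two-sided default above) because the question is specifically whether E exceeds nL.

```r
# Hypothetical helper (not from the thread): one-sided binomial test of E > n * L.
# Under the boundary case E = n * L, a count lands in Early with probability n / (n + 1).
ratio_pvalue <- function(E, L, n = 1) {
  binom.test(E, E + L, p = n / (n + 1), alternative = "greater")$p.value
}

print(ratio_pvalue(2, 1))             # E > L with 3 counts in total
print(ratio_pvalue(200, 100))         # E > L with 300 counts in total
print(ratio_pvalue(200, 100, n = 3))  # E > 3L, but E is only 2L here
print(ratio_pvalue(400, 100, n = 3))  # E > 3L with E = 4L
```

The first and third calls give large p-values (too few counts, and a ratio below 3, respectively), while the second and fourth fall below 0.05.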

ADD COMMENT • link 6.3 years ago by Erik Arner ▴ 20

Thanks a lot Erik. Said differently, it looks like I would (for instance) need at least 100 counts if I want to be at least 95% sure that an E / (E + L) fraction of about 0.59 really indicates that E > L.

> library(magrittr)  # provides %>% and set_rownames
> sapply(1:10 * 10, function(n) binom.test(c(n/2, n/2), p=0.5, alternative = "l")$conf.int) %>% t %>% set_rownames(1:10 * 10)
    [,1]      [,2]
10     0 0.7775589
20     0 0.6980461
30     0 0.6611073
40     0 0.6389083
50     0 0.6237541
60     0 0.6125890
70     0 0.6039339
80     0 0.5969763
90     0 0.5912285
100    0 0.5863783
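A complementary way to look at the same thing, sketched under the assumption of a one-sided binomial test at alpha = 0.05: for each total, find the smallest Early count that would be called significant. The helper `min_significant_E` is a hypothetical name, and its `n` argument again encodes the E > nL boundary through p = n / (n + 1).

```r
# Hypothetical helper: smallest Early count E that is significant for a given total.
# qbinom() returns the smallest q with P(X <= q) >= 1 - alpha, so q + 1 is the
# smallest count k with P(X >= k) <= alpha.
# If the result exceeds the total, no outcome can reach significance at that depth.
min_significant_E <- function(total, n = 1, alpha = 0.05) {
  qbinom(1 - alpha, size = total, prob = n / (n + 1)) + 1
}

totals <- 1:10 * 10
thresholds <- data.frame(
  total = totals,
  min_E = min_significant_E(totals),
  min_fraction = min_significant_E(totals) / totals
)
print(thresholds)
```

For a total of 100 this gives a minimum of 59 Early counts, in line with the upper confidence bound of about 0.586 in the table above.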

ADD REPLY • link 6.3 years ago by Charles Plessy ★ 2.9k

Yes, but keep in mind that if you're testing multiple samples you will most likely have a multiple-testing issue, so you'll have to take that into account when choosing your required counts.
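As a rough illustration of that point, the sketch below adjusts per-sample p-values with `p.adjust` from base R. The counts are made-up numbers for five hypothetical samples, and the choice of Bonferroni and Benjamini-Hochberg corrections is an assumption about what would suit the experiment, not something prescribed in this thread.

```r
# Made-up (E, L) counts for five hypothetical samples; purely illustrative.
counts <- data.frame(
  E = c(60, 59, 70, 55, 120),
  L = c(40, 41, 30, 45, 80)
)

# One-sided binomial test of E > L for each sample.
p_raw <- mapply(
  function(E, L) binom.test(E, E + L, p = 0.5, alternative = "greater")$p.value,
  counts$E, counts$L
)

# Adjust for the number of samples tested.
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls family-wise error rate
p_bh   <- p.adjust(p_raw, method = "BH")          # controls false discovery rate

print(cbind(counts, p_raw, p_bonf, p_bh))
```

Because adjusted p-values are never smaller than the raw ones, a sample that is only borderline significant on its own will need more counts to stay significant once several samples are tested together.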

ADD REPLY • link 6.3 years ago by Erik Arner ▴ 20