Question

Small Sample Size (2 vs 3) in RNA-Seq DGE Shows Statistically Significant Result

0

3.8 years ago

andreasmichaelsh • 0

I have recently obtained RNA-Seq data of tumor samples from a pilot study in my lab and just finished applying Differential Gene Expression analysis on the data.

The Background: Due to some restrictions related to the fiscal year on which the funding for this experiment was budgeted, only 15 samples could be run. The person we consulted regarding the experiment advised us to use three technical replicates, something we later (after the experiment was finished) found to be unnecessary (?). As a result, our data consist of:

2 biological replicates of Condition A, each with 3 technical replicates

AND

3 biological replicates of Condition B, each with 3 technical replicates

Definition

Biological replicates: Samples from different individuals, with tumor profiles and clinical confounders matched as closely as possible, each exhibiting the factor of interest, either Condition A or Condition B

Technical replicates: RNA from the same sample run on the same day (same batch)

The Result of Analysis

After summing the technical replicates' counts, we ran DESeq2 and found 6 differentially expressed genes (adjusted p-value < 0.05) between the two conditions.
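For readers unfamiliar with the collapsing step, here is a minimal sketch of summing technical replicates per biological sample before testing. The run names, gene names, counts, and the `collapse_technical_replicates` helper are all hypothetical; in R, DESeq2 offers `collapseReplicates` for the same purpose. The point is that after collapsing, the design is still 2 vs 3 biological samples.

```python
from collections import defaultdict


def collapse_technical_replicates(counts, run_to_sample):
    """Sum per-gene counts over runs belonging to the same biological sample.

    counts:        {run_id: {gene: count}}   (hypothetical layout)
    run_to_sample: {run_id: biological_sample_id}
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for run_id, gene_counts in counts.items():
        sample = run_to_sample[run_id]
        for gene, n in gene_counts.items():
            collapsed[sample][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(genes) for sample, genes in collapsed.items()}


# Toy data: two biological samples, three technical runs each (made-up numbers).
counts = {
    "A1_run1": {"GENE1": 10, "GENE2": 0},
    "A1_run2": {"GENE1": 12, "GENE2": 1},
    "A1_run3": {"GENE1": 9, "GENE2": 0},
    "B1_run1": {"GENE1": 30, "GENE2": 5},
    "B1_run2": {"GENE1": 28, "GENE2": 4},
    "B1_run3": {"GENE1": 33, "GENE2": 6},
}
# Assumption: the biological sample ID is the part of the run name before "_".
run_to_sample = {run: run.split("_")[0] for run in counts}

collapsed = collapse_technical_replicates(counts, run_to_sample)
print(collapsed)
```

Summing raw counts (rather than averaging) keeps the data as integer counts, which is what count-based models such as DESeq2's expect.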

The Question

From what I understand, sample size determines power, which is the probability of rejecting the null hypothesis when it is in fact false (that is, 1 minus the probability of a type II error, or false negative). Am I correct in assuming that sample size does not have any effect on alpha (the probability of a false positive)? I have read in this forum and in some journal articles that the sample size in an RNA-Seq experiment should be at least 3 vs 3.
There has been talk of (A) just using this data instead of (B) designing a larger experiment (which is obviously a more expensive option). Is option (A) still a scientifically (and statistically) valid option, considering the sample size?

Thank you for considering my questions. This is my first post in this forum! I have just started working in the field, and browsing past questions on this forum has helped answer my questions on many occasions. Looking forward to contributing in the years to come.

Best regards,

Michael

RNA-Seq sample size differential expression • 3.0k views

updated 3.8 years ago by i.sudbery 19k • written 3.8 years ago by andreasmichaelsh • 0

2

3.8 years ago

Kevin Blighe 87k

Hey Michael,

From my perspective, if this is just pilot data, then the current set-up is okay, but could be better. Because biology doesn't follow rules, having more samples lets us 'capture' more of the variability that can exist in both a normal and a disease population.

As you have probably seen, some users come here to ask about 1 versus 1 comparisons, and they have no technical or biological replicates. This is statistically possible to do, but the 'generalisability' of the results of such a comparison [to a broader population] is limited.

Your work would obviously not get published in any major journal. However, if it is merely for 'hypothesis generation', then that seems fine. The idea is that a larger study will come, correct?

Kevin

3.8 years ago by Kevin Blighe 87k

0

Yes, we are hopefully planning a larger study. As you said, we are trying to formulate a more specific research question based on the results of the pilot.

Thanks, Kevin, for taking the time to answer my question.

Cheers!

3.8 years ago by andreasmichaelsh • 0

0

You're very welcome

3.8 years ago by Kevin Blighe 87k

Kevin Blighe · Accepted Answer · 2020-07-26

Hi Michael,

Just to add to what Kevin said: your study is technically "okay". That is, you've not done anything wrong. Your replicates are on the low side, given that these are samples from different human patients (which introduces a lot of variability) rather than, say, a clonal cell line. However, this is reflected in the small number of DE genes you have found.

In response to your question 1: low-powered studies DO suffer from an increased chance that any given hit is a false positive. This is because, as sample size falls, the power to detect true positives goes down faster than the probability of a false positive, so false positives make up a larger share of the hits. Imagine a situation where the power to detect true positives was 0. Any hit you got would then necessarily be a false positive!
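To put rough numbers on that argument, here is a small sketch. It assumes a fixed per-test p-value threshold (not an adjusted one), and the gene counts, threshold, and power values are invented for illustration; `false_share_of_hits` is a hypothetical helper, not part of any package. It computes the ratio of expected false hits to expected total hits as power drops.

```python
def false_share_of_hits(n_null, n_true, alpha, power):
    """Ratio of expected false hits to expected total hits.

    n_null: number of genes with no real difference
    n_true: number of genes that really differ
    alpha:  per-test false-positive probability (fixed threshold)
    power:  probability of detecting a gene that really differs
    """
    expected_false_hits = alpha * n_null
    expected_true_hits = power * n_true
    total = expected_false_hits + expected_true_hits
    return expected_false_hits / total if total > 0 else float("nan")


# Hypothetical experiment: 9000 unchanged genes, 1000 changed, threshold 0.001.
# The expected number of false hits is the same (9) at every power level;
# only the number of true hits shrinks.
for power in (0.8, 0.2, 0.05, 0.0):
    share = false_share_of_hits(n_null=9000, n_true=1000, alpha=0.001, power=power)
    print(f"power={power:.2f}  false share of hits={share:.3f}")
```

At zero power the share is 1: every hit is a false positive, which is the limiting case described above.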

The FDR you get from a test is an estimate of the average FDR. That is, at a given threshold (say 5%), if you repeated the experiment an infinite number of times, the fraction of hits that are false positives, averaged across all the trials, would be 5% or less. It doesn't guarantee that the fraction of false positives in any single experiment is 5%. With only 6 hits, for example, the realised fraction can only be 0, 1/6, 2/6 and so on.
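The "average across repeats" idea can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of RNA-Seq data: it uses independent one-sided z-tests, 900 null and 100 truly shifted "genes" with an arbitrary effect size, and a plain Benjamini-Hochberg step-up procedure (the default adjustment in DESeq2). The function names and all numbers are made up for illustration.

```python
import random
from statistics import NormalDist, mean


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the set of indices rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if p_values[idx] <= fdr * rank / m:
            n_reject = rank
    return set(order[:n_reject])


def simulate_fdp(rng, n_null=900, n_true=100, effect=3.0, fdr=0.05):
    """Run one simulated experiment; return the realised false discovery proportion."""
    std_normal = NormalDist()
    is_null = [True] * n_null + [False] * n_true
    z_scores = [rng.gauss(0.0 if null else effect, 1.0) for null in is_null]
    p_values = [1.0 - std_normal.cdf(z) for z in z_scores]  # one-sided test
    hits = benjamini_hochberg(p_values, fdr)
    if not hits:
        return 0.0  # convention: no hits means no false discoveries
    return sum(is_null[i] for i in hits) / len(hits)


rng = random.Random(1)  # fixed seed so the run is reproducible
fdps = [simulate_fdp(rng) for _ in range(200)]
print(f"mean FDP over 200 experiments: {mean(fdps):.3f}")
print(f"smallest and largest single-experiment FDP: {min(fdps):.3f}, {max(fdps):.3f}")
```

The mean stays near or below the 5% target, while individual experiments scatter around it, some with no false positives among their hits and some with noticeably more than 5%.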