Question

Low read counts in one of three biological replicates. Remove?

2.8 years ago

nadal-t ▴ 20

Hi folks,

I am processing RNA-Seq data (2 plant genotypes collected over 3 time points). Each genotype has 2 conditions, control and treatment, and each condition has 3 biological replicates. For 2 out of 3 time points, I notice that one biological replicate, in either control or treatment, has much lower read counts than the other replicates of the same condition (e.g. 1-3 million vs. 9-12 million). I normalize the data with TMM before calling DEGs using edgeR, which I thought should handle the differences in read counts. However, the number of DEGs almost doubles when I repeat the analysis without the low-read-count sample (110 with it vs 210 without).

I am considering whether I should remove the samples with low read counts from the analysis. The downside is that I would have only 2 biological replicates left for calling DEGs. Would you mind sharing your thoughts or suggestions?

Thanks a lot in advance!

RNA-Seq • 1.0k views

score 0 · Answer 1 · 2021-07-28

A first diagnostic is to run a PCA to see how the samples cluster. If the low-depth sample clusters away from the other members of its condition, it might be a good idea to remove it. You are basically checking whether there is a data-driven reason to exclude a sample. Clustering by read depth that normalization cannot compensate for would be such a reason, since the difference between samples would then be technical rather than biological, and the biological difference is what you are usually interested in.

Very minimal code example with PCAtools and dummy data:

library(PCAtools)
library(edgeR)
library(matrixStats) #/ provides rowVars(), which PCAtools and edgeR do not attach

y <- sapply(1:9, function(x) rnorm(10000, 1000, 50))
logcpms <- cpm(y, log=TRUE)
colnames(logcpms) <- paste0("sample", 1:ncol(y))

#/ PCA with top-500 most variable genes:
pca1 <- pca(mat = logcpms[head(order(rowVars(logcpms), decreasing=TRUE), n=500),],
            metadata = data.frame(row.names = colnames(logcpms)))

#/ plot it:
biplot(pca1)
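
To put a number on what the biplot shows, one could also check whether PC1 simply tracks library size. This is only a rough sketch on simulated counts: the depths, gene means and the shallow "sample3" are made-up assumptions, and base `prcomp` is swapped in for PCAtools to keep it short. A strong correlation between PC1 and log library size after TMM would hint that depth, not biology, drives the separation.

```r
library(edgeR)

set.seed(1)

#/ Simulated counts (assumption): 2000 genes x 6 samples,
#/ with sample3 sequenced much shallower than the rest.
n_genes <- 2000
depth   <- c(10, 11, 1.3, 9, 12, 10)                  # hypothetical relative depths
mu      <- rgamma(n_genes, shape = 0.5, scale = 200)  # hypothetical gene means
counts  <- sapply(depth, function(d) rnbinom(n_genes, mu = mu * d, size = 10))
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- paste0("sample", seq_len(ncol(counts)))

#/ TMM normalization followed by log-CPM, as in a standard edgeR workflow:
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
logcpms <- cpm(y, log = TRUE)

#/ PCA on the 500 most variable genes (samples in rows for prcomp):
top <- head(order(apply(logcpms, 1, var), decreasing = TRUE), 500)
pc  <- prcomp(t(logcpms[top, ]))

#/ Compare PC1 scores against library size:
lib_size     <- y$samples$lib.size
pc1_vs_depth <- cor(pc$x[, "PC1"], log10(lib_size), method = "spearman")

print(data.frame(sample = colnames(counts), lib_size = lib_size, PC1 = pc$x[, "PC1"]))
print(pc1_vs_depth)
```

With only a handful of samples the correlation is noisy, so treat it as a hint to look at alongside the plot, not as a test.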

After all, 3 is better than 2, so if depth is the only issue, can't you get some more reads for that library by simply sequencing the same sample again? The difference between n=2 and n=3 seems notable from what you describe, so I guess keeping that one sample would be worth investing some $ in sequencing it a bit deeper. If you have a local facility, maybe they can spike it into an existing run so the costs would be moderate.
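
If you do resequence, the reads from both runs of the same library are technical replicates, so their counts can be added up per gene before the DE analysis. A minimal sketch with edgeR's `sumTechReps`, assuming the same library is sequenced again; the sample names, depths and the `library_id` vector are made up for illustration:

```r
library(edgeR)

set.seed(1)

#/ Simulated counts (assumption): three control libraries from the first run,
#/ where the third is shallow, plus a second run of that same third library.
n_genes <- 1000
mu    <- rgamma(n_genes, shape = 0.5, scale = 200)  # hypothetical gene means
run1  <- sapply(c(10, 11, 1.3), function(d) rnbinom(n_genes, mu = mu * d, size = 10))
rerun <- rnbinom(n_genes, mu = mu * 8, size = 10)

counts <- cbind(run1, rerun)
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- c("ctrl_1", "ctrl_2", "ctrl_3_run1", "ctrl_3_run2")

#/ One ID per column; columns sharing an ID are summed gene by gene:
library_id <- c("ctrl_1", "ctrl_2", "ctrl_3", "ctrl_3")
pooled <- sumTechReps(counts, ID = library_id)

#/ Library sizes before and after pooling:
print(colSums(counts))
print(colSums(pooled))
```

The pooled matrix then goes into `DGEList` and TMM as usual. It is still worth checking on a PCA that the two runs of that library sit close together before summing them.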