Low read counts in one of three biological replicates. Remove?
Entering edit mode
3.0 years ago
nadal-t ▴ 20

Hi folks,

I am processing RNA-Seq data (2 plant genotypes collected over 3 time points). I have 2 conditions -- control and treatment for each genotype and each condition has 3 biological replicates. For 2 out of 3 time points, I notice one biological replicate either in control or treatment has significantly low read counts compared to the other replicates of the same condition (e.g. 1-3 millions vs . 9-12 millions). I normalize the data with TMM before calling DEGs using edgeR, which I think it should handle the differences in read counts. However, no. of DEGs is almost double when I repeat the analysis without the sample with low-read count (110 vs 210 DEGs).

I am considering whether I should remove the samples with low read count from the analysis. The downside is I would have 2 biological replicates left for DEGs. Would you mind sharing your thoughts or suggestions?

Thanks a lot in advance!

RNA-Seq • 1.1k views
Entering edit mode
3.0 years ago
ATpoint 83k

A first diagnostic is to perform a PCA to see how data cluster. If then there is evidence that the low-depth sample clusters away from the other members of its condition then it might be a good idea to remove it. You basically check whether there is a data-driven reason to exclude a sample, and clustering based on read depth that cannot be compensated by normalization would be a reason for exclusion, as the difference between samples would be technical and not biological, the latter which you are usually interested in.

Very minimal code example with PCAtools and dummy data:


y <- sapply(1:9, function(x) rnorm(10000, 1000, 50))
logcpms <- cpm(y, log=TRUE)
colnames(logcpms) <- paste0("sample", 1:ncol(y))

#/ PCA with top-500 most variable genes:
pca1 <- pca(mat = logcpms[head(order(rowVars(logcpms), decreasing=TRUE), n=500),],
            metadata = data.frame(row.names = colnames(logcpms)))

#/ plot it:

enter image description here

After all, 3 is better than 2 so if only the depth is the issue, can't you get some more reads for the library so just sequencing the same sample again? The differene between n=2/n=3 seems to be notable as you describe so I guess trying to keep that one sample would be worth ivesting some $ into sequencing it a bit deeper. If you have a local facility maybe then can spike it into an existing run so the costs would be moderate.

Entering edit mode

Hi ATpoint,

Thanks a lot for sharing your suggestion and code. I agree that resequencing of the low-read samples is the best, but sadly they are old samples and we no longer have them. I just have to try to get the best out of the existing data. :(

I tried your code with data from the two time points in doubt. The samples in question is sample 5 for the first time point (PCA1=21.92%) and sample 3 for the second time point (PCA1=22.65%). For both time points, sample 1-3 are control and 4-6 are treated samples.

In my inexperience eyes, they look OK for DEG callings. Would you mind sharing your opinion? Thanks again and have a great day!

enter image description here enter image description here


Login before adding your answer.

Traffic: 3024 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6