Question

RNA-seq library size - significant sample discrepancy

1

15 months ago

Luca ▴ 20

Hello,

I've been given some data to perform differential expression analysis on, and in the process of QCing the resulting count data, I'm seeing that the library sizes differ considerably between the 2 samples shown below. I know a good Illumina run generates between 10-40 million reads, but is it normal for such runs to produce starkly different total read counts like this? I.e., is this an acceptable library size?

I have run PCA on this particular grouping, found that P70F20 is a significant outlier, and removed it, so I'm also curious how much of that variability is potentially attributable to the library size. I believe DESeq uses TPM normalization, and that should control for this difference?

Any help is appreciated; I have never seen this magnitude of difference in a single grouping before. FastQC looked perfect as well, adapters were trimmed with cutadapt, and alignment and counting were done using the Rsubread package in R.

Thanks!

Barplot of Library Sizes, with anticipated 20 million reads as hline

RNA-seq R DESeq2 • 1.4k views

1

15 months ago

swbarnes2 14k

No, DESeq doesn't do TPM. Since it gets gene counts, and not transcript counts, I don't see how TPM would be relevant.

DESeq2 does do a library size normalization. A 2-fold difference in library size is no big deal. A 10-fold difference might be more of a problem.
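To make the library size point concrete: DESeq2's default library size normalization is the median-of-ratios method. Here is a minimal pure-Python sketch of that idea on made-up counts. It is not DESeq2's code; the function name `size_factors` and the toy matrix are assumptions for illustration only, and in practice you would just let DESeq2 estimate the size factors for you.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq2 itself).

    counts: list of genes, each a list of raw counts, one per sample.
    Genes with a zero in any sample are skipped because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # one factor per sample: the median ratio to the per-gene reference
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: the second sample is the first one
# sequenced twice as deep.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # has a zero, so it is ignored when estimating the factors
]

factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]

print("size factors:", factors)
print("normalized counts:", normalized)
```

In this toy case the two size factors differ by a factor of 2, and dividing by them puts the non-zero genes on the same scale in both samples, which is why a 2-fold depth difference on its own is usually harmless.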

0

Sorry, not TPM, that is irrelevant, I just meant to say they have an internal normalization. Thanks for the help!

Reply • 15 months ago by Luca ▴ 20

score 4 · Accepted Answer · 2023-01-21

4

15 months ago

ATpoint 82k

Differences in depth are not per se a problem. They only become a problem when depth is so low that many genes have zeros (dropouts) due to under-sequencing. Zeros will remain zeros, regardless of the normalization method. Usually you run PCA first to see whether this sample manifests as an outlier. If so, you can either remove it or downweight its influence. The latter is implemented in the limma package with the voomWithQualityWeights function.
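As a quick check of the "zeros remain zeros" point, here is a small pure-Python sketch on made-up counts. The helper names `zero_fraction` and `scale_to_depth` and the toy matrix are hypothetical, and the simple total-count scaling only stands in for whatever normalization you actually use.

```python
def zero_fraction(counts):
    """Fraction of genes with a zero count, per sample."""
    n_genes = len(counts)
    n_samples = len(counts[0])
    return [
        sum(1 for gene in counts if gene[j] == 0) / n_genes
        for j in range(n_samples)
    ]


def scale_to_depth(counts, target):
    """Rescale every sample to the same total (simple total-count scaling)."""
    n_samples = len(counts[0])
    totals = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    return [[c * target / t for c, t in zip(gene, totals)] for gene in counts]


# Hypothetical toy data: column 0 is a deep sample, column 1 a shallow one.
counts = [
    [120, 6],
    [40, 2],
    [8, 0],
    [4, 0],
    [28, 2],
]

before = zero_fraction(counts)
after = zero_fraction(scale_to_depth(counts, target=200))

print("zero fraction before scaling:", before)
print("zero fraction after scaling: ", after)
```

Scaling changes the magnitudes but not which entries are zero, so a per-sample zero fraction like this is a cheap way to see whether a shallow library is losing genes rather than just being smaller.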

1

Hello,

Thank you so much for the reply! There ended up being a huge unwanted batch effect from a particular sample prep. Removing that bad batch vastly improved the PCA clustering and downstream DEGs. I used limma::removeBatchEffect coupled with a PCA to locate these bad samples. I did not share the whole library size barplot, just the one for a select group, but all the samples in this "bad" batch had counts > 4e7, so something was wrong there. Looking through our server I did find an updated run of F24, and the counts were brought up to ~2e7, so problem solved! The voom function was very helpful in this, thanks!

Reply • 15 months ago by Luca ▴ 20