Hello,
I've been given some data to perform differential expression analysis on, and in the process of QCing the resultant count data, I'm seeing pretty big discrepancies in library size between the 2 samples shown below. I know a good Illumina run generates between 10 and 40 million reads, but is it normal for such runs to produce starkly different total read counts like this? In other words, is this an acceptable library size?
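For context, here's roughly how I'm checking the library sizes (a minimal sketch; counts is a placeholder for my gene-by-sample count matrix):

# total assigned reads per sample
lib_sizes <- colSums(counts)

# barplot of library sizes in millions of reads
barplot(lib_sizes / 1e6,
        las = 2,
        ylab = "Library size (millions of reads)")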
I have run PCA on this particular grouping, found that P70F20 is a significant outlier, and removed it, so I'm also curious how much of that variability is potentially attributable to the library size. I believe DESeq2 normalizes with median-of-ratios size factors rather than TPM; shouldn't that control for this difference?
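To make the question concrete, this is the kind of check I mean (a sketch; counts, sample_info, and the condition column are placeholders for my own objects):

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = sample_info,
                              design = ~ condition)

# median-of-ratios normalization: size factors should absorb
# global differences in sequencing depth
dds <- estimateSizeFactors(dds)
sizeFactors(dds)

# PCA on variance-stabilized counts, to see whether the outlier
# tracks library size
vsd <- vst(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")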
Any help is appreciated; I have never seen this magnitude of difference within a single grouping before. FastQC reports were clean, adapters were trimmed with cutadapt, and alignment and counting were done using the Rsubread package in R, roughly as sketched below.
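For completeness, the alignment and counting step looked roughly like this (a sketch; the index name, FASTQ files, and GTF path are placeholders):

library(Rsubread)

# build the index once from the reference genome
buildindex(basename = "ref_index", reference = "genome.fa")

# align the cutadapt-trimmed reads (paired-end assumed here)
align(index = "ref_index",
      readfile1 = "sample_R1.trimmed.fastq.gz",
      readfile2 = "sample_R2.trimmed.fastq.gz",
      output_file = "sample.bam")

# count reads per gene against the annotation
fc <- featureCounts(files = "sample.bam",
                    annot.ext = "annotation.gtf",
                    isGTFAnnotationFile = TRUE,
                    isPairedEnd = TRUE)
counts <- fc$counts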
Thanks!
Hello,
Thank you so much for the reply! There ended up being a huge unwanted batch effect from a particular sample prep. Removing that bad batch vastly improved the PCA clustering and the downstream DEGs. I used limma::removeBatchEffect coupled with PCA to locate the bad samples (sketched below). I didn't share the whole library size barplot, just a select group, but all the samples in this "bad" batch had counts > 4e7, so something was wrong there. Looking through our server, I did find an updated run of F24 whose counts were brought into the ~2e7 range, so problem solved! The voom function was very helpful in this, thanks!
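In case it helps anyone who finds this later, the batch check was along these lines (a sketch; counts, group, and batch are placeholders for my count matrix and sample annotations):

library(edgeR)
library(limma)

design <- model.matrix(~ group)

# voom: log2-CPM with precision weights
dge <- calcNormFactors(DGEList(counts = counts))
v <- voom(dge, design)

# remove the batch effect for visualization only
# (for the actual DE fit, batch belongs in the design matrix)
corrected <- removeBatchEffect(v$E, batch = batch, design = design)

# PCA on the corrected values to spot the bad samples
pca <- prcomp(t(corrected))
plot(pca$x[, 1], pca$x[, 2], col = as.factor(batch),
     xlab = "PC1", ylab = "PC2")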