Hello,
I'm processing a bulk RNa-seq dataset which was aligned using STAR, counted using Salmon. The quant.sf files were then processed by tximport into a txi object, with countsFromAbundance = "no".
A dds was created using DESeqDataSetFromTximport, then the normal DESeq2 analysis was performed. I noticed, however, an enormous amount of Differentially expressed genes:
out of 56570 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 4552, 8%
LFC < 0 (down) : 29409, 52%
outliers [1] : 0, 0%
low counts [2] : 3305, 5.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Also, when looking at the MA plots, the shrinkage using apeglm seems to shrink every non-significantly differentially expressed gene to 0.
plotMA(res, alpha = FDR, main = paste0("Unshrunken MA-Plot (FDR = ", FDR, ")"), ylim = c(-5,5) )
plotMA(resLFC, alpha=FDR, main = paste0("MA-plot (FDR = ", FDR, ")"), ylim = c(-5,5) )
I noticed many of the genes have a normalized count < 1.
sum(rowMeans(counts(full_dds, normalized = T))<1)
--> 11,474
Here's the corresponding pca plot (red: ConditionA, Green Condition B)
Here are the raw counts per condtion
Could these characteristics be due to the pseudo-aligner counts & the use of tximport? Would anyone be able to help me with the interpretation of this result?
Thanks in advance
Thanks for the speedy response. It's a comparison of the same type of cancer (tissue) in human with different inflammation characteristics. Differences are to be expected, but not as violent as here. There's no matching between patients possible. I added the PCA plot & a violinplot with raw reads in the original post. Filtering applied is
rowSums(counts(dds)) > 1
. I was under the impression that low counts would not contribute a lot to the analysis. I'll rerun the analysis withdds[rowSums(counts(dds, normalized = TRUE) >=5) >=3]
, which already filters out 9,424 genes.It seems A has systematically lower depth. You can color the PCA by depth and see whether that separation in early PCs is due to depth. If so you would probably need to resequence A or do some sort of subsampling normalization to downscale B. Is this A vs B in the results? If so then most downregulated genes are probably due to lower depth/dropouts in A rather than real expression differences.
This seems to be the case! Would you have an idea why differences in depth are not accounted for by sizeFactors (in this case normalizationFactors?)
Check the counts. If it is zero in A then any size factor won’t change that. It is probably close to zero or zero in some or all samples of A. That would be dropouts. 0*anything is always zero so the normalization by scaling has no effect. You can try to subsample the deeper sample (downsampleMatrix from the scuttle package can do that) or resequence to greater depth.
Thank you for the suggestion. However, there are no zero counts in the data.
rowMeans(counts(full_dds, normalized = T))
results in 11,474 values between 0 & 1. I suppose this is due to the tximport with countsFromAbundance = "no". Moreover, Michael Love suggests here not to downsample, since the sizeFactors should handle the difference in library sizes. Does this not apply to counts generated by pseudoaligners like Salmon?Your plots show that size factors do not take care of it though, depth is the biggest confounder. Try subsampling, and see whether this improves PCA, that is my suggestion. Others may have different strategies.