We're working on an RNA-Seq data set in which we used ERCC spike-ins to control for transcriptional amplification. Running the analysis in the standard way gave us relatively few significantly regulated genes (~70); after adjusting the normalization according to the spike-ins, we got over 2000 genes with a significant adjusted p-value.
We first calculated the size factors on the spike-in subset only and assigned them to the DESeq object before running the analysis.
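For reference, the size-factor step we mean is the DESeq-style median-of-ratios computed on the spike-in rows only. A minimal sketch of that calculation in numpy (the function name and the ERCC count matrix here are made up for illustration; in practice this is what `estimateSizeFactors` does on the subset):

```python
import numpy as np

def spikein_size_factors(counts):
    """DESeq-style median-of-ratios size factors.
    counts: spike-ins x samples array of raw ERCC counts."""
    logc = np.log(counts.astype(float))
    # keep only spike-ins with no zero count in any sample
    finite = np.isfinite(logc).all(axis=1)
    # pseudo-reference: per-spike-in geometric mean across samples
    log_geo_means = logc[finite].mean(axis=1)
    # size factor = median ratio of each sample to the pseudo-reference
    return np.exp(np.median(logc[finite] - log_geo_means[:, None], axis=0))

# made-up ERCC counts: 3 spike-ins x 6 samples (ctrl1-3, treat1-3)
ercc = np.array([[110,  95, 130,  60,  70,  55],
                 [520, 480, 600, 300, 330, 280],
                 [ 40,  36,  50,  22,  25,  19]])
sf = spikein_size_factors(ercc)
```

Dividing each sample's counts by its size factor then rescales the library toward the spike-in content rather than the total mRNA content.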
I have a question about the number of reads we are getting after the ERCC normalization. When calculating the sum of reads before and after the ERCC normalization, we get the following numbers:
sample    before         after
ctrl1     24762847.36    33353900.56
ctrl2     24973604.51    30803080.63
ctrl3     24727427.27    22561350.50
treat1    24875474.53    18880174.62
treat2    24911409.33    23560508.28
treat3    24588778.00    21308227.81
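To make explicit how such an "after" column comes about: the normalized counts are the raw counts divided by each sample's size factor, and the column sums of that matrix give the per-sample totals. A tiny sketch with made-up numbers (the matrix and factors are illustrative, not our data):

```python
import numpy as np

# made-up raw counts (genes x samples) and spike-in size factors
raw = np.array([[100., 200., 300.],
                [400., 500., 600.]])
size_factors = np.array([0.8, 1.0, 1.25])

normalized = raw / size_factors         # DESeq-style per-sample scaling
totals_before = raw.sum(axis=0)         # per-sample raw totals
totals_after = normalized.sum(axis=0)   # per-sample normalized totals
```

A size factor above 1 shrinks a sample's totals, one below 1 inflates them, which is exactly the pattern in our table.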
As you can see, there is a clear trend of fewer reads in the treated samples after taking the spike-ins into account and correcting for the transcriptional bias. This is perfectly in line with our expectations.
Our assumption is that this is because there are fewer polyA transcripts in the treated samples (an assumption which we would like to verify).
Therefore I have two questions:
First, are we correct in assuming that the change in the number of reads is due to fewer polyA transcripts in the treated samples?
Second, is there a statistical way to quantify this difference in read numbers and attach some kind of significance measure to it, so that we can say how reliable these results are?
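One simple approach we have been considering, sketched here for concreteness (not claiming it is the right test, just illustrating the idea): with 3 samples per group, an exact permutation test on the per-sample normalized totals enumerates all C(6,3) = 20 group assignments and asks how often the group-mean difference is at least as extreme as the observed one. The totals are copied from the table above:

```python
from itertools import combinations
import numpy as np

# per-sample totals after spike-in normalization (from the table above)
totals = {"ctrl1": 33353900.56, "ctrl2": 30803080.63, "ctrl3": 22561350.50,
          "treat1": 18880174.62, "treat2": 23560508.28, "treat3": 21308227.81}
vals = np.array(list(totals.values()))
is_ctrl = np.array([s.startswith("ctrl") for s in totals])

obs = vals[is_ctrl].mean() - vals[~is_ctrl].mean()  # observed difference

# enumerate all 20 ways of assigning 3 of the 6 samples to "control"
count = 0
n_perm = 0
for idx in combinations(range(6), 3):
    mask = np.zeros(6, dtype=bool)
    mask[list(idx)] = True
    diff = vals[mask].mean() - vals[~mask].mean()
    if abs(diff) >= abs(obs) - 1e-6:  # two-sided, tolerance for fp rounding
        count += 1
    n_perm += 1
pval = count / n_perm
```

With only 3 vs. 3 samples the smallest achievable two-sided p-value is 2/20 = 0.1, so this mainly tells us whether the trend could be noise rather than giving a strong significance statement.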