I need some input on normalizing RNA-Seq data with spike-ins and then using DESeq to find differentially expressed genes (DEGs) between samples. I have 7 samples, of which 4 are from tumor peripheries and 4 are from tumor centers. I want to normalize the raw fragment counts (the input DESeq expects) with the spike-ins and then compute the DEGs from them. My sample data set looks like this:
head(m)
            Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0
XLOC_000001          1534           2603         1764           1057            2889            3830          1684
XLOC_000002           175            304          208            144             428             367           222
XLOC_000003            80            195          109            916            2515            2314          1082
XLOC_000004            49             66           54             51             127             219            94
XLOC_000005             0              0            0              0               0               0             0
XLOC_000006             0              1            0              0               0               0             0
spike-in data set
head(sp)
           Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0
ERCC-00009            49             66           54             51             127             219            94
ERCC-00025             9              7            6              5              14              21             8
ERCC-00031             0              0            0              0               1               1             0
ERCC-00034             1              3            2              0               6               6             4
ERCC-00035             5              7            7              9              32              38            21
ERCC-00042            43             78           56             73             202             199            98
I am using the spike-ins from sub-category B, which have equal concentrations, so that consistency is maintained.
Now I want to use this in DESeq.
So what is the best way to apply this normalization to my RNA-Seq data, create the count data set object (newCountDataSet), then estimate the size factors and the dispersion (per-gene variance), and finally get the differentially expressed genes? Does anybody have any ideas? It would be helpful if anyone who has handled such a scenario could give me some pointers.
I'm assuming that you want to use the spike-ins simply for the size normalization, rather than estimating dispersion, correct? If so, you can actually manually set the size factors.
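To make the idea of spike-in-based size factors concrete, here is a minimal sketch of the median-of-ratios calculation, written in plain Python rather than R so it runs without any packages. It only uses the six spike-in rows shown by head(sp) above; `size_factors` is a hypothetical helper name, not a DESeq function. In DESeq itself, the equivalent step should be computing the factors from the spike-in matrix with `estimateSizeFactorsForMatrix(sp)` and assigning them with `sizeFactors(cds) <- ...`, but check that against the documentation for your installed version.

```python
import math
from statistics import median

# Spike-in counts copied from head(sp) above; columns are the seven samples.
SAMPLES = ["Sample_118p.0", "Sample_132p2.0", "Sample_91p.0", "Sample_118rz.0",
           "Sample_132rz1.0", "Sample_132rz2.0", "Sample_91rz.0"]
SPIKE_COUNTS = {
    "ERCC-00009": [49, 66, 54, 51, 127, 219, 94],
    "ERCC-00025": [9, 7, 6, 5, 14, 21, 8],
    "ERCC-00031": [0, 0, 0, 0, 1, 1, 0],
    "ERCC-00034": [1, 3, 2, 0, 6, 6, 4],
    "ERCC-00035": [5, 7, 7, 9, 32, 38, 21],
    "ERCC-00042": [43, 78, 56, 73, 202, 199, 98],
}


def size_factors(counts_by_feature, n_samples):
    """Median-of-ratios size factors (hypothetical helper, not a DESeq call)."""
    log_ratios = [[] for _ in range(n_samples)]
    for counts in counts_by_feature.values():
        if min(counts) <= 0:
            continue  # a zero makes the geometric mean zero, so skip the row
        log_geo_mean = sum(math.log(c) for c in counts) / n_samples
        for j, c in enumerate(counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no feature has nonzero counts in every sample")
    # Per sample: median ratio of its counts to the per-feature geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


factors = size_factors(SPIKE_COUNTS, len(SAMPLES))

# Normalize one gene row from head(m) by dividing by the per-sample factors.
xloc_000001 = [1534, 2603, 1764, 1057, 2889, 3830, 1684]
normalized = [c / f for c, f in zip(xloc_000001, factors)]

for name, f in zip(SAMPLES, factors):
    print(f"{name}\t{f:.3f}")
```

Only four of the six rows shown have nonzero counts in every sample, so factors computed from this excerpt alone would be noisy; the full spike-in table should be used in practice.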
Thanks. I have been able to normalize my RNA-Seq data with the spike-ins. I then used estimateDispersions to calculate the per-gene variation and the negative binomial test to find the DEGs. However, owing to the high complexity of my data set, I cannot use the DESeq results: the genes fail the multiple testing correction, and I don't see how I can select genes on the basis of uncorrected p-values alone. There is another problem: for some comparisons the read counts in one group are all 0, so the mean is also 0, and the resulting fold change cannot be considered even when it is statistically significant. Has anyone faced such a situation? I have already tried RankProd and Cuffdiff (the results were not good downstream). I am now trying DESeq and don't know what to do next.
I'm not sure how complex your dataset really is; it sounds fairly straightforward. The presence of 0 counts isn't that uncommon, though you're most likely to see them when the counts overall are quite low, so those genes will generally have crappy p-values. Without seeing more of your data or any plots, it's rather difficult to give you advice on how to proceed. In general, DESeq can deal with the 0-count scenario, but as you mentioned, the fold change is not always the best metric to go by.
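As an illustration of why the raw fold change misbehaves when a group mean is 0, here is a small Python sketch that adds a pseudocount before taking the log ratio. This is a common workaround, not what DESeq does internally; the function name `log2_fold_change` and the pseudocount of 1 are assumptions made for the example. The groups are split by the "p" and "rz" column suffixes, and raw counts from head(m) are used only to keep the example short (normalized counts would be used in practice).

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of (mean of group_b / mean of group_a), with a pseudocount added.

    Hypothetical helper: the pseudocount keeps the ratio finite when either
    group mean is 0. It is an illustrative choice, not DESeq's method.
    """
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Rows from head(m): the three "p" columns versus the four "rz" columns.
xloc_000003 = log2_fold_change([80, 195, 109], [916, 2515, 2314, 1082])
xloc_000006 = log2_fold_change([0, 1, 0], [0, 0, 0, 0])

print(f"XLOC_000003 log2FC: {xloc_000003:.2f}")
print(f"XLOC_000006 log2FC: {xloc_000006:.2f}")
```

With the pseudocount, a gene such as XLOC_000006, which has essentially no reads, gets a small finite fold change instead of an undefined or infinite one, which makes it easier to rank genes sensibly or to filter out very low-count genes before testing.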
What have you tried so far? Where are you getting stuck in the analysis?