I'm doing a differential expression analysis with DESeq2 on a large human dataset containing multiple conditions, with three biological replicates each. From the quality control (FastQC) I can see that one replicate had some kind of technical issue: an unusual GC content curve, overrepresented sequences (mainly poly-A), and a drop in quality score from 40 bp to the end of the read (although 75% of the values always remain in the 'green' area). This is also reflected in a low percentage (around 68%) of aligned reads (STAR). During data exploration, the density plot clearly shows that this replicate has a different read distribution from the other two replicates of its condition:
At this point I ran the DEA with DESeq2 twice: once including the 'blue' sample and once excluding it. I noticed two things that I cannot fully explain:
In the complete analysis (i.e. including the outlier sample), after normalization, the PCA (on the 1000 genes with the largest variance, out of 60000) shows the flagged replicate clustering well with the other two of its condition, but the heatmap (Euclidean distance) computed on the entire dataset shows the 'blue' sample as completely different from all the other samples (all replicates of all conditions). (If I plot a heatmap using only the genes with the highest variance, it goes back to clustering with the other two replicates of its condition; this holds for anywhere from 500 to 20000 genes, after which it starts to dissociate and move further away.)
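To make the two views concrete, here is a minimal sketch in plain Python with made-up toy values. It is not the DESeq2/R workflow, and the function names, gene names and numbers are all hypothetical. It ranks genes by variance across samples and then computes Euclidean sample distances, either on the top-variance genes only or on all genes.

```python
import math
from statistics import pvariance


def top_variance_genes(expr, n):
    """Return the names of the n genes with the largest variance across samples."""
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n]


def sample_distances(expr, genes, n_samples):
    """Euclidean distance between every pair of samples, using only `genes`."""
    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            squared = sum((expr[g][i] - expr[g][j]) ** 2 for g in genes)
            dist[i][j] = math.sqrt(squared)
    return dist


# Hypothetical toy data (already normalized/transformed values).
# Columns: A1, A2, A3 (the suspicious replicate), B1.
expr = {
    "gene_hi_var": [10.0, 10.2, 10.1, 2.0],  # separates condition A from B
    "gene_low_1": [5.0, 5.0, 6.5, 5.0],      # small shift only in A3
    "gene_low_2": [3.0, 3.0, 4.5, 3.0],      # small shift only in A3
    "gene_low_3": [7.0, 7.0, 8.5, 7.0],      # small shift only in A3
}

top_genes = top_variance_genes(expr, 1)
d_top = sample_distances(expr, top_genes, 4)    # PCA-like view: top-variance genes only
d_all = sample_distances(expr, list(expr), 4)   # heatmap-like view: every gene

# Distance between A1 (index 0) and A3 (index 2) in each view.
print(top_genes, round(d_top[0][2], 3), round(d_all[0][2], 3))
```

In this toy example the A1 to A3 distance is small when only the top-variance gene is used and grows once the low-variance genes are added, because many small shifts accumulate in a Euclidean distance. Whether that is what happens in the real data is only an assumption.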
When I look at the differentially expressed genes between two conditions (one of which contains the outlier replicate), I find 241 DEGs in the complete analysis (including the outlier) and 170 when I exclude it (139 are shared between the two analyses). Shouldn't including it increase the variability and therefore lower the number of DEGs?
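For reference, the overlap arithmetic behind those numbers can be sketched with sets in Python. The gene IDs below are placeholders generated only to reproduce the counts (241, 170, 139 shared), not real DEG lists, and `overlap_summary` is a hypothetical helper.

```python
def overlap_summary(complete, reduced):
    """Count shared and analysis-specific DEGs between two result sets."""
    return {
        "shared": len(complete & reduced),
        "only_complete": len(complete - reduced),
        "only_reduced": len(reduced - complete),
    }


# Placeholder IDs chosen so the set sizes match the counts in the post.
complete = {f"gene_{i}" for i in range(0, 241)}    # 241 DEGs, outlier included
reduced = {f"gene_{i}" for i in range(102, 272)}   # 170 DEGs, outlier excluded

summary = overlap_summary(complete, reduced)
print(summary)
```

So 102 DEGs appear only when the outlier is included, and 31 appear only when it is excluded.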
Can anybody explain why? Any hints on how to proceed?
Thanks guys, Fabio