Question

Strange results, in terms of DEGs, when including an outlier sample (DEA with DESeq2): does anyone know why?

6.8 years ago

fab.sands • 0

Hi everybody,

I'm doing a differential expression analysis using DESeq2 on a large human dataset containing multiple conditions, with three biological replicates each. From the quality control (FastQC) I can see that one replicate has had some sort of technical issue: a weird GC content curve, overrepresented sequences (mainly poly-A), and a drop in the quality score from 40 bp until the end (although 75% of the values always remain in the 'green' area). This is also reflected in a low percentage (around 68%) of aligned sequences (STAR). During data exploration, looking at the density plot, we can clearly see that its read distribution differs from that of the other two replicates of its condition:

Density plot

At this point I ran the DEA with DESeq2 twice: once including the 'blue' sample and once excluding it. I then noticed two things that I cannot fully explain:

In the complete analysis (i.e. including the outlier sample), after normalization, in the PCA (on the 1000 genes with the largest variance out of 60000) the flagged replicate clusters well with the other two of its condition. However, the heatmap (Euclidean distance) built on the entire dataset shows that the 'blue' sample is completely different from all the other samples (all replicates of all conditions). (If I plot a heatmap using only the genes with the highest variance, it goes back to clustering with the other two replicates of its condition; this holds in a range from 500 to 20000 genes, after which it starts to dissociate and move further away.)
When I look at the differentially expressed genes between two conditions (one of which contains the outlier replicate), I find 241 DEGs in the complete analysis (i.e. including it) and 170 if I exclude it (139 are shared between the two analyses). Shouldn't including it in the analysis increase the variability and therefore lower the number of DEGs?
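
To put numbers on the overlap, the three counts above can be broken down as follows. This is only a minimal sketch in Python that uses the reported counts; the variable names are made up for illustration.

```python
# Counts reported above.
n_with_outlier = 241     # DEGs when the outlier replicate is included
n_without_outlier = 170  # DEGs when the outlier replicate is excluded
n_shared = 139           # DEGs called in both analyses

# Genes called in only one of the two analyses.
only_with = n_with_outlier - n_shared
only_without = n_without_outlier - n_shared

# Size of the union and Jaccard similarity of the two DEG lists.
union = n_shared + only_with + only_without
jaccard = n_shared / union

print(f"only with outlier:    {only_with}")
print(f"only without outlier: {only_without}")
print(f"union: {union}, Jaccard similarity: {jaccard:.3f}")
```

So the net difference of 71 (241 - 170) hides 102 genes called only when the outlier is included and 31 called only when it is excluded.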

Can anybody explain why? Any hints on how to proceed?

Thanks guys, Fabio

rna-seq DESeq2 DEA outlier • 1.7k views

ADD COMMENT • link 6.8 years ago by fab.sands • 0

The whys I do not fully understand, but as for the how, for me it is clear you should remove the outlier. I am of this opinion not because the DGE analysis results differ with or without this sample, but because there are enough quality control red flags (GC content, over-represented sequences, mapping rate, density plot). Some argue you need more than 3 replicates per treatment (at least 5) to recognize outliers, but I think this applies to recognizing outliers from the statistical results, not to pre-analysis quality control.

My attempt at the whys: although this sample is problematic, I suppose the signal from the differentially expressed genes is stronger than the noise from this sample, so a heatmap / PCA with the differentially expressed genes shows the samples clustering together, whereas over all genes the noise beats the signal. I would consider the 71 additional (241-170) differentially expressed genes to be false positives: genes called significant mainly because of the noise from the outlier sample.
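
A toy simulation can illustrate this kind of effect. The sketch below is in Python and uses made-up numbers, not the data from the question: the gene counts, shift, and noise levels are all assumptions chosen only to make the pattern visible. One replicate of condition A gets extra noise on every gene, and Euclidean distances are compared over all genes and over the highest-variance genes only.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the toy example is reproducible

# All values below are illustrative assumptions, not taken from the question.
N_GENES = 2000   # total number of genes
N_DE = 50        # number of truly differential genes (the first N_DE indices)
SHIFT = 4.0      # log-scale shift of the differential genes in condition B
SD_GOOD = 0.2    # per-gene noise of a good replicate
SD_BAD = 1.0     # per-gene noise of the problematic replicate


def simulate(condition_shift, sd):
    """Return one sample as a list of per-gene values on a log-like scale."""
    return [
        (condition_shift if g < N_DE else 0.0) + random.gauss(0.0, sd)
        for g in range(N_GENES)
    ]


samples = {
    "A1": simulate(0.0, SD_GOOD),
    "A2": simulate(0.0, SD_GOOD),
    "A3_outlier": simulate(0.0, SD_BAD),
    "B1": simulate(SHIFT, SD_GOOD),
    "B2": simulate(SHIFT, SD_GOOD),
    "B3": simulate(SHIFT, SD_GOOD),
}


def distance(x, y, genes):
    """Euclidean distance between two samples over the given gene indices."""
    return math.sqrt(sum((samples[x][g] - samples[y][g]) ** 2 for g in genes))


all_genes = range(N_GENES)

# Rank genes by their variance across the six samples and keep the top N_DE.
variance = [
    statistics.variance([s[g] for s in samples.values()]) for g in all_genes
]
top_genes = sorted(all_genes, key=lambda g: variance[g], reverse=True)[:N_DE]

for label, genes in (("all genes", all_genes), ("top-variance genes", top_genes)):
    within = distance("A1", "A3_outlier", genes)
    between = distance("A1", "B1", genes)
    print(f"{label}: A1 vs outlier = {within:.1f}, A1 vs B1 = {between:.1f}")
```

With these assumed settings, the noisy replicate ends up farther from its own condition than the other condition is when all genes are used, but close to its own condition when only the highest-variance genes are used. Real count data behave differently in the details, so this only shows that the pattern is plausible.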

ADD REPLY • link 6.8 years ago by h.mon 35k

I looked more deeply into the statistics, checking the mean, adjusted p-value, and log2FC distributions of the DEGs unique to one analysis (say, the one excluding the sample) in the other pairwise analysis (with all the samples), and vice versa. What I saw was exactly that! Excluding the outlier, I gain DEGs that the problematic sample had excluded just by lowering the log2FC (to values like 1.8, so they fall below my cutoff), but if I include the outlier I gain DEGs that, in the analysis without the sample, have a mean adjusted p-value of 0.3 and a mean log2FC of 0. Thank you a lot for your answer!
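
For anyone wanting to repeat this kind of check, here is a rough sketch in Python. The cutoffs (adjusted p-value below 0.05, absolute log2FC of at least 2) and the three example genes are assumptions for illustration, since the exact thresholds are not stated above; `padj` and `log2FoldChange` are the column names of a DESeq2 results table.

```python
# Assumed cutoffs; the real ones are not given in the thread.
PADJ_CUTOFF = 0.05
LFC_CUTOFF = 2.0


def is_deg(row):
    """Return True if a result row passes both cutoffs (missing padj fails)."""
    padj = row["padj"]
    return (
        padj is not None
        and padj < PADJ_CUTOFF
        and abs(row["log2FoldChange"]) >= LFC_CUTOFF
    )


# Hypothetical per-gene results from the two analyses (made-up values).
with_outlier = {
    "gene_shared": {"padj": 0.001, "log2FoldChange": 3.1},
    "gene_lost": {"padj": 0.002, "log2FoldChange": 1.8},
    "gene_gained": {"padj": 0.010, "log2FoldChange": 2.5},
}
without_outlier = {
    "gene_shared": {"padj": 0.001, "log2FoldChange": 3.0},
    "gene_lost": {"padj": 0.001, "log2FoldChange": 2.3},
    "gene_gained": {"padj": 0.300, "log2FoldChange": 0.0},
}

degs_with = {g for g, row in with_outlier.items() if is_deg(row)}
degs_without = {g for g, row in without_outlier.items() if is_deg(row)}

shared = degs_with & degs_without
only_with = degs_with - degs_without
only_without = degs_without - degs_with

# For genes unique to one analysis, show their statistics in the other one.
for gene in sorted(only_with):
    print(gene, "called only with the outlier; without it:", without_outlier[gene])
for gene in sorted(only_without):
    print(gene, "called only without the outlier; with it:", with_outlier[gene])
```

Looking up each uniquely called gene in the other analysis shows whether it narrowly missed a cutoff or had no signal there at all.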

ADD REPLY • link 6.8 years ago by fab.sands • 0