Question

Microarray outlier - how to deal with it? Downweight? Remove? Re-run everything?

0

7.7 years ago

mismis • 0

Hello dear community,

We recently submitted an experiment with multiple cell lines (control vs. treatment, 3 biological replicates per condition per cell line) for expression profiling using Illumina HT-12 microarrays.

In the original run, we tried to counter batch effects by blocking and randomization. Unfortunately, one of the samples was not processed correctly and was recently re-run as a "filler" sample on an unrelated project.

Now while the QC on this run looks ok on the probe level (hybridization, housekeeping genes), the sample does not really fit in with its peers (please ignore HN038M). The sample in question in the dendrogram below is "Caski T 2".

pasteboard.co/GSoJ5BN.png

The question now is what to do with such an outlier. Should we just go on with the analysis? Are there recommendations for downweighting such cases in limma? Should we omit this sample altogether, which would leave 3 control vs. 2 treatment samples? Or worse, should we re-run all six samples for this cell line?

I really appreciate any input on this - thank you very much in advance!

Simon

microarray Illumina-HT12 outlier limma qc • 2.2k views

ADD COMMENT • link updated 7.7 years ago by Kevin Blighe 89k • written 7.7 years ago by mismis • 0

0

The QC generated with arrayQualityMetrics for the summarized set before normalization is here, including PCA plot and boxplots (non-normalized, looks ok as well after normalization):

QC report

As for the different dendrograms - I'll check tomorrow!

@Kevin Blighe: Thank you very much for your input! :-)

ADD REPLY • link 7.7 years ago by mismis • 0

score 1 · Answer 1 · 2017-11-06

1

7.7 years ago

Kevin Blighe 89k

A quick look at the dendrogram would suggest to 99.99% of us that there are no outliers in your dataset. These 'lone hangers' buried deep in the dendrogram tree structure are commonly observed. However, given your doubts about this sample's history, one thing I'd ask you to do is reproduce the dendrogram with pvclust. That bootstraps the cluster dendrogram, and it would be interesting to see the bootstrap probability for the branch containing that particular sample.

I'd also ask you to reproduce the dendrogram using Euclidean distance and other linkage methods, such as average, single, and Ward's linkage ('ward.D2').
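To illustrate why it is worth comparing linkage methods, here is a minimal sketch in plain Python. It is not what pvclust or R's hclust() do internally; it is a naive agglomerative clustering on made-up toy profiles, and the names `profiles`, `agglomerate` and `first_join_height` are hypothetical. Ward's linkage is left out to keep it short. The idea is to check whether the replicate in question joins the tree at a consistently greater height than its peers, whichever linkage is used.

```python
import math
from itertools import combinations

# Toy log2 expression profiles (made-up values, 4 probes per sample).
profiles = {
    "C1": [8.0, 5.0, 6.0, 7.0],
    "C2": [8.1, 5.1, 6.0, 7.1],
    "C3": [7.9, 4.9, 6.1, 7.0],
    "T1": [9.0, 4.0, 6.0, 7.5],
    "T2": [8.6, 4.6, 6.3, 7.1],  # stands in for the replicate in question
    "T3": [9.1, 4.1, 5.9, 7.4],
}

def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles, linkage="average"):
    """Naive agglomerative clustering.

    Returns the merges in order as (cluster_a, cluster_b, height),
    where each cluster is a tuple of sample names.
    """
    combine = {
        "single": min,                            # closest pair
        "complete": max,                          # farthest pair
        "average": lambda d: sum(d) / len(d),     # mean over all pairs
    }[linkage]
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            dists = [euclidean(profiles[i], profiles[j]) for i in a for j in b]
            height = combine(dists)
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

def first_join_height(merges, name):
    """Height at which a sample first stops being a singleton."""
    for a, b, height in merges:
        if a == (name,) or b == (name,):
            return height

for method in ("single", "complete", "average"):
    merges = agglomerate(profiles, method)
    print(method, round(first_join_height(merges, "T2"), 3))
```

If the join height of one replicate is clearly larger than that of its peers under every linkage, the pattern is probably not an artifact of the clustering method. Whether it is large enough to matter is a separate question.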

Final points:

How does it look on a PCA bi-plot comparing PC1 versus PC2? - I imagine that it groups well with the others
How does a boxplot of normalised intensities look?
How does a histogram of intensities in that sample compare to the others?
How many zero values are in the sample (and over which genes)?
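The boxplot, histogram and zero-value checks can also be boiled down to a few numbers per sample. Below is a rough sketch using only the Python standard library. The data are made up, and `sample_summary`, `flag_samples` and the 0.5 threshold are hypothetical choices for illustration, not a recommended cut-off.

```python
import statistics

# Made-up log2 intensities per sample (5 probes each).
matrix = {
    "A": [6.0, 7.0, 8.0, 9.0, 10.0],
    "B": [6.1, 7.1, 8.1, 9.1, 10.1],
    "C": [0.0, 0.0, 9.5, 10.5, 11.5],
}

def sample_summary(values, floor=0.0):
    """Quartiles, IQR and number of values at or below `floor`."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        "n_at_floor": sum(v <= floor for v in values),
    }

def flag_samples(matrix, max_median_shift=0.5):
    """Names of samples whose median is far from the median of medians."""
    summaries = {name: sample_summary(v) for name, v in matrix.items()}
    centre = statistics.median(s["median"] for s in summaries.values())
    return [
        name
        for name, s in summaries.items()
        if abs(s["median"] - centre) > max_median_shift
    ]

for name, values in matrix.items():
    print(name, sample_summary(values))
print("flagged:", flag_samples(matrix))
```

This only summarises what the plots would show. It does not replace looking at the plots themselves.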

If you see no other major concerns from these various things that I recommend, then I would feel confident continuing with the sample.

ADD COMMENT • link 7.7 years ago by Kevin Blighe 89k

0

Hi, I tried to follow up with your suggestions - the results are below:

QC including PCA plot, boxplots, heatmap, MA-plots (all pre-normalisation): QC report

dendrogram - Euclidean distance, complete linkage

dendrogram - Euclidean distance, single linkage

dendrogram - Euclidean distance, average linkage

dendrogram - Euclidean distance, ward.D2 linkage

dendrogram - pvclust w. probabilities

So in all these dendrograms, Caski T2 still seems to be Caski, but not in the same group as the other treated Caski samples. The PCA plot, however, does not look that bad to my novice eyes... ;-) The thing is that we don't really know what exactly happened to this replicate. It could have happened at treatment time, could be due to storage at the external facility over the last half year, could be due to a different (later) lot of Illumina HT-12 slides, reagent chemistry, who knows! The main question is maybe whether including this sample would distort the results too much for them to be meaningful.

If I read the paper by Ritchie et al. correctly, arrayWeights() in limma could be of use if we go on and include the array in question: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-7-261?site=bmcbioinformatics.biomedcentral.com , right?
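For intuition only, the idea behind array quality weights can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the algorithm from the paper or from limma, which estimates the weights from a heteroscedastic linear model. Here each array simply gets a weight inversely proportional to its mean squared residual from its group mean, rescaled so the weights average 1. The data and the name `toy_array_weights` are made up.

```python
# Made-up log2 expression values: 4 probes per array.
expr = {
    "C1": [8.0, 5.0, 6.0, 7.0],
    "C2": [8.1, 5.1, 6.0, 7.1],
    "C3": [7.9, 4.9, 6.1, 7.0],
    "T1": [9.0, 4.0, 6.0, 7.5],
    "T2": [8.6, 4.6, 6.3, 7.1],  # stands in for the array in question
    "T3": [9.1, 4.1, 5.9, 7.4],
}
groups = {"C1": "ctrl", "C2": "ctrl", "C3": "ctrl",
          "T1": "trt", "T2": "trt", "T3": "trt"}

def toy_array_weights(expr, groups):
    """Simplified array weights: inverse mean squared residual per array.

    Residuals are taken from the per-probe group mean. Weights are
    rescaled so that their arithmetic mean is 1.
    """
    n_probes = len(next(iter(expr.values())))
    sq_resid = {array: 0.0 for array in expr}
    for p in range(n_probes):
        for label in set(groups.values()):
            members = [a for a in expr if groups[a] == label]
            mean = sum(expr[a][p] for a in members) / len(members)
            for a in members:
                sq_resid[a] += (expr[a][p] - mean) ** 2
    # Guard against division by zero for arrays with no residual at all.
    raw = {a: n_probes / max(s, 1e-12) for a, s in sq_resid.items()}
    scale = len(raw) / sum(raw.values())
    return {a: w * scale for a, w in raw.items()}

weights = toy_array_weights(expr, groups)
for array, w in sorted(weights.items()):
    print(array, round(w, 3))
```

An array that deviates more from its group across many probes ends up with a smaller weight, so it contributes less to the fit without being discarded. In limma itself, the weights returned by arrayWeights() would be passed to lmFit() via its weights argument.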

Thank you very much for your valuable input!

Best, Simon

ADD REPLY • link 7.7 years ago by mismis • 0

1

Hi Simon,

Yes, I saw your QC report yesterday - very interesting. In each of the dendrograms, the pattern is also consistent, with this sample hanging 'just' outside the main group of Caski samples.

To be honest, I see no justification for removing the sample from the dataset. If its branch point were located higher in the dendrogram tree structure, then you could justify it, but the fact remains that it is still on the general Caski branch of the tree. In fact, the Caski branch itself segregates at a height of ~80 (on the Euclidean distance, Ward's linkage plot). So, one could actually take the opposing view that the other Caski samples are abnormally similar to each other when compared to other groups.

Nor do I see justification for applying any other form of weighting or adjustment apart from those already implemented in limma, such as the empirical Bayes (eBayes()) moderation of the statistics. I mean, this is biological data and it will never precisely conform to what we expect. There will always be deviations that can possibly only be explained by biological factors that are unknown to us, but this does not necessarily mean that we should correct for them. The array normalisation methods, which are at this point very mature, deal with technical artifacts that may exist in your samples, leaving the biological variation, which the statistical methods then account for.

I would just proceed with the dataset as it currently is, but obviously keep an eye on that sample. As a useful exercise, you could repeat everything with and without the sample in order to see what changes (if anything). Significance values may change slightly because your group sizes then become unbalanced (3 vs. 2), meaning that you may have to apply a more stringent cut-off threshold.
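As a toy illustration of the with/without exercise, the sketch below computes an ordinary pooled two-sample t-statistic for a single probe, once with all replicates and once with one treated replicate dropped. This is not limma's moderated t-statistic, and the values and the name `pooled_t` are made up. It only shows how much a single replicate can move the statistic when groups are this small.

```python
import math
import statistics

def pooled_t(x, y):
    """Ordinary two-sample t-statistic (pooled variance) for y versus x."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(y) - statistics.mean(x)) / se

# Made-up log2 values for one probe.
control = [7.0, 7.1, 6.9]
treated = [8.0, 7.3, 8.1]          # the middle value is the odd replicate
treated_without = [treated[0], treated[2]]

t_with = pooled_t(control, treated)             # 3 vs. 3, 4 degrees of freedom
t_without = pooled_t(control, treated_without)  # 3 vs. 2, 3 degrees of freedom

print("with sample:   ", round(t_with, 2))
print("without sample:", round(t_without, 2))
```

Dropping the replicate shrinks the within-group variance but also costs a degree of freedom, so the two effects have to be weighed against each other. Comparing the full ranked gene lists from both runs is more informative than any single probe.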

Kevin

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k