I am new to microarray analysis and have a question about some of my samples. I have 192 samples collected from 15 human subjects at several time points, which were analyzed on Affymetrix Clariom chips. I used the rma() function in R to normalize the samples and then inspected them with principal component analysis (PCA). My results look like this:
PC1 accounts for ~19% of the variance and PC2 for ~12%. I looked for variables that might explain the PCs, but found nothing other than that the "cloud" on the left consists of samples from 3 distinct subjects (although these subjects also have samples in the right-hand cloud). So at first I did nothing with these results: I did not exclude any samples and proceeded with my downstream analysis.
However, when plotting the log2 expression levels over time, I found that some samples show up as a cloud of "outliers" in a substantial fraction of the transcripts. This figure shows some randomly selected transcripts. In some of them the cloud of outliers is clearly visible (e.g. Gene0417, Gene1448, Gene2202, Gene6601), whereas in others it is not apparent at all:
Just to be clear: The different colours in the gene plots refer to the different subjects that were included in the study. The blue dots in my PCA plot refer to the same samples as those numbered in the gene plots. I marked the blue points manually, so I didn't use a test to see if they are outliers (any advice on how to do this more formally is very welcome).
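One common way to replace manual marking with a rule is a robust (median/MAD-based) modified z-score on the PC1 scores. The sketch below is only an illustration, written in Python although my analysis was done in R. The function name `flag_outliers`, the example scores, and the cutoff of 3.5 are assumptions for illustration, not values from my data. It also assumes the outlying samples are a minority, since the median and MAD break down when outliers make up half the samples or more.

```python
from statistics import median


def flag_outliers(scores, threshold=3.5):
    """Return indices of scores whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median. The 3.5 cutoff is a commonly
    used convention, assumed here, not something derived from the data.
    """
    med = median(scores)
    abs_dev = [abs(s - med) for s in scores]
    mad = median(abs_dev)
    if mad == 0:
        # All (or most) scores are identical; the score is undefined.
        return []
    return [
        i for i, s in enumerate(scores)
        if abs(0.6745 * (s - med) / mad) > threshold
    ]


# Hypothetical PC1 scores for ten samples (made-up numbers).
pc1_scores = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, -9.5, -10.2, 1.0, 0.7]
print(flag_outliers(pc1_scores))
```

The same rule could be applied per principal component, or to a distance computed over the first few components, depending on where the separation shows up.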
My questions are:
What could possibly explain these results?
What should I do with these samples? In my downstream analysis, I suspect that many transcripts show significant changes over time only because of these samples.
Thank you for your help!