Hi all,
My samples look separated along PC2. What would it mean if they were separated along PC1 instead? I think separation on PC2 rather than PC1 means that the difference between sample groups accounts for less of the variation. I have attached my PCA plot.
Also, is it reasonable to remove samples that are mixed (not clearly separated along PC2 in this example)? I think it might give me more realistic results, since I assume the mixed samples are an artifact of library prep. And I have many samples, so even if I remove some, can I still make statistical interpretations?
I have been learning all of this by myself recently, so I am sorry if this is a trivial or nonsensical question.
I hope the BioStars community can help me clarify this!
Yes, I will check those two samples. But my main concern right now is the clustering along PC2. Some of the allergic samples are mixed with control samples in the bottom-left part of the plot. How should I interpret this, and should I remove the allergic samples that are mixed with control samples?
No, you should definitely not remove the allergic samples that are mixed with the control samples on the basis of this plot. You say they are an artifact of library preparation. How do you know this? Were they prepared using a different library prep? To me it just suggests that inter-individual variation is bigger than inter-condition variation.
I didn't know it was an artifact; I only assumed so. Your last note makes the plot much more meaningful for me.
Could keeping the mixed allergic samples mask potential differential expression between the allergic and control groups? In other words, how can that inter-individual variation affect differential expression analysis?
Differential expression analysis works by testing whether the mean of the treatment population could conceivably be the same as the mean of the control population. This relies on knowing how accurately the mean of each sample group represents the mean of its population. To estimate this, we use the inter-individual variation: large inter-individual variation means low confidence in the population mean.
So if your means are 100 (+/- 100) and 200 (+/- 100), you can't say whether they are really different, but if they are 100 (+/- 10) and 200 (+/- 10), then you can.
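To put rough numbers on that, here is a minimal sketch in Python. It is only an illustration under assumptions that are not in the post above: the "+/-" is treated as a standard deviation, each group is given a hypothetical 5 samples, and the comparison uses Welch's t statistic rather than whatever model a real DE tool would fit.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic computed from group summary statistics."""
    # Standard error of the difference between the two group means.
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / standard_error


n = 5  # hypothetical number of samples per group (assumption)

# Same difference in means (100 vs 200), different spread between individuals.
t_noisy = welch_t(100, 100, n, 200, 100, n)  # +/- 100 in each group
t_tight = welch_t(100, 10, n, 200, 10, n)    # +/- 10 in each group

print(f"t with +/- 100: {t_noisy:.2f}")
print(f"t with +/- 10:  {t_tight:.2f}")
```

The difference in means is identical in both cases; only the spread changes, and the statistic grows tenfold when the spread shrinks tenfold.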
But that inter-individual variation is not an artefact. Humans vary a lot from each other, and it is unlikely that your condition affects everyone in the same way.
A simplified situation might make this clear. Let's imagine I am interested in testing whether green-eyed people are taller or shorter than brown-eyed people. I collect 20 people with each eye colour and measure their heights. But I see that the heights of green-eyed people overlap the heights of brown-eyed people. Oh no! I'd better remove the short green-eyed people and the tall brown-eyed people and only compare tall green-eyed people to short brown-eyed people!
We can obviously see that this is wrong, but it is not really any different from removing overlapping samples from a PCA.
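The eye-colour example can be simulated in a few lines. This is a sketch, not a real analysis: the heights are made-up draws from one shared normal distribution (170 cm, SD 10 cm, both numbers are assumptions), so there is no true difference between the groups, and the "overlap removal" rule (keep greens above the pooled median, browns below it) is a hypothetical stand-in for removing mixed samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: both eye colours share the SAME height distribution (cm),
# so any difference found after filtering is spurious by construction.
green = rng.normal(170.0, 10.0, size=20)
brown = rng.normal(170.0, 10.0, size=20)


def welch_t(a, b):
    """Welch's t statistic for two samples."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


# "Remove the overlap": drop short green-eyed and tall brown-eyed people.
cutoff = np.median(np.concatenate([green, brown]))
kept_green = green[green > cutoff]
kept_brown = brown[brown < cutoff]

diff_all = green.mean() - brown.mean()
diff_filtered = kept_green.mean() - kept_brown.mean()

print(f"all samples:      diff = {diff_all:.1f} cm, t = {welch_t(green, brown):.2f}")
print(f"overlap removed:  diff = {diff_filtered:.1f} cm, "
      f"t = {welch_t(kept_green, kept_brown):.2f}")
```

After filtering, every remaining green-eyed person is taller than every remaining brown-eyed person, so the "difference" is manufactured by the filter, not by eye colour.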
I wouldn't worry about the PCA plot too much. You can still find some DE genes even when your samples are completely overlapping; it just means that the condition is not the main thing going on in your transcriptomes. I probably would remove the two massive outliers on PC1: something is clearly odd about those samples, and that axis is not correlated with the condition in question.
Many thanks again for this detailed comment! There is no doubt now.
They might be separated by PC3, who knows. Since the first PC is separating the two outliers from the rest, removing them should change the PCA, and you might then get a separation. Even if not, you shouldn't remove the mixed samples.
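If it helps, here is a rough sketch of that check in Python with NumPy. Everything in it is simulated and hypothetical: the 20 x 200 expression matrix, the size of the condition effect, the two outlier samples at indices 0 and 10, and the rule of flagging the two samples with the most extreme PC1 scores. With real data you would start from your own normalised, log-scale expression matrix and decide on outliers by looking at the plot and the QC metrics.

```python
import numpy as np


def pca(expression):
    """PCA via SVD on a samples x genes matrix.

    Returns the sample scores and the fraction of variance per PC.
    """
    centered = expression - expression.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u * s, s**2 / np.sum(s**2)


rng = np.random.default_rng(1)
n_samples, n_genes = 20, 200
condition = np.repeat([0.0, 1.0], n_samples // 2)  # 0 = control, 1 = allergic

# Simulated data (assumption): noise, a condition effect on 20 genes,
# and two samples shifted on every gene to mimic the PC1 outliers.
expression = rng.normal(0.0, 1.0, size=(n_samples, n_genes))
expression[condition == 1, :20] += 3.0
outlier_idx = np.array([0, 10])
expression[outlier_idx, :] += 8.0

scores_full, explained_full = pca(expression)
# Hypothetical rule: flag the two samples with the most extreme PC1 scores.
flagged = np.argsort(np.abs(scores_full[:, 0]))[-2:]

# Re-run the PCA without the flagged samples and look beyond PC1.
keep = np.setdiff1d(np.arange(n_samples), flagged)
scores_clean, explained_clean = pca(expression[keep])

print(f"PC1 variance with outliers:    {explained_full[0]:.2f}")
print(f"PC1 variance without outliers: {explained_clean[0]:.2f}")
for pc in range(3):
    r = np.corrcoef(scores_clean[:, pc], condition[keep])[0, 1]
    print(f"PC{pc + 1}: |correlation with condition| = {abs(r):.2f}")
```

The per-PC correlation with the condition is just a quick way to see which component, if any, tracks allergic versus control once the outliers no longer dominate PC1.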
Thanks for emphasizing that I shouldn't remove samples. I will check out PC3!