I would like to seek your expert opinions regarding the use of principal component analysis (PCA) in The Cancer Genome Atlas (TCGA) dataset. Specifically, I would like to discuss the issue of sample separation in the PCA plot obtained after normalizing the COAD data using limma or DESeq2. As you may have observed, some samples are separable while others are not, possibly due to factors such as heterogeneity. In this regard, I would appreciate your insights on whether we should exclude these non-separable samples from further processing or include them in our analysis.
No, you should definitely not remove samples on the basis that they do not separate between conditions in a PCA. Doing so would artificially reduce the estimated variance between samples, artificially increase the average log-fold changes, and produce artificially small p-values. The heterogeneity here could reflect technical issues, but it could also reflect genuine biological heterogeneity, and you shouldn't just pretend that it doesn't exist.
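To see why this kind of filtering is circular, here is a minimal simulation sketch in Python (standard library only). Everything in it is made up for illustration: the group names, the sample size and the single "expression" value per sample. Both groups are drawn from the same distribution, so there is no true difference, yet keeping only the samples that "separate" manufactures a large test statistic.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic comparing the means of two lists of values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se


random.seed(0)
n = 50  # hypothetical number of samples per condition

# Both conditions come from the SAME distribution: there is no real effect.
normal = [random.gauss(0.0, 1.0) for _ in range(n)]
tumor = [random.gauss(0.0, 1.0) for _ in range(n)]

# "Remove the non-separable samples": keep the lower half of one group
# and the upper half of the other, i.e. filter on the group difference itself.
normal_kept = sorted(normal)[: n // 2]
tumor_kept = sorted(tumor)[n // 2:]

t_full = welch_t(normal, tumor)
t_filtered = welch_t(normal_kept, tumor_kept)

print(f"t-statistic, all samples:      {t_full:.2f}")
print(f"t-statistic, 'separable' only: {t_filtered:.2f}")
```

The filtered comparison has a larger mean difference and a smaller within-group variance purely because of how the samples were selected, which is the same thing that happens gene by gene if you drop samples that don't separate on the PCA.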
Any outlier removal scheme should be independent of the difference between conditions. DESeq2 has built-in outlier detection and removal on a per-gene basis. On a per-sample basis, you might do something like removing samples whose PC1/PC2 values are more than 3 or 4 standard deviations away from the mean of their own condition.
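As a rough sketch of that per-sample rule, assuming you have already exported the PC1/PC2 scores and a condition label for each sample (the function name, the data layout and the toy numbers below are all hypothetical, not anything from limma or DESeq2):

```python
import statistics
from collections import defaultdict


def flag_outliers(scores, conditions, n_sd=3.0):
    """Return sorted indices of samples whose PC1 or PC2 score lies more than
    n_sd standard deviations from the mean of their OWN condition.

    scores:     list of (pc1, pc2) tuples, one per sample
    conditions: list of condition labels, same order as scores
    """
    by_condition = defaultdict(list)
    for idx, cond in enumerate(conditions):
        by_condition[cond].append(idx)

    flagged = set()
    for idxs in by_condition.values():
        if len(idxs) < 3:
            continue  # too few samples to estimate a spread
        for pc in range(2):  # 0 = PC1, 1 = PC2
            values = [scores[i][pc] for i in idxs]
            mean = statistics.mean(values)
            sd = statistics.stdev(values)
            if sd == 0:
                continue
            for i in idxs:
                if abs(scores[i][pc] - mean) > n_sd * sd:
                    flagged.add(i)
    return sorted(flagged)


# Toy data: 20 "tumor" and 20 "normal" samples; sample 19 is an extreme tumor.
tumor = [(1.0 if i % 2 == 0 else -1.0, 0.5 if i % 3 == 0 else -0.5) for i in range(19)]
tumor.append((50.0, 0.0))
normal = [(-10.0 + (1.0 if i % 2 == 0 else -1.0), 0.5 if i % 3 == 0 else -0.5) for i in range(20)]

scores = tumor + normal
conditions = ["tumor"] * 20 + ["normal"] * 20

flagged = flag_outliers(scores, conditions, n_sd=3.0)
print("Flagged sample indices:", flagged)
```

Note that each sample is compared only with the mean and spread of its own condition, so the rule never looks at how well the conditions separate. One caveat: the outlier itself inflates the standard deviation, and in a group of n samples no sample can be more than (n - 1) / sqrt(n) standard deviations from the group mean, so a 3 SD cutoff cannot flag anything in groups of 10 or fewer.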
You might also try reducing heterogeneity by using something like ComBat or SVA.
Can you upload a picture of the PCA biplot for the benefit of those who have not seen it?