I have a data.frame from an RNA-seq experiment, and I would like to remove some outliers. The data set is large: 350 samples and 32291 genes. The values are log2 RPKM (I applied the log2 transformation because I am planning a WGCNA analysis, and the authors recommend log2-transforming the data).
I am using the `PcaHubert` function from the `rrcov` package to find outliers. Here is the code I am using:
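    library(rrcov)

    df <- read.table("/path/to/file/rpkm.txt")
    dim(df)  # 32291 352
    df <- df[, -c(1, 2)]  # first 2 columns have accessory data

    # PcaHubert expects observations in rows, so transpose to samples x genes
    pcaHub <- PcaHubert(t(df))

    # flag is a logical vector; FALSE marks a sample flagged as an outlier
    outliers <- which(!pcaHub@flag)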
After the robust PCA, the outliers would be the samples whose flag is `FALSE`. Do you think it is appropriate to remove outliers using this method?
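To make the question concrete, here is a minimal sketch of what flag-based removal looks like, run on simulated toy data rather than my real expression matrix. The object names (`expr`, `pca_toy`, `diag_toy`, `expr_clean`) and the choice of `k = 2` are assumptions made only for this illustration; the `sd`, `od`, `cutoff.sd`, `cutoff.od` and `flag` slots are the per-sample distances and cutoffs stored in the fitted `rrcov` object.

```r
library(rrcov)

# Toy data only: 200 simulated "genes" x 30 "samples", not real expression values
set.seed(1)
expr <- matrix(rnorm(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:30)))
expr[, 1:2] <- expr[, 1:2] + 3  # shift two samples so there is something unusual

# k = 2 components is an arbitrary choice for this sketch
pca_toy <- PcaHubert(t(expr), k = 2)

# Per-sample diagnostics: score distance, orthogonal distance, and the flag
diag_toy <- data.frame(sample  = colnames(expr),
                       sd      = pca_toy@sd,
                       od      = pca_toy@od,
                       flagged = !pca_toy@flag)
print(diag_toy[diag_toy$flagged, ])

# Cutoffs that the flag is based on
print(c(sd = pca_toy@cutoff.sd, od = pca_toy@cutoff.od))

# Drop the flagged samples (columns) only after inspecting them
expr_clean <- expr[, pca_toy@flag, drop = FALSE]
```

Looking at the two distances next to their cutoffs, instead of only the flag, shows whether a sample is far beyond the threshold or only marginally over it, which may matter before discarding samples.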
Any comments would be greatly appreciated.