I have an RNA-seq dataset normalized as RPKM. The dataset has one gene per row and four experimental conditions. I need to detect the outlier values in this dataset.
I used the Weka InterquartileRange filter: a filter for detecting outliers and extreme values based on interquartile ranges. The filter skips the class attribute. Below, Q1 and Q3 are the first and third quartiles, IQR = Q3 - Q1, OF is the outlier factor and EVF is the extreme value factor.
Outliers: Q3 + OF * IQR < x <= Q3 + EVF * IQR or Q1 - EVF * IQR <= x < Q1 - OF * IQR
Extreme values: x > Q3 + EVF * IQR or x < Q1 - EVF * IQR
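For readers who want to try the same rule outside Weka, here is a minimal R sketch of the two definitions above. The function name `flag_iqr`, the factor values `of = 3` and `evf = 6`, and the example numbers are assumptions made for illustration. R's default quantile method may also differ slightly from the one Weka uses, so the flags will not necessarily match the filter's output exactly.

```r
# Flag each value of a numeric vector as "ok", "outlier" or "extreme"
# using the interquartile-range fences described above.
# of / evf are the outlier and extreme value factors; the defaults here
# are illustrative assumptions.
flag_iqr <- function(x, of = 3, evf = 6) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q1 <- q[1]
  q3 <- q[2]
  iqr <- q3 - q1

  # Extreme values: beyond the EVF fences
  extreme <- x > q3 + evf * iqr | x < q1 - evf * iqr

  # Outliers: between the OF fence and the EVF fence
  outlier <- (x > q3 + of * iqr & x <= q3 + evf * iqr) |
             (x < q1 - of * iqr & x >= q1 - evf * iqr)

  factor(ifelse(extreme, "extreme", ifelse(outlier, "outlier", "ok")),
         levels = c("ok", "outlier", "extreme"))
}

# Hypothetical RPKM values for a single condition (one column of the table)
rpkm <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 100)
flags <- flag_iqr(rpkm)
table(flags)
```

To apply it to a whole table, something like `sapply(as.data.frame(expr), flag_iqr)` would flag each condition column separately, assuming `expr` holds only the numeric RPKM columns.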
My questions are:
- Are there other methods for outlier detection in this type of data?
- Can I continue to use this method for my data?
- Any suggestions?
Is there a reason why you are using the outlier approach rather than doing standard differential gene expression?
I am going to cluster the dataset. Plotting the data shows some very high values, and those values affect clustering algorithms such as k-means.
You usually want to see exactly that...
With the original data, I get some clusters with low significance. I want to detect and remove the outliers from my dataset to improve the results of the clustering algorithms.
That's not improving the results, it's fudging them.
Why? I think it matters for the significance of my clusters.
If you start removing points willy-nilly, you can get whatever significance you want.
One suggestion: instead of removing outliers, you could try a distance metric that is robust to outlier values. What comes to mind is
dist = 1 - cor(x, y, method = "spearman")
but I must say that I have never tested such a metric, and I'm not 100% sure it is a good idea.
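To make the suggestion concrete, here is a rough sketch of how a Spearman-based distance could be fed to a clustering step in R. The matrix `expr`, its gene and condition names, and the choice of `hclust` with average linkage are all assumptions for illustration. Base R's `kmeans` works on raw coordinates rather than a precomputed distance matrix, so a distance-based method is swapped in here. With only four conditions per gene, rank correlations take few distinct values, so treat this as an experiment rather than a recommendation.

```r
# Hypothetical expression matrix: 6 genes (rows) x 4 conditions (columns)
set.seed(1)
expr <- matrix(rexp(24, rate = 0.1), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("cond", 1:4)))
expr["gene1", "cond4"] <- 5000  # one very high value, as in the question

# cor() correlates columns, so transpose to compare genes with each other.
# Spearman uses ranks, so the size of the high value does not matter,
# only its position within the gene's profile.
d <- as.dist(1 - cor(t(expr), method = "spearman"))

# Distance-based clustering on the Spearman distance
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
clusters
```

Because the distance depends only on ranks, replacing 5000 with an even larger number leaves `d` unchanged, which is the sense in which the metric is robust to high values.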