Outlier detection methods for RNA-Seq data
4.8 years ago

I have an RNA-seq dataset normalized as RPKM. The dataset has one gene per row and four experimental conditions. I need to detect the outlier values in this dataset.

I used the Weka InterquartileRange filter: a filter for detecting outliers and extreme values based on interquartile ranges. The filter skips the class attribute.

Outliers: Q3 + OF * IQR < x <= Q3 + EVF * IQR or Q1 - EVF * IQR <= x < Q1 - OF * IQR

Extreme values: x > Q3 + EVF * IQR or x < Q1 - EVF * IQR
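
For readers who want to try the same rule outside Weka, here is a minimal sketch in base R. It reads Q1 and Q3 as the 25% and 75% quartiles, IQR as Q3 - Q1, OF as the outlier factor and EVF as the extreme value factor. The function name `flag_iqr`, the factor values 3 and 6, and the example RPKM vector are assumptions made for illustration. R's `quantile()` may also compute quartiles slightly differently from Weka, so borderline values could be classified differently.

```r
# Classify each value as "normal", "outlier" or "extreme" using IQR fences.
# of (outlier factor) and evf (extreme value factor) are illustrative
# assumptions, not values taken from the post.
flag_iqr <- function(x, of = 3, evf = 6) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]

  lo_out <- q[1] - of * iqr    # lower outlier fence
  hi_out <- q[2] + of * iqr    # upper outlier fence
  lo_ext <- q[1] - evf * iqr   # lower extreme-value fence
  hi_ext <- q[2] + evf * iqr   # upper extreme-value fence

  status <- rep("normal", length(x))
  status[(x > hi_out & x <= hi_ext) | (x >= lo_ext & x < lo_out)] <- "outlier"
  status[x > hi_ext | x < lo_ext] <- "extreme"
  status
}

# Hypothetical RPKM values for a single condition
rpkm <- c(1, 2, 3, 4, 5, 6, 7, 8, 30, 100)
flags <- flag_iqr(rpkm)
print(data.frame(rpkm, flags))
```

Applied per condition (one column at a time), this flags values rather than removing them, so the decision about what to do with a flagged gene stays separate from the detection step.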

My questions are:

- Are there other methods for outlier detection in this type of data?

- Can I continue to use this method for my data?

- Any suggestions?

Is there a reason why you are using the outlier approach rather than doing standard differential gene expression?


I am going to cluster the dataset. The plots show some very high values, and those values affect clustering algorithms such as k-means.


You usually want to see exactly that...


With the original data, I get some clusters with low significance. I want to detect and remove the outliers from my dataset to improve the results of the clustering algorithms.


That's not improving the results, it's fudging them.


Why? I think it matters for the significance of my clusters.


If you start removing points willy-nilly, then you can get whatever significance you want.


One suggestion: instead of removing outliers, you could try a distance metric that is robust to outlier values. What comes to mind is dist = 1 - cor(x, y, method = "spearman"), but I must say that I have never tested such a metric and I'm not 100% sure it is a good idea.
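
A rough sketch of what that could look like in base R, under a few assumptions: the matrix `expr` and its gene and condition names are made up, and hierarchical clustering (`hclust`) stands in for k-means, because base R's `kmeans()` works on the raw coordinates and does not accept a precomputed distance. Like the suggestion itself, this is untested on real data.

```r
# Hypothetical expression matrix: genes in rows, 4 conditions in columns
expr <- rbind(
  geneA = c(1, 2, 3, 4),
  geneB = c(10, 20, 30, 4000),  # same ranking as geneA, one extreme value
  geneC = c(4, 3, 2, 1),
  geneD = c(40, 30, 20, 10)
)
colnames(expr) <- c("cond1", "cond2", "cond3", "cond4")

# cor() correlates columns, so transpose to compare genes with each other.
# Spearman uses ranks, so the 4000 in geneB carries no extra weight.
d <- as.dist(1 - cor(t(expr), method = "spearman"))

# Hierarchical clustering on the precomputed distance, cut into 2 groups
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)
```

With only four conditions per gene, a rank correlation can take just a few distinct values, so many genes may end up at exactly the same distance from each other. That is worth checking before relying on this metric.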

4.7 years ago
Whoknows ▴ 860

Hi

You can get information about outlier values from a scatter plot in R.

Try plotting the RPKM values as a ggplot scatter plot, which makes the outliers in your data visible. The advantage of a scatter plot is that it shows both the correlation between your samples and the range of the values. You could simply remove the outliers, but keep in mind that the threshold you choose matters a lot: both 0.0029 and 220 are valid RPKM values. This is my code for leaving points above 8 and below -8 out of the scatter plot:

library(ggplot2)  # dat: data frame with one column per sample (S1, S2); points outside the limits are dropped
ggplot(dat, aes(S1, S2)) + geom_point() + xlim(-8, 8) + ylim(-8, 8) + geom_smooth(method = "lm")