Question

Outlier detection in gene expression datasets

3.2 years ago

Mohita ▴ 50

Hello all, I am working on gene expression data analysis. As a beginner, I started with microarray expression data, and I want to construct a co-expression network. Before that, I want to check the quality of both the samples and the genes in my datasets. I have seen methods such as correlation plots and clustering for detecting outliers at the sample level, and the MAD score at the gene level. When performing outlier detection at the sample level, do we need to take conditions such as normal vs. disease into account? For the sample level, I used a correlation heatmap and a cluster dendrogram, and no outliers were observed. However, I noticed that some normal samples cluster with disease samples and vice versa. As far as I know, this should not happen. I am new to this field and do not understand how to interpret these results. Please help me with your expert opinion. I shall be thankful to you all.

Regards, Mohita

R • 1.0k views

updated 3.1 years ago by Elucidata ▴ 270 • written 3.2 years ago by Mohita ▴ 50

score 2 · Answer 1 · 2021-03-16

There are different ways to assess the quality of the gene expression data at hand. Some of the common methods are:

Boxplot – A plot that shows the distribution and skewness of the data. With a boxplot, you can assess the centrality and dispersion of the data, spot skewness (which indicates non-normality), and identify outliers. Refer to the article Interpretation of Boxplot. You could use the geom_boxplot() function of ggplot2 in R. You can refer to the tutorial here to make some good-looking plots.
PCA – Principal Component Analysis (PCA) finds the directions in the data that explain the most variation; samples that sit far from the rest along the first few components are candidate outliers. Normalize the data first, run PCA (for example with prcomp()), and plot the result with the autoplot() function from the ggfortify library. Refer to the tutorial here.
MAD score – The Median Absolute Deviation (MAD) is a robust measure of dispersion: the median of the absolute deviations from the median. You could use the mad() function in R, which by default scales the result by 1.4826 so that it is comparable to the standard deviation for normally distributed data.
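As a rough illustration of the three checks above, here is a minimal sketch in base R. It runs on a simulated matrix, so the object names (expr, S1 to S12), the shifted sample, and the cutoff of 3 robust standard deviations are all assumptions made for the example, not recommended defaults. Replace expr with your own normalized matrix (genes in rows, samples in columns).

```r
# Simulated log2 expression matrix: 200 genes (rows) x 12 samples (columns).
# All names and values are made up for illustration.
set.seed(1)
expr <- matrix(rnorm(200 * 12, mean = 8, sd = 1), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("S", 1:12)))
expr[, "S12"] <- expr[, "S12"] + 3  # shift one sample so it acts as an outlier

# 1. Boxplot: per-sample distribution summary.
#    plot = FALSE only returns the statistics; drop it to draw the plot.
bp <- boxplot(expr, plot = FALSE)
sample_medians <- setNames(bp$stats[3, ], bp$names)  # row 3 holds the medians

# 2. PCA on samples: prcomp() expects samples in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]
# Robust z-score of PC1; the cutoff of 3 is an assumption for this example.
pc1_z <- abs(pc1 - median(pc1)) / mad(pc1)
sample_outliers <- names(pc1_z)[pc1_z > 3]

# 3. MAD per gene: robust spread of each gene across samples.
gene_mad <- apply(expr, 1, mad)
top_variable_genes <- names(sort(gene_mad, decreasing = TRUE))[1:10]

print(round(sample_medians, 2))
print(sample_outliers)
print(top_variable_genes)
```

To get the plots mentioned above, ggplot2's geom_boxplot() (on data reshaped to long format) and ggfortify's autoplot(pca) can replace the base R calls. Colouring the PCA plot by condition (normal vs. disease) makes it easier to see whether a sample is unusual within its own group.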

For the outlier detection approaches, refer to Figure 1 in "Robust Detection of Outlier Samples and Genes in Expression Datasets".