Question

Is there a way to cluster mixed data types in microarray data and render them in a 3D scatter plot in R?

4.8 years ago

Jurat Shahidin ▴ 100

Hi all:

I am working with Affymetrix microarray data as my introduction to microarray analysis. I would like to see how the data points are distributed within labeled groups in a 3D plot, because I want to see how similar the groups are in 3D space. I used the scatterplot3d package from CRAN to draw a 3D scatter plot, but I didn't get the correct plot for my data.

My guess is that I should first cluster the data points that belong to different labeled groups and then render them in 3D space. Here is reproducible data simulated from the actual dataset:

reproducible data

> dput(head(phenDat,30))
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03", 
"Tarca_004_P1A04", "Tarca_064_P1F04", "Tarca_076_P1G04", "Tarca_088_P1H04", 
"Tarca_005_P1A05", "Tarca_017_P1B05", "Tarca_054_P1E06", "Tarca_066_P1F06", 
"Tarca_078_P1G06", "Tarca_090_P1H06", "Tarca_007_P1A07", "Tarca_019_P1B07", 
"Tarca_031_P1C07", "Tarca_079_P1G07", "Tarca_091_P1H07", "Tarca_008_P1A08", 
"Tarca_020_P1B08", "Tarca_022_P1B10", "Tarca_034_P1C10", "Tarca_046_P1D10"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6, 
30.6, 32.6, 12.6, 18.6, 25.6, 30.6, 36.4, 24.9, 28.9, 36.6, 19.9, 
26.1, 30.1, 36.7, 13.6, 17.6, 22.6, 24.7, 13.3, 19.7, 24.7), 
    Batch = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA"), Train = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Platform = c("HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20")), row.names = c(NA, 30L), class = "data.frame")

my attempt:

hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) as.dist((1-cor(t(x)))/2)
d <- distfunc(persons_df)
fit <- hclustfunc(d)

It seems I need to group the data points that belong to each individual group (for instance, the batch column has 4 different batches), then use PCA, hierarchical clustering, or k-means to measure the distances, and then render the points in 3D space with a 3D scatter plot. So far my attempts haven't produced the plot I expected.

Basically, I want to color the data points (i.e., rows) by some 'group' attribute, such as the batch they belong to. I just want to see how similar the data points are to each other when they are grouped by age category (I used findInterval(persons_df$ages, c(10,20,30,40,50))), by batch, and by platform.

I am thinking of using k-means, PCA, or another method that gives me components I can visualize in a 3D plot, but it is not clear to me how to do this in R.

desired plot

I would like to get a 3D plot something like this:

Can anyone point out how I can make this happen? Is there a way to cluster my data and visualize it in a 3D plot in R? Any thoughts? Thanks

R microarray clustering scatter-plot • 2.1k views

ADD COMMENT • link 4.8 years ago by Jurat Shahidin ▴ 100

score 3 · Accepted Answer · 2019-07-12

4.8 years ago

Jean-Karim Heriche 27k

First, read the docs of the functions you're using. hclust() does hierarchical clustering, which means it produces a tree, not individual clusters. To get clusters, you need to cut the tree (check ?cutree).
Second, you don't need to wrap a function in another function if you're not modifying it, i.e. you can just do

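# correlation-based distance between the rows of x
d <- as.dist((1 - cor(t(x))) / 2)
# complete-linkage hierarchical clustering
tree <- hclust(d, method = "complete")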

This makes the code clearer.
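
To make the cutree() step concrete, here is a minimal sketch on simulated data. The matrix x, the seed, and the choice of k = 3 are assumptions made only for illustration; they do not come from the question's dataset.

```r
# Simulated expression-like matrix: 30 samples (rows) x 50 features (columns).
# The data and k = 3 are illustrative assumptions.
set.seed(1)
x <- matrix(rnorm(30 * 50), nrow = 30,
            dimnames = list(paste0("sample", 1:30), NULL))

# Correlation-based distance between samples, scaled to [0, 1]
d <- as.dist((1 - cor(t(x))) / 2)

# Hierarchical clustering produces a tree ...
tree <- hclust(d, method = "complete")

# ... and cutree() cuts it into k groups, giving one cluster label per sample
clusters <- cutree(tree, k = 3)
table(clusters)
```

The resulting clusters vector has one integer label per sample and is what the plotting step needs as its grouping variable.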

Once you have a vector of cluster memberships and a vector of associated colors, you can use them to color the points. For a 3D scatter plot, I use something like this:

library(rgl)
library(car)
scatter3d(x = PC1, y = PC2, z = PC3, surface = FALSE,
          groups = as.factor(clusters),
          surface.col = cluster.colors, col = cluster.colors,
          xlab = "PC1", ylab = "PC2", zlab = "PC3")
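
The call above assumes that PC1, PC2, PC3, clusters and cluster.colors already exist. As a rough sketch of where they might come from, the following computes them from a simulated numeric matrix with prcomp() and cutree(). The data, the seed, k = 3 and the three color names are all assumptions for illustration, and the plot is only drawn in an interactive session where car and rgl are installed.

```r
# Simulated numeric matrix (30 samples x 50 features); an illustrative assumption
set.seed(1)
x <- matrix(rnorm(30 * 50), nrow = 30)

# PCA on the samples; keep the scores on the first three components
pca <- prcomp(x, center = TRUE, scale. = TRUE)
PC1 <- pca$x[, 1]
PC2 <- pca$x[, 2]
PC3 <- pca$x[, 3]

# Cluster memberships from a cut hierarchical tree (k = 3 is arbitrary here)
d <- as.dist((1 - cor(t(x))) / 2)
clusters <- cutree(hclust(d, method = "complete"), k = 3)

# One color per cluster, in the order of the factor levels
cluster.colors <- c("red", "blue", "darkgreen")

# Draw only in an interactive session with car and rgl installed
if (interactive() &&
    requireNamespace("car", quietly = TRUE) &&
    requireNamespace("rgl", quietly = TRUE)) {
  car::scatter3d(x = PC1, y = PC2, z = PC3, surface = FALSE,
                 groups = as.factor(clusters), surface.col = cluster.colors,
                 xlab = "PC1", ylab = "PC2", zlab = "PC3")
}
```

PCA as used here needs numeric input only, so this sketch applies to an expression matrix rather than to a table that mixes numeric and categorical columns.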

To recap:

Read the docs of the functions you intend to use
Cluster your data to obtain a vector of cluster memberships
Get a vector of colors you want to associate with each cluster
Plot using the appropriate syntax for the plotting function of your choice.

ADD COMMENT • link 4.8 years ago by Jean-Karim Heriche 27k

Hi Jean-Karim Heriche:

Thanks for the heads-up. I want to see how similar the data points are to each other when they are grouped by age category, batch, and platform. How can I do that? If I run PCA on the data below, I get an error. I think I need to group the data points by their labels and then proceed with your solution above. Any more thoughts? I have also made the actual data available below:

> dput(head(phenDat,30))
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03", 
"Tarca_004_P1A04", "Tarca_064_P1F04", "Tarca_076_P1G04", "Tarca_088_P1H04", 
"Tarca_005_P1A05", "Tarca_017_P1B05", "Tarca_054_P1E06", "Tarca_066_P1F06", 
"Tarca_078_P1G06", "Tarca_090_P1H06", "Tarca_007_P1A07", "Tarca_019_P1B07", 
"Tarca_031_P1C07", "Tarca_079_P1G07", "Tarca_091_P1H07", "Tarca_008_P1A08", 
"Tarca_020_P1B08", "Tarca_022_P1B10", "Tarca_034_P1C10", "Tarca_046_P1D10"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6, 
30.6, 32.6, 12.6, 18.6, 25.6, 30.6, 36.4, 24.9, 28.9, 36.6, 19.9, 
26.1, 30.1, 36.7, 13.6, 17.6, 22.6, 24.7, 13.3, 19.7, 24.7), 
    Batch = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
    6L, 6L, 6L), Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA"), Train = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Platform = c("HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "GSE113966", "GSE113966", 
    "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", 
    "GSE113966")), row.names = c(NA, 30L), class = "data.frame")

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

I am not sure I get the problem. If you need to visualize different groups, just assign them colors just as you would for clusters.
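
As a small sketch of what that could look like, the helper below turns any grouping variable into one color per sample. The data frame pheno is a made-up stand-in that borrows the column names GA, Batch and Platform from the question, and group_colors() is a hypothetical helper written for this example, not a function from any package.

```r
# Small stand-in for phenDat with the same column names (values are made up)
pheno <- data.frame(
  GA       = c(11, 15.3, 21.7, 26.7, 31.3, 36.4),
  Batch    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Platform = c("HTA20", "HTA20", "HTA20",
               "GSE113966", "GSE113966", "GSE113966")
)

# Hypothetical helper: map a grouping variable to one color per sample
group_colors <- function(g) {
  g <- as.factor(g)
  pal <- rainbow(nlevels(g))            # one color per group level
  setNames(pal[as.integer(g)], as.character(g))
}

batch_col    <- group_colors(pheno$Batch)
platform_col <- group_colors(pheno$Platform)

# Bin the continuous variable first, as with findInterval() in the question
ga_col <- group_colors(findInterval(pheno$GA, c(10, 20, 30, 40)))
```

Any of these vectors can then be passed as the point colors of a scatter plot, so the same set of coordinates can be recolored by batch, platform, or age bin.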

ADD REPLY • link 4.8 years ago by Jean-Karim Heriche 27k

How would you run PCA and get a 3D scatter plot for the data pasted above? I can't do PCA on the Platform column, and I need to identify the data points that belong to the different batches and to the two unique platforms in my data. Could you elaborate? Thanks

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

So you're trying to do PCA with mixed data types. While there are ways to do it (e.g. one-hot encoding or using the PCAmixdata package), I think this may not be what you want to do. If you're interested in the similarity between the samples, just compute a relevant measure of similarity/distance. One that's often used for mixed data is Gower's coefficient of similarity, available in several R packages (e.g. see package proxy or the daisy() function in the package cluster). Once you have a distance matrix, you can either visualize it directly as a heatmap or apply a manifold learning method like multidimensional scaling (MDS), UMAP, etc.
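
A minimal sketch of the Gower-plus-MDS route, assuming the cluster package is installed (it is a recommended package that usually comes with R). The data frame pheno is simulated and only reuses the column names from the thread; constant columns such as Set and Train are left out here because they carry no information about differences between samples.

```r
library(cluster)  # provides daisy()

# Stand-in for phenDat (made-up values, same column names as in the thread)
set.seed(1)
pheno <- data.frame(
  GA       = round(runif(30, 11, 37), 1),
  Batch    = factor(sample(1:6, 30, replace = TRUE)),
  Platform = factor(sample(c("HTA20", "GSE113966"), 30, replace = TRUE))
)

# Gower dissimilarity handles numeric and categorical columns together;
# categorical columns are supplied as factors
d <- daisy(pheno, metric = "gower")

# Classical MDS: embed the samples in (up to) three dimensions
coords <- cmdscale(d, k = 3)
colnames(coords) <- paste0("Dim", seq_len(ncol(coords)))
head(coords)
```

The columns of coords can then play the role of the three plotting axes, with points colored by Batch, Platform, or a binned GA.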

ADD REPLY • link 4.8 years ago by Jean-Karim Heriche 27k

Yes, but even a simple way of dealing with mixed data, by grouping the data points and rendering them in a 3D plot, would be fine. Do you have a simple solution for that? Thanks a lot

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100