Make large matrix data interpretable. How?
10 weeks ago
Ankit ▴ 390

Hi,

I have a large matrix of quantitation values (range 0 to 1). The x-axis and y-axis are different proteins. I want to categorise the data and represent it as clusters/groups. What is the best way to do that?

Some possibilities I thought of:

Correlation matrix plot (too big)

Heatmap (too big)

PCA (will be messy)

Dendrogram (not so useful)

Any strategies?

Thank you

R matrix • 817 views

Can you explain your data in more detail, and the general question(s) you're trying to answer? It would be difficult to give an answer without knowing more.


Unless you provide more information on the data itself (how many dimensions does one feature have?) and your hypothesis, it is hard to answer your question. Have you already considered applying either t-SNE or UMAP dimension reduction techniques, which may be more suitable than PCA?


Thank you guys!

I will look into t-SNE and UMAP.

Basically, I have a data frame of values which I calculated as follows:

## Step 1.

The number of genes bound by both protein1 and protein2, divided by the total number of genes bound by protein1.

Similarly for protein2.

In this way I created a matrix of ratios (0 to 1) for the different combinations.
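
A minimal sketch of the Step 1 calculation, assuming each protein's bound genes are available as a Python set. The names `bound_genes` and `overlap_ratio_matrix` and the toy gene lists are hypothetical, for illustration only. Note that the result is generally asymmetric, because each row is normalised by its own protein's total.

```python
def overlap_ratio_matrix(bound_genes):
    """Return (names, matrix) with matrix[i][j] = |genes_i & genes_j| / |genes_i|."""
    names = sorted(bound_genes)
    matrix = []
    for a in names:
        genes_a = bound_genes[a]
        row = []
        for b in names:
            shared = len(genes_a & bound_genes[b])
            # A protein with no bound genes gets 0.0 to avoid dividing by zero.
            row.append(shared / len(genes_a) if genes_a else 0.0)
        matrix.append(row)
    return names, matrix


# Toy input (hypothetical data, not taken from the thread).
bound_genes = {
    "protein1": {"g1", "g2", "g3", "g4"},
    "protein2": {"g3", "g4"},
    "protein3": {"g5"},
}
names, ratios = overlap_ratio_matrix(bound_genes)
```

Here the protein1 row holds the ratio relative to protein1's four genes, while the protein2 row is relative to protein2's two genes, so the two directions differ.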

## Step 2.

Next I want to categorise the N proteins by whether they share more or fewer genes with protein1 than the others do, and so on for each of the other proteins.

## Step 3.

Then I want to predict a network of proteins that could regulate the same genes. Basically, I want to use the Step 2 data to produce a list of sets of proteins that could co-regulate Gene 1, Gene 2 and so on (hypothetical, in silico support).
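
One crude way to get such a list, sketched under stated assumptions: link two proteins when the ratio is at or above a cutoff in both directions, then report the connected groups. The helper `candidate_groups`, the cutoff of 0.5 and the toy matrix are all hypothetical, and connected components are only a simple stand-in for a real network prediction.

```python
from collections import deque


def candidate_groups(names, ratios, threshold=0.5):
    """Group proteins linked by overlap ratios >= threshold in both directions."""
    n = len(names)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Require the overlap to be high from both proteins' point of view.
            if min(ratios[i][j], ratios[j][i]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected group.
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(names[node])
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(members))
    return groups


# Toy ratio matrix (hypothetical values).
names = ["A", "B", "C", "D"]
ratios = [
    [1.0, 0.8, 0.1, 0.0],
    [0.7, 1.0, 0.2, 0.0],
    [0.1, 0.3, 1.0, 0.9],
    [0.0, 0.0, 0.4, 1.0],
]
groups = candidate_groups(names, ratios, threshold=0.5)
```

In this toy case C and D are not linked at a cutoff of 0.5 because the ratio is high in only one direction, so the choice of cutoff and of the both-directions rule matters and would need to be justified for real data.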

## Step 4.

We will test this by co-IP or immunoprecipitation for some genes of interest in WT and KO.

The dimensions are 560 x 560 (all against all).

Values range from 0 to 1: 0 means no genes are shared and 1 means the bound genes are exactly the same.

10 weeks ago
Mensur Dlakic ★ 23k

The more information you provide, the more likely it is we can offer good suggestions. In particular, what is a "large matrix", and what exactly does "interpretable" mean to you? The answer is different for a matrix that is 500-1000 proteins large (I will assume this) versus a matrix that is 100,000 proteins large. The answer is also different if by interpretable you mean a global number of protein groups (I will assume this) versus understanding individual protein relationships on a more granular level.

All the strategies you proposed are viable. Dimensionality reduction with PCA may work, but I will give you two examples of doing it with t-SNE. A principal difference is that PCA globally preserves distances between data points but doesn't always separate data points clearly, as it relies on linear relationships, while t-SNE is nonlinear and often separates points better, but preserves only their local relationships. If you want to try t-SNE, I recommend openTSNE.

Both embeddings below are for 699 proteins. In the first case we start from a symmetric distance matrix, which partially looks like this:

vector_0001,vector_0002,vector_0003,vector_0004,vector_0005  ....
0.000000,0.652870,0.648000,0.675000,0.639257,0.640854,0.685039,0.636119  ....
0.652870,0.000000,0.373832,0.730000,0.512684,0.383178,0.457944,0.485175 ....


And here is the embedding:

To me that is plenty interpretable, but I don't know exactly what you are trying to achieve.
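
If the starting point is an asymmetric ratio (similarity) matrix rather than a distance matrix, one possible conversion is sketched below. Averaging the two directions and taking 1 minus the result is an assumption made here for illustration, not necessarily how the distance matrix above was produced; `to_symmetric_distance` is a hypothetical helper and the toy values are made up.

```python
def to_symmetric_distance(ratios):
    """Turn an asymmetric similarity matrix (0..1) into a symmetric distance matrix."""
    n = len(ratios)
    dist = [[0.0] * n for _ in range(n)]  # the diagonal stays at 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Average the two directions, then convert similarity to distance.
            similarity = (ratios[i][j] + ratios[j][i]) / 2.0
            dist[i][j] = dist[j][i] = 1.0 - similarity
    return dist


# Toy asymmetric ratio matrix (hypothetical values).
ratios = [
    [1.0, 0.5, 0.0],
    [1.0, 1.0, 0.25],
    [0.0, 0.75, 1.0],
]
dist = to_symmetric_distance(ratios)
```

A matrix of this form can then be given to a t-SNE implementation that accepts precomputed distances, for example Rtsne with `is_distance = TRUE`.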

For the same group of proteins we can start with a protein language model from here, specifically ProtT5-XL-U50, and create a 1024-dimensional embedding vector for each protein of interest. The resulting matrix looks in part like this:

vector_0001,vector_0002,vector_0003,vector_0004,vector_0005  ....
0.053524988,0.034982558,0.056690590,0.032287619,0.045326617 ....
0.029987951,-0.003866601,0.032291183,0.026446213,0.029312980 ....


And here is the embedding from the matrix above:

Again, this is plenty interpretable for my needs, but your mileage may vary.


How do you plot the embedding?

I am trying it on test data


t-SNE reduces the matrix to two dimensions, so it is a simple scatter plot afterwards.


Since I was using R to process my data, I used Rtsne to plot it. Can you tell whether I am doing something incorrectly for the type of data I have (values in the range 0 to 1, dimensions 560 x 560)?

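library(Rtsne)
library(stringr)

data <- read.csv("data.csv")

# keep the label column (X) out of the input; Rtsne needs a numeric matrix
mat <- as.matrix(data[, setdiff(names(data), "X")])

# perform t-SNE on the data
tsne_results <- Rtsne(mat)

# plot the results using the plot function
plot(tsne_results$Y, type = "n")
text(tsne_results$Y, labels = unlist(lapply(str_split(data$X, "_"), function(x) x[1])), cex = 0.5)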


I use Python, not R. Providing an error message would help, but first you need to be sure of the data format inside tsne_results.