Question: cluster groups of values disregarding specific sequence of values
mccormack70 wrote:

I have a matrix (or data frame, or Excel spreadsheet) populated with alphanumeric identifiers (ATG numbers, if you are familiar with plants, e.g. At1G45623). The first column contains each identifier exactly once (no duplicates). After the first column, each row holds a variable number of these identifiers, anywhere from 2 to 350. A simplified version is below:

z a b c d e f g h i j

y a b c z w h n q i j

w j i h a f d c b g e p

x i j a b c d h e f m k l o u p s t v

So, each row is like an array where the name of the array is the value in the first column.

I would like to cluster these 'arrays' so that row z (array z) and row w cluster together and not row z with row y. Disregarding the first column, rows z and w have the same group of letters (except for 1 letter), but in a very different sequence. Rows z and y have similar sequences, but somewhat different letters. The method I am searching for would cluster rows z and w together before z and y.

The clustering techniques I am familiar with all take the sequence of the values into account. I am looking for a method that disregards the sequence and considers only the contents of each row. This is further complicated by the fact that the rows contain very different numbers of values, which is why I included row x as a reminder that the size of each 'array' is variable. Does anyone know of anything in Perl, Python or R that could help?

modified 4.6 years ago • written 4.6 years ago by mccormack70
Jean-Karim Heriche wrote:

Compute the Jaccard index (or any other suitable measure of similarity over sets, see the sets package in R) and use a clustering algorithm.
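
As a minimal sketch of this idea in base R (not using the sets package), the example rows from the question can be treated as sets, compared with the Jaccard index, and passed to hclust as 1 minus the similarity. The `rows` list and the `jaccard` helper are illustrative names, and average linkage is just one possible choice:

```r
# Example rows from the question; the first column is used as the name
rows <- list(
  z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  y = c("a", "b", "c", "z", "w", "h", "n", "q", "i", "j"),
  w = c("j", "i", "h", "a", "f", "d", "c", "b", "g", "e", "p"),
  x = c("i", "j", "a", "b", "c", "d", "h", "e", "f", "m", "k", "l",
        "o", "u", "p", "s", "t", "v")
)

# Jaccard index: size of the intersection over size of the union.
# Order within a row plays no role.
jaccard <- function(s1, s2) length(intersect(s1, s2)) / length(union(s1, s2))

# Pairwise similarity matrix; row and column names come from `rows`
sim <- sapply(rows, function(s1) sapply(rows, function(s2) jaccard(s1, s2)))

# hclust expects dissimilarities, so convert similarity to distance
clust <- hclust(as.dist(1 - sim), method = "average")
# plot(clust) would draw the dendrogram
```

Here z and w share 10 of their 11 distinct letters (Jaccard index 10/11), while z and y share only 6 of 14, so z and w are merged first regardless of the order of the letters.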

dariober wrote:

What about creating an n x n matrix of 0s and 1s, with n being the number of "letters" (ATG numbers), and then clustering it with R functions such as dist and hclust? For example, your input would look like `dat`:

```r
set.seed(1234)
# Random 0/1 matrix standing in for the real data
dat <- matrix(data = sample(c(0, 1), size = 100, replace = TRUE), nrow = 10)
rownames(dat) <- letters[1:10]
colnames(dat) <- letters[1:10]

dat
# Output as originally posted (the sampled values can differ between R versions):
#   a b c d e f g h i j
# a 0 1 0 0 1 0 1 0 1 0
# b 1 1 0 0 1 0 0 1 0 1
# c 1 0 0 0 0 1 0 0 0 0
# d 1 1 0 1 1 1 0 1 1 0
# e 1 0 0 0 0 0 0 0 0 0
# f 1 1 1 1 1 1 1 1 1 1
# g 0 0 1 0 1 0 0 0 0 0
# h 0 0 1 0 0 1 1 0 0 0
# i 1 0 1 1 0 0 0 0 0 0
# j 1 0 0 1 1 1 1 1 1 1
```

So in this matrix 1 means the row contains the given identifier.

Then cluster and plot a tree:

```r
clust <- hclust(dist(dat))
plot(clust)
```

Does it make sense? Does it amount to what Jean-Karim suggests?
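
One thing to check with this approach: `dist` defaults to Euclidean distance. For 0/1 data, `method = "binary"` gives the proportion of identifiers present in only one of two rows among those present in at least one, which is 1 minus the Jaccard index. A small sketch with a made-up membership matrix `m` (rows are row names, columns are identifiers; all values are invented for illustration):

```r
# Hypothetical 0/1 membership matrix: 1 means the row contains the identifier
m <- rbind(
  z = c(a = 1, b = 1, c = 1, p = 0),
  w = c(a = 1, b = 1, c = 1, p = 1),
  y = c(a = 1, b = 0, c = 0, p = 0)
)

d_euclid  <- dist(m)                     # default: Euclidean distance
d_jaccard <- dist(m, method = "binary")  # 1 - Jaccard index

# Cluster on the Jaccard distance
clust <- hclust(d_jaccard)
```

In this toy case z and w differ by one identifier out of four, so their binary distance is 0.25.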


Yes this is equivalent to my suggestion. Another option would be to consider that each row represents an edge list from a graph and cluster the nodes.

I will look into the Jaccard index and the sets package in R. This seems very promising and has been very helpful.

I have a question for dariober. I may not have explained this well, or I may be misunderstanding your reply. It seems that a single n x n matrix would only be able to cluster 1 ATG number, but my original data has hundreds of different ATG numbers, and each row contains many of them. It also seems to me from your dat matrix that 'row a' has the same identifier 4 times.

In my data, a row will not have duplicate ATG numbers, but ATG numbers do occur more than once when considering multiple rows (for example, AT1G01060 occurs in multiple rows). Below, is a small snippet of my data. (The ATG numbers may make it a little more confusing so in my question above I used letters.)

AT1G01010 AT1G01060 AT1G02230 AT1G06280
AT1G01020 AT1G01060 AT1G02230 AT1G02250 AT1G06280 AT1G12260 AT1G12630
AT1G01030 AT1G01060
AT1G01040 AT1G01060 AT1G01720 AT1G03800 AT1G06280 AT1G06850
AT1G01046 AT1G02230
AT1G01050 AT1G03840 AT1G06280 AT1G06850 AT1G12260
AT1G01060 AT1G02250 AT1G06280 AT1G06850 AT1G08320 AT1G09540
AT1G01070 AT1G01720 AT1G03840


If you consider that a row of your data is a list of edges of a graph, i.e. each row lists the other items linked to the one at the beginning of the row, then @dariober's matrix is the adjacency matrix of the graph.

I think Jean-Karim gave the right explanation; here I'll try to be more descriptive:

"It seems that a single n x n matrix would only be able to cluster 1 ATG number"

No, the n x n matrix represents all the ATG numbers you have, not just one.

"It seems to me from your dat matrix that 'row a' has the same identifier 4 times."

No, row "a" is associated with (linked to) the four distinct identifiers b, e, g and i. Row b is linked to a, b, e, h and j, and so on.

"In my data, a row will not have duplicate ATG numbers, but ATG numbers do occur more than once when considering multiple rows (for example, AT1G01060 occurs in multiple rows)"

Yes, and the same is true of the matrix above.

Just try to execute the code snippet I posted and see if it makes sense. You will see that f and j go together followed by d.

Thank you again, Jean-Karim and dariober. Once I looked up "adjacency matrix" I understood, and the code is very helpful too. One last question, hopefully without trying your patience: is there a quick way to convert my data to its adjacency matrix? Is there any direction you could point me in?

You'll most likely have to rewrite the data so that you get one edge per line, e.g.
AT1G01010 AT1G01060
AT1G01010 AT1G02230
Then you can read this with most graph software. If you're into R, I suggest you look at the igraph package.
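
A rough base R sketch of that conversion, assuming whitespace-separated rows held in a character vector (the `txt` vector stands in for `readLines()` on your own file, and the variable names are illustrative). It builds the one-edge-per-line table and then a 0/1 membership matrix via `table`:

```r
# Three rows taken from the snippet in the question
txt <- c(
  "AT1G01010 AT1G01060 AT1G02230 AT1G06280",
  "AT1G01030 AT1G01060",
  "AT1G01046 AT1G02230"
)

# Split each row into identifiers
fields <- strsplit(trimws(txt), "\\s+")

# One edge per line: first identifier paired with each remaining identifier
edges <- do.call(rbind, lapply(fields, function(f) {
  data.frame(from = f[1], to = f[-1], stringsAsFactors = FALSE)
}))

# Rows x identifiers matrix: 1 where the row contains the identifier
membership <- unclass(table(edges$from, edges$to))

# Jaccard distance between rows, ready for hclust()
d <- dist(membership, method = "binary")
```

The `edges` data frame can be written out with `write.table` for other graph software, and it is also the kind of two-column input that igraph's `graph_from_data_frame` accepts.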