I have a matrix (or data frame, or Excel spreadsheet) populated with alphanumeric identifiers (ATG numbers, if you are familiar with plants; e.g. At1G45623). The first column contains each identifier exactly once (no duplicates). After the first column, each row holds a variable number of these identifiers, anywhere from 2 to 350. A simplified version is below:
z a b c d e f g h i j
y a b c z w h n q i j
w j i h a f d c b g e p
x i j a b c d h e f m k l o u p s t v
So, each row is like an array where the name of the array is the value in the first column.
I would like to cluster these 'arrays' so that row z (array z) clusters with row w rather than with row y. Disregarding the first column, rows z and w contain the same set of letters (except for one extra letter, p, in row w), but in a very different order. Rows z and y have a similar order but share fewer letters. The method I am searching for would cluster rows z and w together before z and y.
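To make the goal concrete, here is a minimal Python sketch of the kind of order-independent comparison I have in mind. It treats each row as a set and scores pairs by Jaccard similarity (shared items divided by total distinct items). Jaccard is only my assumption for one possible measure, and the names `rows` and `jaccard` are made up for this illustration. I do not know whether this is the right measure, or how best to turn it into a clustering.

```python
from itertools import combinations

# Toy data from above: first column is the row name, the rest are its members.
rows = {
    "z": "a b c d e f g h i j".split(),
    "y": "a b c z w h n q i j".split(),
    "w": "j i h a f d c b g e p".split(),
    "x": "i j a b c d h e f m k l o u p s t v".split(),
}

def jaccard(a, b):
    """Order-independent similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty rows are treated as identical (an arbitrary choice).
    return len(a & b) / len(union) if union else 1.0

# Print the similarity of every pair of rows.
for r1, r2 in combinations(rows, 2):
    print(r1, r2, round(jaccard(rows[r1], rows[r2]), 3))
```

With this measure, z and w score higher than z and y, which is the ordering I want, and the differing row lengths are handled because the score is normalized by the size of the union.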
The clustering techniques I am familiar with all take the sequence of the values into account. I am looking for a method that would disregard the sequence and consider only the contents of each row. This is further complicated by the fact that the rows contain very different numbers of values, which is why I included row x, as a reminder that the size of each 'array' varies. Does anyone know of anything in Perl, Python or R that could help?