Cluster sequences in a network by their edit distance in R
16 months ago

I have a data frame my_df with 10,000 different CDR3 sequences of varying lengths (13 to 18 characters). Each sequence is made up of the digits 0-3.

Example of my data:

alfa

20000003331001

200000303323331021

200000100331021

...

My goal is to cluster them by edit distance < 3.

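library(stringdist)  # provides stringdistmatrix()
# Pairwise Levenshtein distances between all sequences
dist_mtx <- as.matrix(stringdistmatrix(my_df$alfa, my_df$alfa, method = "lv"))
dist_mtx[dist_mtx > 3] <- NA   # discard pairs more than 3 edits apart
dist_mtx[dist_mtx == 0] <- NA  # discard self-pairs (distance 0)
rownames(dist_mtx) <- my_df$alfa
colnames(dist_mtx) <- my_df$alfa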


Then I created an edge list, where the value is the edit distance between each pair of sequences:

library(reshape2)  # provides melt()
edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c('seq1', 'seq2'), as.is = TRUE))
edge_list <- edge_list[!is.na(edge_list$value), ]

Then I created the igraph object:

igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = my_df["alfa"])
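To illustrate how a graph built this way behaves, here is a minimal base-R sketch on three made-up sequences (they are hypothetical, not taken from my data). It uses `adist()` from base R as a stand-in for `stringdistmatrix()`, and single-linkage clustering cut at the threshold as a stand-in for taking connected components of the thresholded graph, since the two give the same groups.

```r
# Toy sequences (hypothetical, not from the real data)
seqs <- c("0000000000000", "0000000000011", "0000000001111")

# Base-R Levenshtein distance matrix
d <- adist(seqs)
dimnames(d) <- list(seqs, seqs)

# Thresholded adjacency: connect pairs with 0 < distance <= 3
adj <- d > 0 & d <= 3

# Single linkage cut at the same threshold reproduces the connected
# components of that thresholded graph
components <- cutree(hclust(as.dist(d), method = "single"), h = 3)

print(d)
print(components)
```

In this toy case the first and third sequences are not adjacent, yet they end up in the same component because each is close to the middle sequence.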


Then I tried several ways of clustering these sequences with the Louvain method, but I still get clusters whose members have an edit distance > 3. I'm aware that this may be because of the connected components. So my questions are:

1) Is there a way to cluster the sequences so that all members of each cluster are within an edit distance < 3 of each other?

2) Is there a way to identify the cluster centers (hubs) and assign vertices to those centers while taking the edit distance into account? I tried hubness.score() for this.
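To make the two questions concrete, here is a minimal sketch of one possible approach, not necessarily the best one. It swaps Louvain for complete-linkage hierarchical clustering, which merges groups at their largest pairwise distance, so cutting the tree at height 2 bounds every within-cluster distance at 2 (i.e. < 3). As a stand-in for a cluster "center" it uses the medoid, the member with the smallest total distance to the others. The toy sequences and the names `cluster_id`, `diameter` and `medoid` are my own assumptions, and `adist()` from base R is used in place of `stringdistmatrix()`.

```r
# Toy sequences (hypothetical); replace with my_df$alfa for real data
seqs <- c("0000000000000", "0000000000011", "0000000001111",
          "3333333333333", "3333333333332")
d <- adist(seqs)
dimnames(d) <- list(seqs, seqs)

# Complete linkage merges two groups at their largest pairwise distance,
# so cutting at h = 2 keeps every within-cluster distance <= 2
hc <- hclust(as.dist(d), method = "complete")
cluster_id <- cutree(hc, h = 2)

# Largest pairwise distance inside each cluster (its "diameter")
diameter <- sapply(split(seqs, cluster_id), function(m) max(d[m, m]))

# Medoid: member with the smallest total distance to the other members
medoid <- sapply(split(seqs, cluster_id), function(m) {
  m[which.min(rowSums(d[m, m, drop = FALSE]))]
})

print(cluster_id)
print(diameter)
print(medoid)
```

With 10,000 sequences the full distance matrix has 100 million entries, so memory use may need attention before applying this to the real data.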

This is my first post; I would appreciate any help.

R network analysis clustering editing distance