I have a data frame my_df with 10,000 distinct CDR3 sequences of varying length (13 to 18 characters), each made up of the digits 0-3.
Example of my data:
alfa
20000003331001
200000303323331021
200000100331021
...
My goal is to cluster them by edit distance < 3. I first computed the Levenshtein distance matrix:
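library(stringdist)  # stringdistmatrix()
library(reshape2)    # melt()
dist_mtx <- as.matrix(stringdistmatrix(my_df$alfa, my_df$alfa, method = "lv"))
dist_mtx[dist_mtx > 3] <- NA   # drop pairs further apart than the cutoff
dist_mtx[dist_mtx == 0] <- NA  # drop self-pairs (distance 0)
# label rows and columns with the sequences (they live in my_df, not in dist_mtx)
colnames(dist_mtx) <- my_df$alfa
rownames(dist_mtx) <- my_df$alfa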
I then created an edge list, where value is the edit distance between the two sequences:
edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c('seq1', 'seq2'), as.is = TRUE))
edge_list <- edge_list[!is.na(edge_list$value), ]
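For comparison, here is a minimal base R sketch of the same edge-list step. Melting the full symmetric matrix yields every pair twice, as (A, B) and (B, A), and unique() does not collapse those, so this version reads only the upper triangle. The toy sequences, the max_dist name, and the use of adist() in place of stringdistmatrix() are assumptions made for illustration only.

```r
# Toy sequences (made up for illustration, not the real data)
seqs <- c("0000000000000", "1100000000000", "1111000000000", "3333333333333")

# adist() is base R's generalized Levenshtein distance
d <- adist(seqs)
dimnames(d) <- list(seqs, seqs)

max_dist <- 3  # assumed cutoff; use 2 if the distance must be strictly below 3

# Upper triangle only: each unordered pair once, and no self-pairs
keep <- upper.tri(d) & d <= max_dist
idx  <- which(keep, arr.ind = TRUE)

edge_list <- data.frame(
  seq1  = seqs[idx[, "row"]],
  seq2  = seqs[idx[, "col"]],
  value = d[idx],
  stringsAsFactors = FALSE
)
print(edge_list)
```

A data frame shaped like this can be passed to igraph::graph_from_data_frame() in the same way as the melted one.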
Then I created the igraph object:
igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = my_df$alfa)
I then tried numerous approaches to cluster these sequences, including the Louvain method, but I still get clusters whose members have an edit distance > 3 from each other. I am aware that this may be because of the connected components. So my questions are:
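A small sketch of why connected components can exceed the cutoff: edges only guarantee that neighbours are close, so a chain a-b-c can link two sequences that are far apart. The three toy sequences below are made up for illustration. Single-linkage clustering cut at the cutoff is used here as a base R stand-in for the connected components of the thresholded graph, which is an assumption about how the graph was built rather than the code from this post.

```r
# Three made-up sequences forming a chain: a is close to b, b is close to c
seqs <- c(a = "0000000000000", b = "1100000000000", c = "1111000000000")

d <- adist(seqs)  # base R Levenshtein distance
dimnames(d) <- list(names(seqs), names(seqs))
print(d)

# Single linkage cut at height 3 groups everything reachable through
# steps of distance <= 3, like connected components of the edge list
components <- cutree(hclust(as.dist(d), method = "single"), h = 3)
print(components)

# Largest pairwise distance inside the single resulting group
print(max(d[components == 1, components == 1]))
```

Here a and c land in the same group even though they are not within the cutoff of each other, because both are within the cutoff of b.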
1) Is there a way to cluster the sequences so that, in each cluster, all members are within edit distance < 3 of each other?
2) Is there a way to identify the cluster centers (hubs; I tried hubness.score()) and assign vertices to those centers while taking the edit distance into account?
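One possible direction for both questions, sketched in base R under stated assumptions: complete-linkage hierarchical clustering merges two groups at a height equal to their largest pairwise distance, so cutting the tree at the cutoff bounds every within-cluster distance by that cutoff. A medoid (the member with the smallest total distance to the rest of its cluster) is used as a stand-in for a cluster "center". The toy sequences and the names max_dist, diameters and medoids are hypothetical, and adist() replaces stringdistmatrix() only to keep the example free of extra packages.

```r
# Toy sequences (made up for illustration, not the real data)
seqs <- c("0000000000000", "1100000000000", "1111000000000",
          "3333333333333", "3333333333322")

d <- adist(seqs)  # base R Levenshtein distance
dimnames(d) <- list(seqs, seqs)

max_dist <- 3  # assumed cutoff; use 2 if the distance must be strictly below 3

# Complete linkage: merge height = largest pairwise distance in the merged group
hc <- hclust(as.dist(d), method = "complete")
cluster <- cutree(hc, h = max_dist)

members <- split(seq_along(seqs), cluster)

# Largest pairwise distance inside each cluster (should not exceed max_dist)
diameters <- sapply(members, function(i) max(d[i, i]))

# Medoid of each cluster: member with the smallest summed distance to the others
medoids <- sapply(members, function(i) {
  seqs[i[which.min(rowSums(d[i, i, drop = FALSE]))]]
})

print(cluster)
print(diameters)
print(medoids)
```

This trades larger, looser communities for more, tighter clusters. For 10,000 sequences the full distance matrix has 10^8 entries, so memory use is worth checking before running it on the real data.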
This is my first post; I would appreciate any help.