Cluster sequences in a network by their editing distance - in R
4 months ago

I have a data frame my_df with 10,000 distinct CDR3 sequences of varying length (13 to 18 characters). Each sequence is composed of the digits 0-3.

Example of my data:

alfa

20000003331001

200000303323331021

200000100331021

...

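My goal is to cluster them by edit distance < 3. First I computed the pairwise distance matrix:

library(stringdist)
# Pairwise Levenshtein distances between all sequences
dist_mtx <- stringdistmatrix(my_df$alfa, my_df$alfa, method = "lv")
colnames(dist_mtx) <- my_df$alfa
rownames(dist_mtx) <- my_df$alfa
# Drop pairs that are too far apart, and zero distances (self-pairs)
dist_mtx[dist_mtx > 3] <- NA
dist_mtx[dist_mtx == 0] <- NA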

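Then I created an edge list, where value is the edit distance between the two sequences:

library(reshape2)
# Long format, one row per (seq1, seq2) pair; na.rm = TRUE already drops the NA entries.
# Each pair still appears twice, as (A, B) and (B, A), because the matrix is symmetric.
edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c("seq1", "seq2"), as.is = TRUE))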
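A minimal sketch of building the same kind of edge list with each unordered pair listed only once, which avoids duplicate edges in an undirected graph. The sequences are made-up toy data, base R adist() stands in for stringdistmatrix(..., method = "lv"), and the names threshold and edge_list_toy are hypothetical. The cutoff of distance <= 3 is an assumption taken from the filtering code above. Unlike that code, this sketch would keep zero distances between duplicated sequences, so it assumes all sequences are distinct.

```r
# Toy sequences (made up for illustration; not taken from my_df)
seqs <- c("0000000000000", "0001110000000", "0001112220000", "3333333333333")

# Base R Levenshtein distances; stands in for stringdistmatrix(..., method = "lv")
d <- adist(seqs)

# Assumed cutoff: keep pairs with distance <= 3
threshold <- 3

# Look only above the diagonal, so each unordered pair appears once
# and self-pairs on the diagonal are skipped
keep <- which(upper.tri(d) & d <= threshold, arr.ind = TRUE)

edge_list_toy <- data.frame(
  seq1  = seqs[keep[, "row"]],
  seq2  = seqs[keep[, "col"]],
  value = d[keep],
  stringsAsFactors = FALSE
)
```

A data frame in this shape can be passed to igraph::graph_from_data_frame() in the same way as the melted edge list.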

Then I created the igraph object:

# vertices must be a data frame whose first column holds the vertex names
igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = data.frame(name = my_df$alfa))

Then I tried several methods to cluster those sequences, including the Louvain method, and I am still getting clusters whose members have an edit distance > 3 from each other. I am aware that this might be because of the connected components. So my questions are:

1) Is there a way to cluster the sequences so that within each cluster all members are at an edit distance < 3 from each other?

2) Is there a way to identify the cluster centers (hubs) and assign vertices to those centers while taking the edit distance into account? I tried hubness.score().

This is my first post; I would appreciate any help.
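To make the connected-components issue concrete, here is a minimal sketch on made-up toy sequences. It uses base R adist() in place of stringdistmatrix(..., method = "lv") and a cutoff of distance <= 3, both of which are assumptions. Sequences A, B and C form a chain: A-B and B-C are within the cutoff, but A-C is not, so any method that follows graph connectivity can put A and C together. Single-linkage clustering cut at the cutoff reproduces those connected components. Complete-linkage clustering cut at the same height is one possible way (not the only one) to keep every within-cluster pair inside the cutoff. The medoid, meaning the member with the smallest total distance to the rest of its cluster, is used here as one possible definition of a cluster center. The names diameter, comp, clus and medoids are hypothetical.

```r
# Toy sequences (made up for illustration; not taken from my_df)
seqs <- c(A = "0000000000000",
          B = "0001110000000",
          C = "0001112220000",
          D = "3333333333333",
          E = "3333333333233")

# Base R Levenshtein distances; stands in for stringdistmatrix(..., method = "lv")
d <- adist(seqs)
dimnames(d) <- list(names(seqs), names(seqs))

# Assumed cutoff: pairs with distance <= 3 count as "close"
threshold <- 3

# Single linkage cut at the cutoff gives the connected components of the
# graph whose edges are the pairs with distance <= threshold
comp <- cutree(hclust(as.dist(d), method = "single"), h = threshold)

# Complete linkage cut at the same height: a merge only happens when the
# largest pairwise distance in the merged cluster is <= threshold
clus <- cutree(hclust(as.dist(d), method = "complete"), h = threshold)

# Largest pairwise distance inside each group
diameter <- function(groups) {
  sapply(split(names(groups), groups), function(m) max(d[m, m]))
}
comp_diam <- diameter(comp)  # can exceed the cutoff because of chaining
clus_diam <- diameter(clus)  # stays within the cutoff

# One candidate "center" per complete-linkage cluster: the medoid
medoids <- sapply(split(names(clus), clus), function(m) {
  m[which.min(rowSums(d[m, m, drop = FALSE]))]
})

print(comp_diam)
print(clus_diam)
print(medoids)
```

On real data the same calls would take as.dist(dist_mtx) computed before any entries are set to NA, since hclust() does not accept missing distances. The trade-off is that complete linkage tends to split chains into more, smaller clusters than a connectivity-based method would.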

R network analysis clustering editing distance • 183 views