I want to cluster a group of genes using the dist and hclust functions in R. The data matrix contains raw counts. I know the gene symbols for the genes I want to cluster, but because the matrix rows use Ensembl IDs, I used BioMart to translate the gene symbols into Ensembl IDs. At that point I realized that several gene symbols are duplicated, each mapping to more than one Ensembl ID. I checked the duplicated genes one by one and decided to remove the alternative sequence entries, so I now have a gene list of Ensembl IDs with no duplicates and no alternative sequence genes.

The problem appeared when I compared the hierarchical clustering results for the list before and after removing the alternative sequence genes. The results are so different that they would probably change the interpretation of my analysis.

My question is: should I filter out the alternative sequence genes or not? When I check the Entrez IDs, only one Entrez ID corresponds to each gene symbol, so the problem seems to be specific to the Ensembl IDs. Thank you all.
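
For reference, here is a minimal sketch of the kind of before/after comparison described above. It uses simulated counts instead of my real matrix, and the row IDs (`ENSG_primary_*`, `ENSG_alt_*`), sample names, and number of clusters `k` are made up for illustration only. It relies on the default settings of dist (Euclidean distance) and hclust (complete linkage), which may not match every setup.

```r
set.seed(1)

# Hypothetical IDs: the last two stand in for alternative sequence entries
ids <- c(paste0("ENSG_primary_", 1:6), paste0("ENSG_alt_", 1:2))

# Simulated raw count matrix: genes in rows, samples in columns
counts <- matrix(rpois(length(ids) * 5, lambda = 50),
                 nrow = length(ids),
                 dimnames = list(ids, paste0("sample", 1:5)))

# Gene list after removing the alternative sequence entries
alt_ids <- grep("alt", ids, value = TRUE)
keep <- setdiff(ids, alt_ids)

# Cluster the full list and the filtered list separately
hc_all <- hclust(dist(counts))
hc_filtered <- hclust(dist(counts[keep, ]))

# Cut both trees into k groups and compare membership on the shared genes
k <- 2  # assumed number of clusters, chosen only for this example
cl_all <- cutree(hc_all, k = k)[keep]
cl_filtered <- cutree(hc_filtered, k = k)

# Cross-tabulate cluster assignments before and after filtering
print(table(all = cl_all, filtered = cl_filtered))
```

If most genes fall into a single cell per row of the table, the two clusterings agree on the shared genes even though the cluster labels themselves may be swapped.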