I am trying to apply different clustering algorithms to sequence data with lengths of about 240-260. The sequences mainly come from next-generation sequencing technology. For any clustering (OTU) method, we need some notion of distance between objects to assess how far apart or close together the sequences are. So far I have been using the edit distance for this purpose, but now I want to use a more biologically relevant distance.
I have found some measures such as the Kimura distance, Gamma distance, ..., but I don't know which of these distances would fit the type of data that I have. Is the Kimura distance applicable to 16S rRNA sequences? Do you have any other suggestions, or papers that review the application of these evolutionary distance measures to RNA/DNA fragments (not the whole genome)?
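To make the question concrete, here is a minimal sketch of the Kimura two-parameter (K80) distance as I understand it. It assumes the two sequences have already been aligned to the same length, which is not true of my raw reads. The function name `k2p_distance` and the choice to skip gaps and ambiguous bases are my own assumptions, not taken from any particular package.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two pre-aligned sequences.

    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)
    where P is the proportion of transitions and Q the proportion of
    transversions among the compared sites.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    # Treat RNA and DNA alike by mapping U to T.
    a_norm = seq_a.upper().replace("U", "T")
    b_norm = seq_b.upper().replace("U", "T")

    transitions = transversions = compared = 0
    for a, b in zip(a_norm, b_norm):
        # Assumption: sites with gaps or ambiguous bases are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    x = 1 - 2 * p - q
    y = 1 - 2 * q
    if x <= 0 or y <= 0:
        # The correction is undefined for very divergent pairs.
        return math.inf
    return -0.5 * math.log(x) - 0.25 * math.log(y)
```

Calling `k2p_distance` on every pair of aligned sequences would give a distance matrix that could replace the edit-distance matrix I am feeding to the clustering step.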
Below, I have provided some detailed information about the dataset that I am studying:
One example sequence in the dataset:
Here are the taxonomic levels for 10 sequences that I have:
Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Bacteria Firmicutes Clostridia Clostridiales unclassified unclassified
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Rikenellaceae Alistipes
Bacteria unclassified unclassified unclassified unclassified unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae unclassified
Bacteria unclassified unclassified unclassified unclassified unclassified