I am learning Python scripting and am using Python 3.6. I have been working with DNA/protein sequence files in three formats: PHYLIP (.phy), Clustal (.aln), and FASTA (.fas). I want to cluster the sequences step by step, counting the number of changes between them, so that at each step whichever sequence has the fewest changes (i.e., the most similar one) is clustered next. In the end I want to produce a tree-like representation or Newick-format output.
What I need to know is which strategy I should use to cluster the sequences. Should I use a similarity matrix? If so, what formula is it based on for nucleotide or protein sequences? And if I do not generate a distance matrix as in the neighbor-joining method, what would be the simplest strategy for clustering two sequences together?
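
To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the sequences are already aligned to the same length and have been read into a dict of name to sequence; the function names `p_distance` and `greedy_cluster_to_newick` and the toy sequences are placeholders I made up, not part of any library. It measures the fraction of differing positions between two sequences (the p-distance) and repeatedly joins the two closest clusters using average linkage, which is roughly UPGMA without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster_to_newick(seqs):
    """Merge the two closest clusters until one remains; return a Newick string.

    seqs: dict mapping sequence name -> aligned sequence (assumed input shape).
    """
    # Each cluster is keyed by its Newick label and holds its member names.
    clusters = {name: [name] for name in seqs}

    # Pairwise p-distances between the original sequences, computed once.
    dist = {
        frozenset((m, n)): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }

    def cluster_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster member pairs.
        pairs = [(m, n) for m in clusters[c1] for n in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merged = "({},{})".format(a, b)
        clusters[merged] = clusters.pop(a) + clusters.pop(b)

    return next(iter(clusters)) + ";"


# Toy aligned sequences, invented purely for illustration.
example = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGTTCGA",
    "D": "TTGTTCGG",
}
print(greedy_cluster_to_newick(example))
```

For this toy input, A is grouped with B and C with D before the two pairs are joined. Is something like this a reasonable starting point, or is there a better-suited measure than the raw fraction of mismatches?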