I have short FASTA files (~400nt) of differing lengths. I would like to compare/align them and get a similarity metric. I am not interested in the exact alignment or alignment output. I just need one value that shows how similar they are. Perhaps a distance like hamming distance would work. I want to do this pairwise with loads of sequences. Can anyone suggest a bash command line tool?
Comparing pairwise similarity measures between ~3000 fasta pairs in the size range 0-10000 nt.
blast_identity: Percentage identity as returned by BLASTN
blastn -query query.fa -subject ref.fa -task blastn -outfmt "6 pident nident"
nw_identity: Percentage identity as returned by needle function in emboss
nw_similarity: Percentage similarity as returned by needle function in emboss
needle ref.fa query.fa -gapopen 10 -gapextend 0.5 -outfile needle_temp.txt
mash_dist: Distance returned by mash (-k 12)
mash dist -k 12 ref.fa query.fa
As mentioned in the comments, needleman-wunsch global alignment is probably the best metric for this purpose. Thank you all for the insightful comments!
A Hamming distance would have been my suggestion, though that only works for strings of identical length.
You might be better off using the Levenshtein Distance. There are lots of python implementations of it around the web. As b.nota mentioned, best to align them first, regardless (in which case you could just use the alignment score as your metric).
Edit, here's a snipped from one of my own scripts which can calculate Hamming distances:
def hamming_distance(s1, s2): """Return the Hamming distance between equal-length sequences""" if len(s1) != len(s2): raise ValueError("Undefined for sequences of unequal length") return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
And a whole wiki page with algo implementations: