Hi, I have a FASTA file with hundreds of sequences, each 300 nt long.
I would like to check how similar one sequence is to all the other sequences.
Any suggestions on how to approach this?
Are you starting from a multiple sequence alignment?
You might look into building a distance matrix (e.g. as done for phylogenetic studies). The 'distance' is often not expressed as % similarity, but it still gives you a measure of how similar the sequences are.
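As a rough illustration of the distance-matrix idea, here is a minimal Python sketch. It assumes the sequences are already aligned to the same length and uses the plain p-distance (the fraction of positions that differ), which is only one of many possible distance measures. The function names and the toy sequences are made up for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Build a symmetric name -> name -> distance table from a dict of sequences."""
    names = list(seqs)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = p_distance(seqs[n], seqs[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Toy aligned sequences (hypothetical, for illustration only)
aligned = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "TCGTACGA"}
dm = distance_matrix(aligned)
```

With this measure, % identity is simply `100 * (1 - dm[a][b])`, so one row of the matrix gives one sequence against all the others.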
Use a clustering/search tool, e.g. CD-HIT or VSEARCH.
Another tool that can be used, and is much faster, is MMseqs2.
If your sequences are already aligned and/or the same length, or you do not want to align them, you can use a simple edit distance measure such as the Levenshtein distance, or a k-mer-based method.
This will be quick but less accurate, and it won't necessarily capture biologically meaningful patterns, but depending on your use case it may be appropriate.
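To make the edit-distance approach concrete, here is a small self-contained Python sketch that compares one sequence against all the others using the Levenshtein distance. It is only an illustration: the FASTA reader is deliberately minimal, the function names are made up, and turning the distance into a percentage by dividing by the longer sequence length is one convention among several.

```python
import io


def read_fasta(handle):
    """Minimal FASTA reader: returns {name: sequence}, name = first word of header."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


def similarity_to_all(query_name, seqs):
    """Percent similarity of one sequence against every other sequence."""
    query = seqs[query_name]
    results = {}
    for name, seq in seqs.items():
        if name == query_name:
            continue
        longest = max(len(query), len(seq))
        dist = levenshtein(query, seq)
        results[name] = 100.0 * (1 - dist / longest) if longest else 100.0
    return results


# Toy input (hypothetical); with a real file use: open("my_sequences.fasta")
example = io.StringIO(">q\nACGTACGT\n>a\nACGTACGA\n>b\nACGACGT\n")
seqs = read_fasta(example)
scores = similarity_to_all("q", seqs)
```

For a few hundred 300 nt sequences, a pure-Python loop like this should be fast enough; for much larger sets, a compiled implementation or one of the dedicated tools mentioned in the other answers is the better choice.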
I keep a few examples of string comparison metrics along with some implementation code here: