Question: Similarity between two FASTA files
0
gravatar for rmf
20 months ago by
rmf910
rmf910 wrote:

I have short FASTA files (~400nt) of differing lengths. I would like to compare/align them and get a similarity metric. I am not interested in the exact alignment or alignment output. I just need one value that shows how similar they are. Perhaps a distance like hamming distance would work. I want to do this pairwise with loads of sequences. Can anyone suggest a bash command line tool?

distance similarity • 1.5k views
ADD COMMENTlink modified 9 months ago by Biostar ♦♦ 20 • written 20 months ago by rmf910
1

I haven't tried all the suggestions here, but I came across this BLAST usage:

blastn -query "query.fa" -subject "ref.fa" -task "blastn" -outfmt 6 >> blast_results.txt

And this returns

query     ref    per_ident aln_len mismatch gapopen qstart  qend sstart send evalue   bitscore
query.fa  ref.fa 74.206    252     46       3       1       248  1      237  9.65e-46 165

of which percentage identity (74%) seems to be a usable measure of similarity. I am not sure if this is the best metric for global sequence-to-sequence similarity.

ADD REPLYlink modified 20 months ago • written 20 months ago by rmf910
1

It’s definitively not a measure of global sequence similarity.

The L in BLAST stands for “local”.

This is broadly the approach a few of us have suggested (use alignment and take an alignment score of choice as your metric), but global alignments are probably better suited to your question (though it would be worth you showing us some examples of your sequences).

ADD REPLYlink modified 20 months ago • written 20 months ago by Joe16k

Ah yes! Good point. But, does local matter for short sequences of few hundred nucleotides?

ADD REPLYlink modified 20 months ago • written 20 months ago by rmf910
1

It might matter, it might not - “short” is all relative. 400bps might seem short, but its more than long enough to carry an entire promoter sequence and various other untranslated features. How you align these would matter if you were interested in such structures.

It sort of depends what you’re trying to show. What your BLAST results are telling you is that, of the amount of sequence it could locally align (in your case 252 nucleotides of your 400ish), they are ~74% identical.

Now, what this tells me is that only a little over half of your sequence is similar to the other. It says almost nothing about the remaining 150 bases that haven’t been aligned.

If you were only interested in a domain or structure or something that fell within that 252, then you could carry on, but otherwise you are ignoring almost a third of the sequence if you take that metric.

ADD REPLYlink modified 20 months ago • written 20 months ago by Joe16k

I am comparing human X and Y chromosomes. Y is rearranged and jumbled up. But I have already identified short regions in X that correspond with Y. Now I just need to figure out how similar they are. Here is an example

>chrX:153598967-153599086
TCCATGATGAACCTCGCCACTCGGAAGTGCACCCTGGCTTCGGGCGCCGGCTGCCCCTCC
GGGCCCCGCGCTGCAGGCCCCCCTCCGCCGCCCTCCGGGGCTTCCATGAGGCGCCGCGGC
>chrY:26627155-26627274
TCCATGATGAACCTCGCCGCTTGGAAGTGCACCCTGGATTCGGGCGCCGGCTGCCTCTCT
GGGCACCGCGCTGCGGACCCCACTCCGCCGCCCTCCCAGGCTTCCATGAAGCTCCGGGGC

blastn returns this as 88.3% percentage identity. The lengths are same in this particular example but lengths can vary a bit.

ADD REPLYlink modified 20 months ago • written 20 months ago by rmf910
2

@rmf, BLAST isn’t telling you those 2 sequences are 88% identical.

As my comment above, its telling you how similar the maximally aligning subsequence it could find between the 2 sequences are.

Now, I don’t know if your BLAST results did manage to align the full length of those sequences, but I doubt it.

BLAST is not really the tool you want for this task.

ADD REPLYlink modified 20 months ago • written 20 months ago by Joe16k
2

blast is definitively not the right tool for pair wise alignment, it was design to quickly scan thru a db. For proper alignment you'll have to use Smith-Waterman for local alignment or Needleman-Wunsch for global. Both are in the EMBOSS linux tool.

ADD REPLYlink written 20 months ago by Benn7.9k
1

In general this is indeed an OK approach.

You just need to keep in mind that it only reports something in relation to the part that can be aligned! so it might report 80% but only align 50% of the sequence (which is then does clearly something different than on 80% overall similarity)

ADD REPLYlink written 20 months ago by lieven.sterck7.2k

Are the short regions you already identified of somewhat similar length? (taking your original question in mind it seems not)

ADD REPLYlink written 20 months ago by lieven.sterck7.2k

They are comparable in lengths. The length difference range from 0 to 83 nt.

ADD REPLYlink written 20 months ago by rmf910
2

In that case I might consider using a global aligner , such as NW from EMBOSS . For a global aligner the rule is that they have to be of similar size not strictly of equal size

ADD REPLYlink written 20 months ago by lieven.sterck7.2k

I just need one value that shows how similar they are

An alignment would give you similarity score which you may be looking for

ADD REPLYlink written 20 months ago by lakhujanivijay4.8k
1

It's a simple question/request but it might turn out more complex than you hoped for I'm afraid.

Eg. a 50bp seq is 100% part of a 100bp sequence : what would be the expected similarity score here? 1 or 0.5?

It's indeed the

of differing lengths

that makes this a lot more complicated.

That's why most methods/approaches will make the 'equal length' assumption

ADD REPLYlink modified 20 months ago • written 20 months ago by lieven.sterck7.2k

Just to clarify, are you also interested in their global similarity, or regions of local similarity? (It sounds like the former at the moment)?

ADD REPLYlink written 20 months ago by Joe16k
6
gravatar for Benn
20 months ago by
Benn7.9k
Netherlands
Benn7.9k wrote:

To get similarity you'll need alignment first, even if you are not interested in it. You can for instance use the identity, similarity, or other scores. Try EMBOSS on linux, it has many alignment options.

ADD COMMENTlink modified 20 months ago • written 20 months ago by Benn7.9k
2

@ rmf, the above answer is the one you need.

Download the EMBOSS commandline tools from here: http://emboss.sourceforge.net/download/

intall, and run:

needle -asequence <fasta1.fa> -bsequence <fasta2.fa> -gapopen 10 -gapextend 0.5 -o <outfile.needle>
ADD REPLYlink modified 20 months ago • written 20 months ago by Joe16k
1

I'm with b.nota on this one. I don't think you can bypass the alignment step here, unless you're totally not interested in the "biological" interpretation, but even then ....

FastA might also be worth considering

ADD REPLYlink written 20 months ago by lieven.sterck7.2k
3
gravatar for rmf
20 months ago by
rmf910
rmf910 wrote:

Comparing pairwise similarity measures between ~3000 fasta pairs in the size range 0-10000 nt.

blast_identity: Percentage identity as returned by BLASTN

blastn -query query.fa -subject ref.fa -task blastn -outfmt "6 pident nident"

nw_identity: Percentage identity as returned by needle function in emboss
nw_similarity: Percentage similarity as returned by needle function in emboss

needle ref.fa query.fa -gapopen 10 -gapextend 0.5 -outfile needle_temp.txt

mash_dist: Distance returned by mash (-k 12)

mash dist -k 12 ref.fa query.fa

enter image description here

As mentioned in the comments, needleman-wunsch global alignment is probably the best metric for this purpose. Thank you all for the insightful comments!

ADD COMMENTlink modified 20 months ago • written 20 months ago by rmf910

rmf : You should accept @5heikki's mash answer as well since you made use of that in your final solution.

ADD REPLYlink written 20 months ago by genomax80k

@genomax I used emboss needle as the final solution which is why I accepted that as the answer.

ADD REPLYlink written 20 months ago by rmf910

You can accept more than one answers.

ADD REPLYlink written 20 months ago by genomax80k

Oh. I wasn't aware of that. Good to know. Thx!

ADD REPLYlink written 20 months ago by rmf910

You can also accept your own answer since it brings together all elements in one.

ADD REPLYlink written 20 months ago by genomax80k

Kudos for thoroughly doing the comparisons and reporting back!

I'm a little surprised that the BLAST distributions look the same as the NW!

ADD REPLYlink written 20 months ago by Joe16k
1

Probably because there are very little gaps in alignments. Also I suppose the mash distance could be improved by setting a better value for the kmer length. I used 12 arbitrarily.

ADD REPLYlink written 20 months ago by rmf910

Yeah that would be my guess too, if the sequences were less similar you'd start to see a bigger gulf appearing between local and global probably.

ADD REPLYlink written 20 months ago by Joe16k
2
gravatar for 5heikki
20 months ago by
5heikki8.6k
Finland
5heikki8.6k wrote:
mash dist file1.fna file2.fna

I recommend smaller than default k-mer for small files, e.g. 17

Mash

ADD COMMENTlink modified 20 months ago • written 20 months ago by 5heikki8.6k

I thought about mash too, wasn't sure how well it works for very small seqs - but this is good to know!

ADD REPLYlink written 20 months ago by Joe16k

Well I don't know how well it works with very small files, maybe if setting k-mer length even shorter..

ADD REPLYlink written 20 months ago by 5heikki8.6k
2
gravatar for Joe
20 months ago by
Joe16k
United Kingdom
Joe16k wrote:

A Hamming distance would have been my suggestion, though that only works for strings of identical length.

You might be better off using the Levenshtein Distance. There are lots of python implementations of it around the web. As b.nota mentioned, best to align them first, regardless (in which case you could just use the alignment score as your metric).

Edit, here's a snipped from one of my own scripts which can calculate Hamming distances:

def hamming_distance(s1, s2):
     """Return the Hamming distance between equal-length sequences"""
     if len(s1) != len(s2):
         raise ValueError("Undefined for sequences of unequal length")
     return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

And a whole wiki page with algo implementations:

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python

ADD COMMENTlink modified 20 months ago • written 20 months ago by Joe16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1186 users visited in the last hour