python: find percentage of match between two sequences
2
1
Entering edit mode
5.3 years ago
Am.A ▴ 20

Hi all,

I want to find percentage of match between two sequences. I try to use get_matching_blocks(), but It does not work for large sequence, for example 1000 characters

x1="A................"
y1="T................"
s = SequenceMatcher(None, x1, y1)

result = s.get_matching_blocks()


is there any alternative method?

sequence • 11k views
3
Entering edit mode
5.3 years ago
Medhat 9.0k

you could use

needle from biopython to calculate similarity like this

def needle_align_code(query_seq, target_seq):
needle_cline = NeedleCommandline(asequence="asis:" + query_seq,
bsequence="asis:" + target_seq,
aformat="simple",
gapopen=10,
gapextend=0.5,
outfile='stdout'
)
out_data, err = needle_cline()
out_split = out_data.split("\n")
p = re.compile("$$(.*)$$")
return p.search(out_split[25]).group(1).replace("%", "")


also you could calculate the hamming distance "if they are the same length" like

   from itertools import izip
def hamming_distance(str1, str2):
assert len(str1) == len(str2)
return sum(chr1 != chr2 for chr1, chr2 in izip(str1, str2))


Or use levenshtein distance

more implementation here

Good luck :)

1
Entering edit mode

The Distance library for Python also has an implementation of hamming distance and some other metrics which I've found useful in the past.

0
Entering edit mode

I think even it could be better than some of my suggestions

1
Entering edit mode

The 'problem' is that needle is not a Biopython library, Biopython supports commandline access to an existing EMBOSS installation (which contains needle). Thus you need to install EMBOSS beforehand.

0
Entering edit mode

dear medhat thank you for your response. does needle from biopython work for 2 sequences of different length? (my query sequences around 400 nucleotide and the target sequence is 1500 nucleotide)

0
Entering edit mode

I think in this case you need local alignment not global, here is the code:

def water_align(query_seq, target_seq):
needle_cline = WaterCommandline(asequence="asis:" + query_seq,
bsequence="asis:" + target_seq,
aformat="simple",
gapopen=10,
gapextend=0.5,
outfile='stdout'
)
out_data, err = needle_cline()
out_split = out_data.split("\n")
p = re.compile("$$(.*)$$")
return p.search(out_split[25]).group(1).replace("%", "")

0
Entering edit mode

Is it the same if I use blast with local database?

0
Entering edit mode

in blast you are aligning the query sequence to all the sequence in a db but here as you can see we are comparing only two sequence

0
Entering edit mode

This is a great answer. Could you mention the import command that needs to go at the beginning of this code? I think we would need to import Biopython and Needle?

1
Entering edit mode
from Bio.Emboss.Applications import NeedleCommandline
from Bio.Emboss.Applications import WaterCommandline


For more please visit https://homolog.us/Biopython/api-reference.html

3
Entering edit mode
5.3 years ago
Markus ▴ 320

Percentage of match between two sequences is possible not what you want, you are very likely interested in percentage of macht between two aligned sequences. That opens the question how to align your sequences (which algorithm to use). And consider this case:

seq_1 : ATGGATCATTGA
seq_2:  ------CATTGA


Is seq_2 100% identical to seq_1? Would you say the same the other way round?

However, here is an example that shows how you may use Biopython to address your problem (of course you need to have Python and Biopython installed):

from __future__ import print_function
from Bio import pairwise2 as pw2

first_seq = 'ATGGATCATTGA'
second_seq = 'CATTGA'

global_align = pw2.align.globalxx(first_seq, second_seq)

print(global_align[0])


The output is:

('ATGGATCATTGA', '------CATTGA', 6.0, 0, 12)


pw2.align2.globalxx makes an optimal global alignment between your two sequences, where every match counts 1 point , while mismatches and insertions/deletions cost nothing.

print(global_align[0]) returns the first alignment (there may be several different alignments with the same score). The third list element (the first number) is the number of matching residues, the other two numbers are the beginning and the end of the alignment. So the alignment has a length of 12 (which, in this case, is also the length of the longer sequences, but usually the alignment will be longer than each of the two sequences), there are 6 matching residues, the shorter sequence has a length of 6. So you may calculate the percentage as follows:

seq_length = min(len(first_seq), len(second_seq))
matches = global_align[0][2]

percent_match = (matches / seq_length) * 100