Getting Pairwise Sequence Alignment Score With Biopython
2
0
Entering edit mode
11.8 years ago
Lakshmi • 0

Hello,

I used the following code to run clustalw.I got an alignment file using this code. But I need to get pairwise sequence alignment score and also has to get distance matrix based on sequence identity.My aim is to do hierarchical clustering. How can I do these things with biopython.

import os
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\Program Files\clustalw2\clustalw2.exe"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="C:\Users\pp\Desktop\myseqs.fasta")
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
stdout, stderr = clustalw_cline()
from Bio import AlignIO
print align

biopython alignment clustalw • 4.6k views
0
Entering edit mode

So you basically want to build a phylogenetic tree?

0
Entering edit mode

Thanks for your comment. Yes I would like to create a dendrogram.

1
Entering edit mode
11.8 years ago

There are already plenty of software you can use to make a tree. You can try:

• [?]EBI's ClustalW phylogeny tool[?]
• [?]Mr Bayes[?]
• [?]PHYLIP[?]
• [?]A huge list of other phylogenetic tree generation software[?]

People usually tell me to us Mr Bayes.

0
Entering edit mode

1
Entering edit mode
4 months ago

Maybe quite late, but in case other people are looking for this. Pairwise sequence identity can be easily calculated using basic python functionality and numpy

    #load your alignment as biopython object

#get the number of sequences in the alignment
Nseq = len(alignment)

# Create a 2D NumPy array to store the PID values
seq_id_array = numpy.zeros((Nseq, Nseq))

# Iterate through each pair of sequences in the alignment
for i in range(Nseq):
for j in range(i+1, Nseq):
seq1 = alignment_refined_1[i]
seq2 = alignment_refined_1[j]

# Calculate the number of identical positions in the two sequences, ignoring positions with a gap
identical_residues = sum([1 for x, y in zip(seq1, seq2) if x == y and x != "-"])

# Calculate the PID, ignoring positions with a gap
pid = identical_residues / sum([1 for x in seq1 if x != "-"])

# Store the PID value in the NumPy array
seq_id_array[i, j] = pid
seq_id_array[j, i] = pid


This will give you a numpy array that can then be used to do any kind of operation, as e.g. clustering.

Hope it helps.

Best,

Jonathan

0
Entering edit mode

Thanks for sharing ! Just these days I am trying to wrap my mind on protein alignments. I wonder: 1) "PID" in your post is "percentage of identical matches" ? Is it the same as here: What is "pident" (percentage of identical matches) in the "Diamond" protein alignment ? 2) If you would you be so kind to update your example to show how the data in file 'alignment_refined_1.fasta' can be obtained by BioPython alignments, that would be very helpful ?

1
Entering edit mode

Dear Alexander,

to answer 2) first: This is just a fasta file that you can obtain essentially from every multiple sequence alignment software. I usually use MAFFT (https://mafft.cbrc.jp/alignment/software/), for which also a command line wrapper has been build within biopython (https://biopython.org/docs/1.76/api/Bio.Align.Applications.html). However, Clustal omega or muscle will also give you fasta files, if specified.

Regarding 1) In my example, i am looking at the total number of identical amino acids, apart from gapped positions, in the same position between two aligned sequences. This total number is then divided by the total number of sequence positions in the alignment. So with PID I mean pairwise identity. Sorry for the confusion.