Question

Multiple Sequence Alignment In Biopython.

13.7 years ago

Mkl ▴ 100

Hi all,

1) How can I compare ASTRAL SCOP genetic domain sequences based on PDB SEQRES records with those based on PDB ATOM records? My aim is to identify the missing residues by comparing these records. I know that I have to align these sequences. How can I do the alignment, and how can I find the missing residues, using Biopython?

2) ClustalW takes a group of sequences and performs all pairwise alignments. It then calculates a similarity matrix, which it analyzes to see how distantly related the groups of sequences are. How can I perform these steps (pairwise sequence alignment, distance matrix, hierarchical clustering, dendrogram) in Biopython?

biopython • 11k views

updated 13.7 years ago by Jan Kosinski ★ 1.6k • written 13.7 years ago by Mkl ▴ 100

score 3 · Answer 1 · 2011-10-26

Biopython provides I/O and handling for sequences and alignments, but not the multiple sequence alignment algorithms themselves. Therefore, you have to call an external program, e.g. ClustalW.

A possible workflow would be:

1) Use Biopython to read the FASTA sequences from SCOP.
2) Use Biopython to read the PDB SEQRES/ATOM sequences (described here).
3) Align them using ClustalW.
4) Search for gaps in the Alignment object.

All of these steps are described in the Biopython Tutorial and Cookbook, which I highly recommend reading.
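
The last step of the workflow above (searching for gaps) can be sketched in plain Python. This is only a minimal illustration, not Biopython's own API: the function name `missing_residues` is made up here, and it assumes the two aligned rows are already available as equal-length strings with "-" as the gap character. With Biopython, such rows could presumably be obtained via `str(record.seq)` for each record of an alignment loaded with `Bio.AlignIO.read(path, "clustal")`.

```python
def missing_residues(seqres_aligned, atom_aligned, gap="-"):
    """Return (position, residue) pairs present in SEQRES but gapped in ATOM.

    Positions are 1-based indices into the ungapped SEQRES sequence.
    Both arguments are rows of the same pairwise alignment.
    """
    if len(seqres_aligned) != len(atom_aligned):
        raise ValueError("aligned rows must have equal length")
    missing = []
    seqres_pos = 0
    for s, a in zip(seqres_aligned, atom_aligned):
        if s == gap:
            continue  # column with no SEQRES residue; nothing to report
        seqres_pos += 1
        if a == gap:
            missing.append((seqres_pos, s))
    return missing


# Toy rows for illustration only.
print(missing_residues("GNIKANR", "GN----R"))
# [(3, 'I'), (4, 'K'), (5, 'A'), (6, 'N')]
```

Note that the result is only as reliable as the alignment it is computed from.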

score 3 · Answer 2 · 2012-01-26

Using ClustalW for this particular task (aligning SEQRES and ATOM sequences in order to find missing residues in a PDB structure) is the wrong approach.

To map SEQRES to ATOM sequences, you should rely on the information PDB provides in mmCIF format; aligning them with ClustalW or any other program does not guarantee a correct mapping. For example, in this alignment:

GNIKANR
GN----R

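the mapping of the asparagine (N) flanking the missing residues is ambiguous: the N in the second sequence could equally correspond to position 2 or position 6 of the first, and the alignment alone cannot tell which.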
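
To make the ambiguity concrete, here is a small brute-force sketch that lists every way the shorter sequence can be placed onto the longer one as an ordered subsequence. The function name `subsequence_mappings` is hypothetical, and the approach assumes exact residue matches and very short sequences; it is meant to illustrate the problem, not to serve as a mapping method.

```python
from itertools import combinations


def subsequence_mappings(seqres, atom):
    """Yield every way to place `atom` in `seqres` as an ordered subsequence.

    Each mapping is a tuple of 1-based SEQRES positions, one per ATOM residue.
    """
    for positions in combinations(range(len(seqres)), len(atom)):
        if all(seqres[p] == a for p, a in zip(positions, atom)):
            yield tuple(p + 1 for p in positions)


mappings = list(subsequence_mappings("GNIKANR", "GNR"))
print(mappings)
# [(1, 2, 7), (1, 6, 7)]
```

Two equally valid mappings come out, one placing N at position 2 and one at position 6, so sequence information alone cannot decide between them.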

In another example, my ClustalX with default options (Gonnet250, gap open: 10, extend: 0.1) reports this alignment as optimal:

ANIKANR
A---IKR

whereas it should be:

ANIKANR
A-IK--R

The missing-residue information is included in the pdbx_poly_seq_scheme category of the mmCIF file, where unobserved residues appear as question marks. This information is also directly accessible through the SEQATOMS database.
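
As a rough sketch of how the question marks can be read out, the snippet below parses a tiny pdbx_poly_seq_scheme loop in plain Python. The fragment is invented for illustration (it is not taken from a real entry and shows only four of the category's columns), and the helper names `parse_simple_loop` and `unobserved_residues` are hypothetical. The parser is deliberately simplified and assumes one row per line with no quoted values; for real files, Biopython's `Bio.PDB.MMCIF2Dict.MMCIF2Dict` should give access to the same columns as lists keyed by the full item name.

```python
# Hypothetical fragment, invented for illustration; not from a real PDB entry.
CIF_FRAGMENT = """\
loop_
_pdbx_poly_seq_scheme.asym_id
_pdbx_poly_seq_scheme.seq_id
_pdbx_poly_seq_scheme.mon_id
_pdbx_poly_seq_scheme.pdb_mon_id
A 1 GLY GLY
A 2 ASN ASN
A 3 ILE ?
A 4 LYS ?
A 5 ALA ?
A 6 ASN ?
A 7 ARG ARG
#
"""


def parse_simple_loop(text, category):
    """Parse one whitespace-delimited mmCIF loop into a list of dicts.

    Simplified: assumes one row per line and no quoted or multi-line values.
    """
    prefix = "_" + category + "."
    columns, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            columns.append(line[len(prefix):])
        elif columns:
            if not line or line.startswith(("#", "_", "loop_", "data_")):
                break  # end of the loop we were reading
            rows.append(dict(zip(columns, line.split())))
    return rows


def unobserved_residues(rows):
    """Return (seq_id, mon_id) for rows whose pdb_mon_id is '?'."""
    return [(int(r["seq_id"]), r["mon_id"]) for r in rows if r["pdb_mon_id"] == "?"]


rows = parse_simple_loop(CIF_FRAGMENT, "pdbx_poly_seq_scheme")
print(unobserved_residues(rows))
# [(3, 'ILE'), (4, 'LYS'), (5, 'ALA'), (6, 'ASN')]
```

In this made-up fragment the asparagine at position 2 is observed and the one at position 6 is not, which is exactly the kind of distinction a sequence alignment cannot provide.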