Question: Multiple Sequence Alignment In Biopython.
7.4 years ago by
Mkl (100) wrote:

Hi all,

1) How can I compare ASTRAL SCOP genetic domain sequences based on PDB SEQRES and PDB ATOM records? My aim is to find the residues that are missing from these records. I know that I have to align these sequences. How can I do the alignment, and how can I find the missing residues, using Biopython?

2) ClustalW takes a group of sequences and performs all pairwise alignments. It then calculates a similarity matrix, which it analyzes to see how distantly related the groups of sequences are. How can I perform these steps (pairwise sequence alignment, distance matrix, hierarchical clustering, dendrogram) in Biopython?

biopython • 6.3k views
7.4 years ago by
Cambridge, UK
Michael Schubert (6.9k) wrote:

BioPython mainly provides I/O and sequence handling; it does not implement multiple sequence alignment algorithms itself. For that you have to call an external program, e.g. ClustalW.

A possible workflow would be:

  • use BioPython to read the FASTA sequences from SCOP
  • use BioPython to read the PDB SEQRES/ATOM sequences (described here)
  • align them using ClustalW
  • search for gaps in the Alignment object

All of the steps are described in the BioPython cookbook, which I highly recommend you read.
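
The last step, searching for gaps, can be sketched in plain Python once the two aligned rows are available as strings (with Biopython, `str(record.seq)` for each record of the alignment). The function name `missing_positions` and the two example rows are made up for illustration; this is a minimal sketch, and the result is only as reliable as the gap placement in the alignment itself.

```python
def missing_positions(seqres_row, atom_row, gap="-"):
    """Return (1-based SEQRES position, residue) for every SEQRES residue
    that is aligned to a gap in the ATOM row."""
    if len(seqres_row) != len(atom_row):
        raise ValueError("aligned rows must have the same length")
    missing = []
    pos = 0  # running position in the ungapped SEQRES sequence
    for s, a in zip(seqres_row, atom_row):
        if s == gap:
            continue  # column carries no SEQRES residue
        pos += 1
        if a == gap:
            missing.append((pos, s))
    return missing


# Hypothetical aligned rows, not taken from a real PDB entry.
seqres_row = "MKTAYIAKQR"
atom_row = "--TAYI--QR"
print(missing_positions(seqres_row, atom_row))
# [(1, 'M'), (2, 'K'), (7, 'A'), (8, 'K')]
```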

7.2 years ago by
Jan Kosinski (1.6k) wrote:

Using ClustalW for this particular task (aligning SEQRES and ATOM sequences in order to find missing residues in a PDB structure) is the wrong approach.

To map SEQRES to ATOM sequences you should rely on the information the PDB provides in mmCIF format. Aligning them with ClustalW or any other program does not guarantee a proper mapping. For example, in this alignment:


the mapping of asparagine (N) flanking the missing residues is ambiguous based on the alignment.
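
This kind of ambiguity can be reproduced with a made-up sequence pair (not the alignment referred to above). The sketch below assumes a single contiguous stretch of missing residues and lists every gap placement that turns the SEQRES sequence into the ATOM sequence; `single_gap_placements` and both sequences are hypothetical.

```python
def single_gap_placements(seqres, atom):
    """List every gapped ATOM row that is consistent with deleting
    one contiguous block of residues from seqres."""
    n_missing = len(seqres) - len(atom)
    if n_missing <= 0:
        raise ValueError("atom must be shorter than seqres")
    placements = []
    for start in range(len(atom) + 1):
        # Delete n_missing residues starting at index `start`.
        if seqres[:start] + seqres[start + n_missing:] == atom:
            placements.append(atom[:start] + "-" * n_missing + atom[start:])
    return placements


# Hypothetical sequences: three residues are missing next to an asparagine.
seqres = "AKNGSNLV"
atom = "AKNLV"
for row in single_gap_placements(seqres, atom):
    print(seqres)
    print(row)
    print()
```

Both `AK---NLV` and `AKN---LV` are returned, so from the sequences alone the observed N could be residue 3 or residue 6 of SEQRES.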

In another example, my ClustalX with default options (Gonnet250, gap open: 10, extend: 0.1) finds this alignment optimal:


whereas it should be:


The missing-residue information is included in the pdbx_poly_seq_scheme field in mmCIF as question marks. This information is also directly accessible through the SEQATOMS database.
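
A minimal sketch of reading that field, assuming the usual column layout of the `_pdbx_poly_seq_scheme` loop, in which unobserved residues carry `?` in `pdb_mon_id`. The fragment below is made up for illustration, and the parser is deliberately simplified (whitespace-separated values only, no quoted strings). With Biopython installed, `Bio.PDB.MMCIF2Dict.MMCIF2Dict` should give the same columns for a real file, as lists keyed by the full item name.

```python
# Hypothetical mmCIF fragment, not copied from a real entry.
CIF_FRAGMENT = """\
loop_
_pdbx_poly_seq_scheme.asym_id
_pdbx_poly_seq_scheme.entity_id
_pdbx_poly_seq_scheme.seq_id
_pdbx_poly_seq_scheme.mon_id
_pdbx_poly_seq_scheme.ndb_seq_num
_pdbx_poly_seq_scheme.pdb_seq_num
_pdbx_poly_seq_scheme.auth_seq_num
_pdbx_poly_seq_scheme.pdb_mon_id
_pdbx_poly_seq_scheme.auth_mon_id
_pdbx_poly_seq_scheme.pdb_strand_id
_pdbx_poly_seq_scheme.pdb_ins_code
_pdbx_poly_seq_scheme.hetero
A 1 1 MET 1 1 ? ? ? A . n
A 1 2 GLY 2 2 2 GLY GLY A . n
A 1 3 ASN 3 3 ? ? ? A . n
A 1 4 LEU 4 4 4 LEU LEU A . n
#
"""


def read_poly_seq_scheme(cif_text):
    """Return the pdbx_poly_seq_scheme loop as a list of dicts (simplified)."""
    prefix = "_pdbx_poly_seq_scheme."
    columns, rows = [], []
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            columns.append(line[len(prefix):])
        elif columns:
            if not line or line.startswith(("#", "loop_", "_")):
                break  # end of this loop
            rows.append(dict(zip(columns, line.split())))
    return rows


def missing_residues(rows):
    """SEQRES residues without an ATOM counterpart: pdb_mon_id is '?'."""
    return [(r["pdb_strand_id"], int(r["seq_id"]), r["mon_id"])
            for r in rows if r["pdb_mon_id"] == "?"]


missing = missing_residues(read_poly_seq_scheme(CIF_FRAGMENT))
print(missing)
# [('A', 1, 'MET'), ('A', 3, 'ASN')]
```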


Thanks so much! I have been searching for quite some time for how to retrieve the alignments shown under the sequence tab on the PDB website. Taking some time again today, I finally found the answer :) After a while I even knew it was in the mmCIF, but couldn't find where exactly!

written 7.1 years ago by Jonasr120