I'm trying to parse an alignment output file generated by Exonerate. The alignment was generated using the "protein2dna | bestfit" model and contains a single HSP with multiple HSPFragment objects inside it.
The alignment looks as follows: Pastebin Link. I want to extract the three-letter amino acid sequence of the target sequence from the alignment. However, I don't know whether there is a direct way to fetch the concatenated HSPFragment sequences with the gap characters removed.
I'm using the following code to extract a FASTA-format sequence for each HSPFragment (seven in this case) and write them to a file. I then concatenate them one by one after removing the "X" characters, which represent gaps in the alignment. The "#" characters denote frameshifts, and they don't appear in the FASTA-format sequences. The output can be found in this Pastebin Link.
from Bio import SearchIO

# Read the single query result from the Exonerate text output
qresult = SearchIO.read('bestfit.aln', 'exonerate-text')

# Print the hit (target) sequence of every HSPFragment in FASTA format
for frag in qresult.fragments:
    print(frag.hit.format("fasta"))
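To make the concatenation step concrete, here is a minimal sketch of what I currently do by hand: join the fragment sequences in order and drop the gap characters. The helper name `join_fragments` and the example strings are made up for illustration, and it assumes that "X" only ever marks a gap and never a real residue.

```python
def join_fragments(fragment_seqs, gap_chars="X"):
    """Concatenate fragment sequences in order, dropping gap characters.

    fragment_seqs: iterable of plain strings, one per HSPFragment.
    gap_chars: characters treated as gaps (assumption: only "X").
    """
    gaps = set(gap_chars)
    return "".join(ch for seq in fragment_seqs for ch in seq if ch not in gaps)


# Made-up fragment strings, standing in for the real fragment sequences
example = ["ATGXXXGCT", "GGAXXXTTC"]
print(join_fragments(example))
```

With the parsed result from my code, the input strings would presumably be something like `str(frag.hit.seq) for frag in qresult.fragments`, but I'd prefer a built-in way if one exists.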