Hello All,
I'm trying to parse an Exonerate-generated alignment output file. The alignment was generated using the "protein2dna | bestfit" model and contains a single HSP with multiple HSPFragment objects inside it.
The alignment looks as follows: Pastebin Link. I wish to extract the three-letter amino acid sequence of the target sequence from the alignment. However, I don't know whether there is a direct way to fetch the concatenated HSPFragment sequences with the gap characters removed.
I'm using the following code to extract FASTA-format sequences for each HSPFragment (seven in this case), write them to a file, and then concatenate them by hand after removing the "X" characters, which represent gaps in the alignment. The "#" characters denote frameshifts and do not appear in the FASTA-format sequences. The output can be found in this Pastebin Link.
from Bio import SearchIO

# Read the single query result from the Exonerate text output
qresult = SearchIO.read('bestfit.aln', 'exonerate-text')

# Print the hit (target) side of every HSPFragment in FASTA format
for fragment in qresult.fragments:
    qfrag = fragment.hit
    print(qfrag.format("fasta"))
Thanks
Thanks Bow and lxe. Removing the "X"s is trivial, but is there any way to get just the sequence of each HSPFragment other than printing it in FASTA format? I've tried the attributes 'seq' and '_seq' of the 'qfrag' object, as listed by dir(qfrag), but both produced output like this, containing a truncated sequence:
What you're seeing is the representation of Biopython's Seq object. You can get the actual string of the sequence of that object by using str, e.g. str(qfrag.seq).
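Once each fragment is available as a plain string, joining them and dropping the gap characters could look roughly like the sketch below. The helper name join_fragments and the toy input strings are made up for illustration, and it assumes, as described in the question, that "X" marks gap positions in the fragment sequences. If "X" can also be a genuine residue in your data, this would remove it too.

```python
def join_fragments(fragment_seqs, gap_chars="X"):
    """Concatenate fragment sequences in order and drop gap characters.

    fragment_seqs: iterable of strings or objects convertible with str()
    gap_chars: characters treated as gaps (assumed to be "X" here)
    """
    joined = "".join(str(seq) for seq in fragment_seqs)
    return "".join(ch for ch in joined if ch not in gap_chars)


# Toy strings standing in for str(fragment.hit.seq) of each HSPFragment
example = ["ATGXXXGCT", "GGAXXX", "TTT"]
print(join_fragments(example))
```

With the qresult object from the question, the call would presumably be join_fragments(fragment.hit.seq for fragment in qresult.fragments), since str() is applied to each Seq inside the helper.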