Extracting Target Sequence From Exonerate Output
2
4
Entering edit mode
11.0 years ago
Woa ★ 2.9k

Hello All,

I'm trying to parse an Exonerate generated alignment output file. The alignment was generated using the "protein2dna | bestfit" model and contains a single 'HSP' and multiple "HSPFragments" object within the HSP.

The alignment looks as follows: Pastebin Link . I wish to extract the 3 letter amino acid sequence of the target sequence from the alignment. However I don't know if there is any direct way to fetch the concatenated "HSPfragment" sequences and removing the gap characters.

I'm using the following code to extract fasta format sequences for each of the "HSPFragment"( seven for this case), writing them to a file and then concatanate individually, after removing the "X" characters which represents gap in the alignment.The "#" characters denote frameshifts and they don't appear in the fasta format sequences.The output can be found in this Pastebin Link

from Bio import SearchIO
qresult = SearchIO.read('bestfit.aln', 'exonerate-text')
all_frag= len(qresult.fragments)

for i in range(0,all_frag):
    qfrag= qresult.fragments[i].hit
    print qfrag.format("fasta")

Thanks

biopython parsing • 6.0k views
ADD COMMENT
2
Entering edit mode
11.0 years ago
bow ▴ 790

Hi Woa,

Thanks for posting the question with the input ~ it's always handy to have them :).

Answering your question, in short: no. There's no way to do this automatically with Biopython's SearchIO. For now, you have to edit / remove the 'X's manually I'm afraid.

The Exonerate parser here first tries to determine the fragments based on whether Exonerate outputs any exon/intron information. Then, if any frameshifts occured, it will split further based on these frameshifts. The '#' character was indeed removed, but the frameshift information is still retained. You can see them on the fragments using the .hit_frame / .query_frame or .hit_frames / .query_frames in the HSP objects.

Hope that answers your question :).

ADD COMMENT
0
Entering edit mode

Thanks Bow and lxe. Removing "X"s are trivial but is there any way to get just the sequences for each HSPFragments other than printing them in the FASTA format? I've tried the methods 'seq' and '_seq' available for a 'qfrag' object as found by dir(qfrag). But both of them produced some output like this containing truncated sequence:

>>> qfrag.seq

Seq('MSLIRVNRFGSRGGGRKTLKVKKKASVRQEWDNTVNDLTVHRATPEDLVRRHEI...EKR', ProteinAlphabet())
ADD REPLY
0
Entering edit mode

What you're seeing is the representation of Biopython's Seq object. You can get the actual string of the sequence of that object by using str, e.g. str(qfrag.seq).

ADD REPLY
0
Entering edit mode
11.0 years ago
pd3 ▴ 350

Take a look at the --ryo option of exonerate. One can customize the output to be more parsable. http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html

ADD COMMENT
1
Entering edit mode

the link above disappeared, you can use the ubuntu man page instead: http://manpages.ubuntu.com/manpages/hardy/man1/exonerate.1.html

ADD REPLY

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6