Question: Extracting Target Sequence From Exonerate Output
1
gravatar for Woa
5.8 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Hello All,

I'm trying to parse an Exonerate generated alignment output file. The alignment was generated using the "protein2dna | bestfit" model and contains a single 'HSP' and multiple "HSPFragments" object within the HSP.

The alignment looks as follows: Pastebin Link . I wish to extract the 3 letter amino acid sequence of the target sequence from the alignment. However I don't know if there is any direct way to fetch the concatenated "HSPfragment" sequences and removing the gap characters.

I'm using the following code to extract fasta format sequences for each of the "HSPFragment"( seven for this case), writing them to a file and then concatanate individually, after removing the "X" characters which represents gap in the alignment.The "#" characters denote frameshifts and they don't appear in the fasta format sequences.The output can be found in this Pastebin Link

from Bio import SearchIO
qresult = SearchIO.read('bestfit.aln', 'exonerate-text')
all_frag= len(qresult.fragments)

for i in range(0,all_frag):
    qfrag= qresult.fragments[i].hit
    print qfrag.format("fasta")

Thanks

biopython parsing • 3.3k views
ADD COMMENTlink modified 2.5 years ago by Biostar ♦♦ 20 • written 5.8 years ago by Woa2.7k
2
gravatar for bow
5.8 years ago by
bow780
Netherlands
bow780 wrote:

Hi Woa,

Thanks for posting the question with the input ~ it's always handy to have them :).

Answering your question, in short: no. There's no way to do this automatically with Biopython's SearchIO. For now, you have to edit / remove the 'X's manually I'm afraid.

The Exonerate parser here first tries to determine the fragments based on whether Exonerate outputs any exon/intron information. Then, if any frameshifts occured, it will split further based on these frameshifts. The '#' character was indeed removed, but the frameshift information is still retained. You can see them on the fragments using the .hit_frame / .query_frame or .hit_frames / .query_frames in the HSP objects.

Hope that answers your question :).

ADD COMMENTlink written 5.8 years ago by bow780

Thanks Bow and lxe. Removing "X"s are trivial but is there any way to get just the sequences for each HSPFragments other than printing them in the FASTA format? I've tried the methods 'seq' and '_seq' available for a 'qfrag' object as found by dir(qfrag). But both of them produced some output like this containing truncated sequence:

>>> qfrag.seq

Seq('MSLIRVNRFGSRGGGRKTLKVKKKASVRQEWDNTVNDLTVHRATPEDLVRRHEI...EKR', ProteinAlphabet())
ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Woa2.7k

What you're seeing is the representation of Biopython's Seq object. You can get the actual string of the sequence of that object by using str, e.g. str(qfrag.seq).

ADD REPLYlink written 5.8 years ago by bow780
0
gravatar for pd3
5.8 years ago by
pd3340
pd3340 wrote:

Take a look at the --ryo option of exonerate. One can customize the output to be more parsable. http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html

ADD COMMENTlink written 5.8 years ago by pd3340

the link above disappeared, you can use the ubuntu man page instead: http://manpages.ubuntu.com/manpages/hardy/man1/exonerate.1.html

ADD REPLYlink written 3.5 years ago by Michael Dondrup45k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1732 users visited in the last hour