Question

(Bio)Python - choose correct frame translation post trimming from blast (No ORF)

0

Entering edit mode

7.8 years ago

st.ph.n ★ 2.7k

I have sequences (sanger) that I trimmed using blast against reference sequences use the query start/end positions. Then using translate from Biopython, I translate each sequence in the 3 5'->3' frames (pre-trimming the sequences are rev-complemented), using a snippet similar to this one:

> print("In frame") print(record.seq.translate()
> print("Offset by one") print(record.seq[1:].translate())
> print("Offset by two") print(record.seq[2:].translate())

Then to get the correct translation (due to sequencing errors it's not always clear), I take the sequence that has the least number of "X" and "" within each sequence. However, sometimes that doesn't always work. Sometimes the sequence I need has maybe one more X in it's translation. In the example below, I would need the inframe translation, but offset two is chosen because the sum of "X" and "" is 28 vs 29 in the inframe translation.

CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA 
        in frame:   QXQLQQWGAGLLKPSESLSLTCXVXGGSXXGXYXSXIXXXPXXRGGVXXGNSXXWXGXTTTXXP*KTRXXXVXETXXXFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARPEQQLPTAFGYWGQGTLVTVSS 
        offset one:     RXSYSSGAQDC*SLRRACPSPAXSXVGXSXGXTGAGSXXPXXXGGEWXGGIXXXGXXPLQPXPPRRXXXXXXXKPXXXSSSSWWQLPDGSCPRCSYSSGAQDC*SLRRPCPSPALSMVGPSVVTTGAGSASPQGRGWSGLGKSIIVEAPTTTRPSRVESPYQ*TRPRTSSP*S*AL*PPRTRLCITVRDRSSSCLPPLATGAREP*SPSP 
        offset two:     GAATAVGRRTVEAFGEPVPHLXXLWWVXQXVXXEXDPPXPXXXGGSGXGEFXXXGRXHYNPXPLEDXXGXXX*NXXXXLPPPGGSSQMGPVPGAATAVGRRTVEAFGDPVPHLRCLWWVLQWLLLELDPPAPREGAGVDWGNQS*WKHQLQPVPQESSHHISRHVQEPVLPEAELCDRRGHGCVLLCETGAAAAYRLWLLGPGNPSHRLL

This is all part of a much larger python script, so I'm just wondering if anyone has any suggestions on choosing the correct frame. I am not looking for ORF here, there will be no start/stop codon.

I know one thing I can do is do an additional blast with each frame, and take the one with the highest % id/e-value. Although, I wanted to avoid another blast if I could help it.

python blast translation frame biopython • 4.3k views

ADD COMMENT • link updated 6.6 years ago by Biostar 20 • written 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I can't help but wonder what you are trying to achieve, which simultaneously makes it rather difficult to take a stab at providing an answer.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

edited to show example.

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

So in your example, perhaps I misunderstood, how do you know you need the in frame translation while it has more "X" & "*"?

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

The correct sequence is an antibody heavy chain v-region, which is the in frame translation. More often than not the sequencing is good. I use Biopython to extract the fasta from the sanger file (ab1), so I can't read the chromatogram to correct any errors prior to translation if it does have a lot of errors like this example. Downstream, errors can be corrected with imgt germ-line sequence, but not if the correct translation isn't chosen.

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

7.8 years ago

Asaf 10k

If it's possible try to manually go over the sequencing results and resolve the Ns (using the ab1 files). Regardless of the above, I think you should ignore the X's, you'll get different number according to the number of Ns ruining a codon. I would look at the longest translation to guess the right frame. Try to see if you have a frameshift mutation if you can't get a translation along the entire sequence. Paste the start in one frame and the end in another and blast it to see if they map to the same protein.

ADD COMMENT • link 7.8 years ago by Asaf 10k

0

Entering edit mode

@Asaf, see my comment above. In order to attempt to pull the correct sequence, I take the longest sequence, with the least number of "*"/"X". More often than not, the sequence is good, but in the example above it is a bad sequence but I still need to pull the correct one for the hc v-region. I suppose I would have to do an additional blast post translation, and take the one with the highest % id against imgt germ-line. I've runa a test which does give m the in -frame translation. However, as stated in the question, I hoped to avoid doing another blast. Sometimes I have hundreds of sequences and processing them manually isn't efficient.

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

score 2 · Accepted Answer · 2016-07-15

Instead of doing another blast search you can use Biopython's built in pairwise2 module to make this comparison:

from Bio.Seq import translate
from Bio import pairwise2

query = 'CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA'
target = 'MKHLWFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAREIAPLSDFDYWGQGTLVTVSSGSASAPTLFPLVSCENSPSDTS'

for frame_start in range(3):
    frame = translate(query[frame_start:])
    score = pairwise2.align.localxx(frame, target, score_only=True)
    print('frame{} score: {}'.format(frame_start + 1, score))

The result looks like this:

frame1 score:  129.0
frame2 score:  62.0
frame3 score:  59.0

This gives a clear vote for the first frame. This will do the job as long as your sequencing errors are not indels which would shift the reading frame within the sequence. However, in that case your former approach with counting the numbers of X and * would also not work