Question: (Bio)Python - choose correct frame translation post trimming from blast (No ORF)
0
gravatar for st.ph.n
2.8 years ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

I have sequences (sanger) that I trimmed using blast against reference sequences use the query start/end positions. Then using translate from Biopython, I translate each sequence in the 3 5'->3' frames (pre-trimming the sequences are rev-complemented), using a snippet similar to this one:

> print("In frame") print(record.seq.translate()
> print("Offset by one") print(record.seq[1:].translate())
> print("Offset by two") print(record.seq[2:].translate())

Then to get the correct translation (due to sequencing errors it's not always clear), I take the sequence that has the least number of "X" and "" within each sequence. However, sometimes that doesn't always work. Sometimes the sequence I need has maybe one more X in it's translation. In the example below, I would need the inframe translation, but offset two is chosen because the sum of "X" and "" is 28 vs 29 in the inframe translation.

CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA 
        in frame:   QXQLQQWGAGLLKPSESLSLTCXVXGGSXXGXYXSXIXXXPXXRGGVXXGNSXXWXGXTTTXXP*KTRXXXVXETXXXFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARPEQQLPTAFGYWGQGTLVTVSS 
        offset one:     RXSYSSGAQDC*SLRRACPSPAXSXVGXSXGXTGAGSXXPXXXGGEWXGGIXXXGXXPLQPXPPRRXXXXXXXKPXXXSSSSWWQLPDGSCPRCSYSSGAQDC*SLRRPCPSPALSMVGPSVVTTGAGSASPQGRGWSGLGKSIIVEAPTTTRPSRVESPYQ*TRPRTSSP*S*AL*PPRTRLCITVRDRSSSCLPPLATGAREP*SPSP 
        offset two:     GAATAVGRRTVEAFGEPVPHLXXLWWVXQXVXXEXDPPXPXXXGGSGXGEFXXXGRXHYNPXPLEDXXGXXX*NXXXXLPPPGGSSQMGPVPGAATAVGRRTVEAFGDPVPHLRCLWWVLQWLLLELDPPAPREGAGVDWGNQS*WKHQLQPVPQESSHHISRHVQEPVLPEAELCDRRGHGCVLLCETGAAAAYRLWLLGPGNPSHRLL

This is all part of a much larger python script, so I'm just wondering if anyone has any suggestions on choosing the correct frame. I am not looking for ORF here, there will be no start/stop codon.

I know one thing I can do is do an additional blast with each frame, and take the one with the highest % id/e-value. Although, I wanted to avoid another blast if I could help it.

ADD COMMENTlink modified 19 months ago by Biostar ♦♦ 20 • written 2.8 years ago by st.ph.n2.4k

I can't help but wonder what you are trying to achieve, which simultaneously makes it rather difficult to take a stab at providing an answer.

ADD REPLYlink written 2.8 years ago by WouterDeCoster38k

edited to show example.

ADD REPLYlink written 2.8 years ago by st.ph.n2.4k

So in your example, perhaps I misunderstood, how do you know you need the in frame translation while it has more "X" & "*"?

ADD REPLYlink written 2.8 years ago by WouterDeCoster38k

The correct sequence is an antibody heavy chain v-region, which is the in frame translation. More often than not the sequencing is good. I use Biopython to extract the fasta from the sanger file (ab1), so I can't read the chromatogram to correct any errors prior to translation if it does have a lot of errors like this example. Downstream, errors can be corrected with imgt germ-line sequence, but not if the correct translation isn't chosen.

ADD REPLYlink written 2.8 years ago by st.ph.n2.4k
2
gravatar for Markus
2.8 years ago by
Markus230
Markus230 wrote:

Instead of doing another blast search you can use Biopython's built in pairwise2 module to make this comparison:

from Bio.Seq import translate
from Bio import pairwise2

query = 'CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA'
target = 'MKHLWFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAREIAPLSDFDYWGQGTLVTVSSGSASAPTLFPLVSCENSPSDTS'

for frame_start in range(3):
    frame = translate(query[frame_start:])
    score = pairwise2.align.localxx(frame, target, score_only=True)
    print('frame{} score: {}'.format(frame_start + 1, score))

The result looks like this:

frame1 score:  129.0
frame2 score:  62.0
frame3 score:  59.0

This gives a clear vote for the first frame. This will do the job as long as your sequencing errors are not indels which would shift the reading frame within the sequence. However, in that case your former approach with counting the numbers of X and * would also not work

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Markus230
0
gravatar for Asaf
2.8 years ago by
Asaf5.5k
Israel
Asaf5.5k wrote:

If it's possible try to manually go over the sequencing results and resolve the Ns (using the ab1 files). Regardless of the above, I think you should ignore the X's, you'll get different number according to the number of Ns ruining a codon. I would look at the longest translation to guess the right frame. Try to see if you have a frameshift mutation if you can't get a translation along the entire sequence. Paste the start in one frame and the end in another and blast it to see if they map to the same protein.

ADD COMMENTlink written 2.8 years ago by Asaf5.5k

@Asaf, see my comment above. In order to attempt to pull the correct sequence, I take the longest sequence, with the least number of "*"/"X". More often than not, the sequence is good, but in the example above it is a bad sequence but I still need to pull the correct one for the hc v-region. I suppose I would have to do an additional blast post translation, and take the one with the highest % id against imgt germ-line. I've runa a test which does give m the in -frame translation. However, as stated in the question, I hoped to avoid doing another blast. Sometimes I have hundreds of sequences and processing them manually isn't efficient.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by st.ph.n2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 823 users visited in the last hour