I have Sanger sequences that I trimmed by running BLAST against reference sequences and using the query start/end positions. Then, using translate from Biopython, I translate each sequence in the three 5'->3' frames (the sequences are reverse-complemented before trimming), using a snippet similar to this one:
> print("In frame")
> print(record.seq.translate())
> print("Offset by one")
> print(record.seq[1:].translate())
> print("Offset by two")
> print(record.seq[2:].translate())
Then, to get the correct translation (because of sequencing errors it's not always clear), I take the translation with the fewest "X" and "*" characters. However, that doesn't always work: sometimes the translation I need has one more X than another frame. In the example below, I need the in-frame translation, but offset two is chosen because its sum of "X" and "*" is 28, versus 29 for the in-frame translation.
CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA

in frame: QXQLQQWGAGLLKPSESLSLTCXVXGGSXXGXYXSXIXXXPXXRGGVXXGNSXXWXGXTTTXXP*KTRXXXVXETXXXFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARPEQQLPTAFGYWGQGTLVTVSS

offset one: RXSYSSGAQDC*SLRRACPSPAXSXVGXSXGXTGAGSXXPXXXGGEWXGGIXXXGXXPLQPXPPRRXXXXXXXKPXXXSSSSWWQLPDGSCPRCSYSSGAQDC*SLRRPCPSPALSMVGPSVVTTGAGSASPQGRGWSGLGKSIIVEAPTTTRPSRVESPYQ*TRPRTSSP*S*AL*PPRTRLCITVRDRSSSCLPPLATGAREP*SPSP

offset two: GAATAVGRRTVEAFGEPVPHLXXLWWVXQXVXXEXDPPXPXXXGGSGXGEFXXXGRXHYNPXPLEDXXGXXX*NXXXXLPPPGGSSQMGPVPGAATAVGRRTVEAFGDPVPHLRCLWWVLQWLLLELDPPAPREGAGVDWGNQS*WKHQLQPVPQESSHHISRHVQEPVLPEAELCDRRGHGCVLLCETGAAAAYRLWLLGPGNPSHRLL
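For reference, the selection rule I described above (fewest "X" plus "*") looks roughly like the sketch below. This is a simplified stand-in for what the larger script does, not the script itself: the function names `count_ambiguous` and `pick_frame` are made up for this post, and the input is assumed to be a dict of already-translated strings keyed by frame label (e.g. built from `str(record.seq[i:].translate())`). The toy strings are for illustration only.

```python
def count_ambiguous(protein: str) -> int:
    """Count unknown residues (X) and stop codons (*) in a translation."""
    return protein.count("X") + protein.count("*")


def pick_frame(translations: dict[str, str]) -> str:
    """Return the label of the translation with the fewest X and * characters.

    On a tie, min() keeps the first label in the dict's insertion order.
    """
    return min(translations, key=lambda label: count_ambiguous(translations[label]))


# Toy translations, for illustration only (hypothetical input).
toy = {
    "in frame": "QXQLQQWGAGLLKPSE",    # 1 X
    "offset one": "RXSYSSGAQDC*SLRR",  # 1 X + 1 stop
    "offset two": "GAATAVGRRTVEAFGE",  # none
}
best = pick_frame(toy)
print(best)
```

Because the rule only looks at the total count, a frame that is off by a single "X" loses, which is exactly the failure in the example above.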
This is all part of a much larger Python script, so I'm just wondering if anyone has any suggestions for choosing the correct frame. I am not looking for an ORF here; there will be no start or stop codon.
I know one thing I can do is run an additional BLAST with each frame and take the one with the highest % identity or the best e-value. However, I wanted to avoid another BLAST if I could help it.