Question

Removing all stop codons from Sequence Record using Biopython

3

6.2 years ago

ckan91 ▴ 40

Hello Everyone,

I have sequences that occasionally contain an erroneous stop codon. Is there a way to filter all stop codons out of a Biopython SeqRecord?

Edit: The sequence is in frame and I would like to remove the whole codon for all sequences in the SeqRec. Apologies for the lack of clarity.

Thank you so much! Chris

biopython • 4.7k views

score 1 · Answer 1 · 2018-02-01

1

6.2 years ago

Selenocysteine ▴ 620

Bastien is right, several points in your question are unclear: is the sequence already in frame? Do you want to remove the whole codon or just one nucleotide? Assuming that your sequence is already in frame, you can do this:

from Bio.Seq import Seq
from Bio import SeqIO

# DNA and RNA spellings of the three standard stop codons
codon_stop_array = ["TAG", "TGA", "TAA", "UGA", "UAA", "UAG"]

for record in SeqIO.parse("my_fasta_file.fasta", "fasta"):
    print(record.seq)
    sequence = str(record.seq)
    # Collect the codons to keep rather than deleting from a list while
    # indexing into it: each deletion would shift every later codon by 3.
    kept_codons = []
    for index in range(0, len(sequence), 3):
        codon = sequence[index:index+3]
        if codon.upper() not in codon_stop_array:
            kept_codons.append(codon)
    record.seq = Seq("".join(kept_codons))

Note that this also removes the terminal stop codon at the end of the sequence, not only the internal ones.
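If the terminal stop codon should be kept, one option is to skip the last complete codon when filtering. The following is only a minimal sketch under the assumption that the sequence is in frame starting at position 0; `strip_internal_stops` and its `keep_terminal` parameter are hypothetical names, and the function works on plain strings so it can be tried without Biopython.

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA", "UAA", "UAG", "UGA"})


def strip_internal_stops(seq, stops=STOP_CODONS, keep_terminal=True):
    """Drop in-frame stop codons from a string, optionally keeping the final one.

    Assumes the reading frame starts at index 0 (hypothetical helper).
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Index of the last complete codon; a trailing partial codon is not counted.
    last_full = len(seq) // 3 - 1
    kept = []
    for i, codon in enumerate(codons):
        is_stop = codon.upper() in stops
        if is_stop and not (keep_terminal and i == last_full):
            continue  # internal stop codon: skip it
        kept.append(codon)
    return "".join(kept)


print(strip_internal_stops("ATGTAAGGGTGA"))
print(strip_internal_stops("ATGTAAGGGTGA", keep_terminal=False))
```

With a SeqRecord, this could be applied as `record.seq = Seq(strip_internal_stops(str(record.seq)))`.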

0

Thank you for your help!

6.2 years ago by ckan91 ▴ 40

0

This was helpful. I used a variant of this code to prepare alignments for codeml. If you are using it on an alignment, it is important to maintain sequence length, so instead of deleting the stop codon I replaced it with ambiguous characters.

from Bio.Seq import Seq

def replace_stop_codons(record, codon_stop_array=("TAG", "TGA", "TAA")):
    tempRecordSeq = list(record.seq)
    for index in range(0, len(record.seq), 3):
        codon = str(record.seq[index:index+3])
        if codon in codon_stop_array:
            # Same-length replacement, so later indices stay valid
            tempRecordSeq[index:index+3] = '?', '?', '?'
    record.seq = Seq("".join(tempRecordSeq))
    return record

4.3 years ago by au_ndh • 0
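To illustrate why masking matters for alignments, here is a small sketch comparing deletion and masking on a toy two-sequence alignment. The helper names `mask_stops` and `delete_stops` and the toy sequences are made up for illustration, and the code works on plain strings rather than SeqRecord objects.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def split_codons(seq):
    """Split a string into consecutive 3-letter chunks starting at index 0."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def mask_stops(seq, mask_char="?", stops=STOP_CODONS):
    """Replace each in-frame stop codon with mask characters (length preserved)."""
    return "".join(
        mask_char * len(c) if c.upper() in stops else c for c in split_codons(seq)
    )


def delete_stops(seq, stops=STOP_CODONS):
    """Remove each in-frame stop codon (sequence gets shorter)."""
    return "".join(c for c in split_codons(seq) if c.upper() not in stops)


# Toy alignment: both rows have the same length before processing.
aligned = ["ATGTAAGGG", "ATGCCCGGG"]
masked = [mask_stops(s) for s in aligned]
deleted = [delete_stops(s) for s in aligned]

print(masked, {len(s) for s in masked})
print(deleted, {len(s) for s in deleted})
```

After masking, every row still has the same length, whereas deletion shortens only the rows that contained a stop codon and so breaks the column correspondence.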

score 0 · Answer 2 · 2018-02-01

0

6.2 years ago

Bastien Hervé 5.3k

More information is needed here, but assuming the stop codons do not have to be in frame, try something like this, which sorts the records into those that contain a stop triplet anywhere and those that do not:

from Bio import SeqIO

codon_stop_array = ["TAG", "TGA", "TAA"]
record_without_stop = []
record_with_stop = []

for record in SeqIO.parse("your_fasta_file.fasta", "fasta"):
    # Substring test: matches a stop triplet at any position, in any frame
    if any(codon in record.seq for codon in codon_stop_array):
        record_with_stop.append(record)
    else:
        record_without_stop.append(record)
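Because the check above is a substring test, it also flags triplets that straddle codon boundaries. The sketch below contrasts it with an in-frame check on plain strings; the function names and the example sequence are hypothetical, and frame 0 is assumed to be the reading frame.

```python
STOPS = ("TAA", "TAG", "TGA")


def has_stop_any_frame(seq, stops=STOPS):
    """True if a stop triplet occurs anywhere in the string (substring test)."""
    upper = seq.upper()
    return any(stop in upper for stop in stops)


def has_stop_in_frame(seq, stops=STOPS):
    """True only if a complete codon in frame 0 is a stop codon."""
    upper = seq.upper()
    return any(upper[i:i + 3] in stops for i in range(0, len(upper) - 2, 3))


# Codons in frame 0 are ATG, ATA, GGC: no stop codon,
# yet "TAG" appears across the boundary between ATA and GGC.
example = "ATGATAGGC"
print(has_stop_any_frame(example))
print(has_stop_in_frame(example))
```

For in-frame coding sequences, the second check is probably closer to what the question asks for; with a SeqRecord it could be called as `has_stop_in_frame(str(record.seq))`.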