How can I remove the sequences that contain ambiguous amino acids from a multiple FASTA file? (with Python)
21 months ago
M. ▴ 30

I have a FASTA file with numerous protein sequences (each header on one line, followed by the amino acid codes over several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -- ). I want to remove the sequences containing such codes and generate a new FASTA file. How can I do this in Python? I managed to remove one amino acid code (X) at a time with the following code, but how can I remove them all at once?

from Bio import SeqIO

sequences = SeqIO.parse("sequences.fasta", "fasta")
filtered = [seq for seq in sequences if seq.seq.count('X') == 0]

with open('sequences_without_Xs', 'wt') as output:
    SeqIO.write(filtered, output, 'fasta')
21 months ago

This code removes every sequence that contains at least one character outside the 20 standard amino acid codes.

from Bio import SeqIO

# The 20 standard one-letter amino acid codes (uppercase only).
AMINOACIDS = set('ACDEFGHIKLMNPRSTWVQY')

with open('sequences_valid.fasta', 'w') as output:
    for seq_record in SeqIO.parse("sequences.fasta", "fasta"):
        # Keep the record only if no character falls outside AMINOACIDS.
        if not set(seq_record.seq).difference(AMINOACIDS):
            output.write(seq_record.format('fasta'))
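
For anyone who wants to try the same whitelist idea without Biopython installed, here is a minimal standard-library sketch. The names `read_fasta`, `is_standard`, and the in-memory demo records are hypothetical and only for illustration. Unlike the code above, it uppercases each sequence before checking, on the assumption that lowercase residues should count as valid.

```python
import io

# The 20 standard one-letter amino acid codes.
STANDARD = frozenset("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_standard(sequence):
    """True if every character is a standard amino acid (case-insensitive)."""
    return set(sequence.upper()) <= STANDARD


# Hypothetical demo data standing in for sequences.fasta.
demo = io.StringIO(">ok\nMKT\nAYI\n>has_x\nMKXA\n>has_gap\nMK-A\n>lower\nmkta\n")
kept = [(header, seq) for header, seq in read_fasta(demo) if is_standard(seq)]
print(kept)
```

Note that an empty sequence passes this check, since it contains no invalid characters; add a length test if that matters for your data.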

Thank you!! That works for every possible flaw.

21 months ago
Mensur Dlakic ★ 27k

Using the code below instead of your filtered line should do the trick.

filtered = [
    seq
    for seq in sequences
    if seq.seq.count("X") == 0
    and seq.seq.count("B") == 0
    and seq.seq.count("J") == 0
    and seq.seq.count("O") == 0
    and seq.seq.count("U") == 0
    and seq.seq.count("Z") == 0
]
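
The chained `count` calls can also be collapsed into a single set test. This is only a sketch: `AMBIGUOUS` and `has_ambiguous` are hypothetical names, and including `-` assumes that the `--` in the question refers to gap characters.

```python
# Codes treated as ambiguous or exceptional; "-" is an assumption about
# what the question means by "--".
AMBIGUOUS = frozenset("BJOUZX-")


def has_ambiguous(sequence):
    """True if the sequence contains at least one of the AMBIGUOUS codes."""
    return not AMBIGUOUS.isdisjoint(sequence.upper())


# Hypothetical example sequences.
examples = ["MKTAYI", "MKXA", "mk-a", "MKU"]
results = [has_ambiguous(seq) for seq in examples]
print(results)
```

With Biopython records, the filter line would then become something like `filtered = [rec for rec in sequences if not has_ambiguous(str(rec.seq))]`. Keep in mind that a blacklist like this only rejects the listed codes, so other unexpected characters (for example `*`) would still pass.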

Oh... Thank you!
