How to check if my sequence is DNA or Protein in BioPython?
Entering edit mode
3 months ago
O.rka ▴ 710

I'm trying to develop a script that checks whether or not the input fasta is the correct molecule type but I want it to account for ambiguous characters.

Here's what I have for reading in fasta:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open(filepath, "r") as f:
    for header, seq in SimpleFastaParser(f):
        #func(seq) --> {dna or protein}

I know that Alphabet was removed so I'm a little confused on how to check this with BioPython. I could always look up the IUPAC alphabet for DNA and proteins then check if the sequence set is a subset but I feel like there has to be a better way.

genomics biopython fasta • 592 views
Entering edit mode

No need to look it up, it is already part of BioPython, see line 7 of BioPython's file:

"""Information about the IUPAC alphabets."""
Entering edit mode

I don't think any major sophistication is needed here. Amino acid frequencies for ACGT are well below 10% for proteins:

A 0.074
C 0.025
G 0.074
T 0.051

If for any two of them one counts > 10% frequency, it has to be a nucleic acid.

Similarly, C frequency in proteins is 2.5% which pretty much guarantees that dipeptides AC, GC and TC are going to be < 0.5%. On the other hand, dinucleotides AC, GC and TC should all be > 3-5% in nucleic acids. If any two of them are > 5% it has to be a nucleic acid.

These common-sense rules may fail on short individual sequences of unusual composition, but I think they should work on longer sequences or a group of 5+ sequences.

Entering edit mode

I think possibly even more simply (though no idea which is more computationally efficient), one could simply check that the set() of characters = [A, T, C, G, X] (or whatever ambiguous character is being used). If the sets don't match it has to be protein (or vice versa).


Login before adding your answer.

Traffic: 1944 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6