How to check if my sequence is DNA or Protein in BioPython?
0
0
Entering edit mode
12 months ago
O.rka ▴ 740

I'm trying to develop a script that checks whether or not the input fasta is the correct molecule type but I want it to account for ambiguous characters.

Here's what I have for reading in fasta:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open(filepath, "r") as f:
    for header, seq in SimpleFastaParser(f):
        #func(seq) --> {dna or protein}

I know that Alphabet was removed so I'm a little confused on how to check this with BioPython. I could always look up the IUPAC alphabet for DNA and proteins then check if the sequence set is a subset but I feel like there has to be a better way.

genomics biopython fasta • 1.0k views
ADD COMMENT
1
Entering edit mode

No need to look it up, it is already part of BioPython, see line 7 of BioPython's IUPACData.py file:

"""Information about the IUPAC alphabets."""
ADD REPLY
1
Entering edit mode

I don't think any major sophistication is needed here. Amino acid frequencies for ACGT are well below 10% for proteins:

A 0.074
C 0.025
G 0.074
T 0.051

If for any two of them one counts > 10% frequency, it has to be a nucleic acid.

Similarly, C frequency in proteins is 2.5% which pretty much guarantees that dipeptides AC, GC and TC are going to be < 0.5%. On the other hand, dinucleotides AC, GC and TC should all be > 3-5% in nucleic acids. If any two of them are > 5% it has to be a nucleic acid.

These common-sense rules may fail on short individual sequences of unusual composition, but I think they should work on longer sequences or a group of 5+ sequences.

ADD REPLY
0
Entering edit mode

I think possibly even more simply (though no idea which is more computationally efficient), one could simply check that the set() of characters = [A, T, C, G, X] (or whatever ambiguous character is being used). If the sets don't match it has to be protein (or vice versa).

ADD REPLY

Login before adding your answer.

Traffic: 1299 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6