Entering edit mode
12 months ago
O.rka
▴
740
I'm trying to develop a script that checks whether or not the input fasta is the correct molecule type but I want it to account for ambiguous characters.
Here's what I have for reading in fasta:
from Bio.SeqIO.FastaIO import SimpleFastaParser
with open(filepath, "r") as f:
for header, seq in SimpleFastaParser(f):
#func(seq) --> {dna or protein}
I know that Alphabet was removed so I'm a little confused on how to check this with BioPython. I could always look up the IUPAC alphabet for DNA and proteins then check if the sequence set is a subset but I feel like there has to be a better way.
No need to look it up, it is already part of BioPython, see line 7 of BioPython's
IUPACData.py
file:I don't think any major sophistication is needed here. Amino acid frequencies for ACGT are well below 10% for proteins:
If for any two of them one counts > 10% frequency, it has to be a nucleic acid.
Similarly, C frequency in proteins is 2.5% which pretty much guarantees that dipeptides AC, GC and TC are going to be < 0.5%. On the other hand, dinucleotides AC, GC and TC should all be > 3-5% in nucleic acids. If any two of them are > 5% it has to be a nucleic acid.
These common-sense rules may fail on short individual sequences of unusual composition, but I think they should work on longer sequences or a group of 5+ sequences.
I think possibly even more simply (though no idea which is more computationally efficient), one could simply check that the
set()
of characters =[A, T, C, G, X]
(or whatever ambiguous character is being used). If the sets don't match it has to be protein (or vice versa).