How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?
How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?
Here is an efficient function written in Python:
dna = set("ATGC")
def validate(seq, alphabet=dna):
    "Checks that a sequence only contains values from an alphabet"
    leftover = set(seq.upper()) - alphabet
    return not leftover
# typical usage below
# this will print True
print validate('AAAATGCCG')
# this will print False
print validate('AAANTGCCG')
# using it with other alphabets
prot = set('ACDEFGHIKLMNPQRSTVWY')
print validate("mglsdgewql", alphabet=prot)
Use regular expressions. In python, you can do:
import re
def validate(seq, alphabet='dna'):
    """
    Check that a sequence only contains values from an alphabet
    >>> seq1 = 'acgatGAGGCATTtagcgatgcgatc'       # True for dna and protein
    >>> seq2 = 'FGFAGGAGFGAFFF'                   # False for dna, True for protein
    >>> seq3 = 'acacac '                          # should return False (a space is not a nucleotide)
    >>> validate(seq1, 'dna')
    True
    >>> validate(seq2, 'dna')
    False
    >>> validate(seq2, 'protein')
    True
    >>> validate(seq3, 'dna')
    False
    """
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I), 
             'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}
    if alphabets[alphabet].search(seq) is not None:
         return True
    else:
         return False
if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)
In BioPerl SeqIO will guess the alphabet of a sequence. There are often warnings that these guesses can be imperfect! This is an issue with many of the home-brew answers here, they don't use the full IUPAC code. BioPerl is aware of the full IUPAC alphabet, and I would guess BioPython etc are too.
If you have a sequence containing only ACGT this can be correctly identified as either amino acids or nucleotides. Both are correct. A solution will therefore have to estimate a likelihood, perhaps given an observed frequency of amino acid residues in proteins of your data set?. This can be simple sometimes- if it contains an E residue it can't be a valid nucleotide sequence. Other times you will really need to specify biological estimates of likelihood of encountering certain residues in a sequence of a given length. I don't know of any software that does this, hence all are a bit error prone, at least with short or biased-composition sequences.
Take home message; its not quite so simple as sometimes assumed. If you want a good basic guess, one of the Bio* projects wil probably be as good as anything.
From a theoretical point of view, to make this quickly, you can base your code on the specificity of each alphabet. (Here I suppose you just want to know if a sequence is protein, DNA or RNA, based on Giovanni's comment)
RNA contains 'U', and not the others.
Proteins have much more letters than DNA and RNA.
So here is a piece of code in C++
#include <iostream>
#include <string>
/* suppose you give the raw sequence in capital *
 * letters in argument of your program          */
using namespace std;
int main(int argc, char **argv){
  string dna_alphabet  = "ATCG";
  string rna_alphabet  = "AUCG";
  string prot_alphabet = "ATCGRNDEQHILKMFPSWYV";
  char * sequence = argv[1];
  int i = 0;
  while (sequence[i] != '\0'){
    if (sequence[i] == 'U') {
      cout << "This is a RNA sequence\n";
      return 0;
    }
    if (prot_alphabet.find(sequence[i]) < prot_alphabet.size() && dna_alphabet.find(sequence[i]) > dna_alphabet.size()) {
      cout << "This is a Protein sequence\n";
      return 0;
    }
      ++i;
  }
  cout << "This is a DNA sequence\n";
  return 0;
}
@Farhat: even a human cannot answer this question... so a computer ? who knows ;)
@Pierre: For large alphabet, using lower_bound suppose the alphabet is sorted. And if it is sorted, I think string::find have the same complexity as lower_bound (not really sure but it is logical).
In C, see the standard function strspn: http://www.cplusplus.com/reference/clibrary/cstring/strspn/
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
the idea of this discussion was to collect all the possible ways and tools to determine if a sequence is DNA or protein
If you give the programming language you want, we can answer more precisely ;)
@bilouweb: just write the solution in the programming language you prefer, other people will implement it in the other languages.