Question

List All The Tools Or Write A Script To Validate That A Sequence Only Contains Letters From A Given Alphabet

5

Entering edit mode

15.7 years ago

Biostar User ★ 1.0k

How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?

dna sequence • 14k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 15.7 years ago by Biostar User ★ 1.0k

0

Entering edit mode

the idea of this discussion was to collect all the possible ways and tools to determine if a sequence is DNA or protein

ADD REPLY • link 14.8 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

If you give the programming language you want, we can answer more precisely ;)

ADD REPLY • link 14.8 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

@bilouweb: just write the solution in the programming language you prefer, other people will implement it in the other languages.

ADD REPLY • link 14.8 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-03-02

7

Entering edit mode

15.7 years ago

Biostar User ★ 1.0k

Here is an efficient function written in Python:

dna = set("ATGC")
def validate(seq, alphabet=dna):
    "Checks that a sequence only contains values from an alphabet"
    leftover = set(seq.upper()) - alphabet
    return not leftover

# typical usage below

# this will print True
print validate('AAAATGCCG')

# this will print False
print validate('AAANTGCCG')

# using it with other alphabets
prot = set('ACDEFGHIKLMNPQRSTVWY')
print validate("mglsdgewql", alphabet=prot)

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 15.7 years ago by Biostar User ★ 1.0k

0

Entering edit mode

nice, but I would include the N in the dna set.

ADD REPLY • link 15.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

these are an old question and answer.. I would use regex

ADD REPLY • link 14.8 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 2 · 2011-01-21

Use regular expressions. In python, you can do:

import re

def validate(seq, alphabet='dna'):
    """
    Check that a sequence only contains values from an alphabet

    >>> seq1 = 'acgatGAGGCATTtagcgatgcgatc'       # True for dna and protein
    >>> seq2 = 'FGFAGGAGFGAFFF'                   # False for dna, True for protein
    >>> seq3 = 'acacac '                          # should return False (a space is not a nucleotide)
    >>> validate(seq1, 'dna')
    True
    >>> validate(seq2, 'dna')
    False
    >>> validate(seq2, 'protein')
    True
    >>> validate(seq3, 'dna')
    False

    """
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I), 
             'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}


    if alphabets[alphabet].search(seq) is not None:
         return True
    else:
         return False

if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)

Ram · Answer 3 · 2011-01-21

In BioPerl SeqIO will guess the alphabet of a sequence. There are often warnings that these guesses can be imperfect! This is an issue with many of the home-brew answers here, they don't use the full IUPAC code. BioPerl is aware of the full IUPAC alphabet, and I would guess BioPython etc are too.

If you have a sequence containing only ACGT this can be correctly identified as either amino acids or nucleotides. Both are correct. A solution will therefore have to estimate a likelihood, perhaps given an observed frequency of amino acid residues in proteins of your data set?. This can be simple sometimes- if it contains an E residue it can't be a valid nucleotide sequence. Other times you will really need to specify biological estimates of likelihood of encountering certain residues in a sequence of a given length. I don't know of any software that does this, hence all are a bit error prone, at least with short or biased-composition sequences.

Take home message; its not quite so simple as sometimes assumed. If you want a good basic guess, one of the Bio* projects wil probably be as good as anything.

Ram · Answer 4 · 2011-01-21

2

Entering edit mode

14.8 years ago

Bilouweb ★ 1.1k

From a theoretical point of view, to make this quickly, you can base your code on the specificity of each alphabet. (Here I suppose you just want to know if a sequence is protein, DNA or RNA, based on Giovanni's comment)

RNA contains 'U', and not the others.

Proteins have much more letters than DNA and RNA.

So here is a piece of code in C++

#include <iostream>
#include <string>

/* suppose you give the raw sequence in capital *
 * letters in argument of your program          */

using namespace std;

int main(int argc, char **argv){
  string dna_alphabet  = "ATCG";
  string rna_alphabet  = "AUCG";
  string prot_alphabet = "ATCGRNDEQHILKMFPSWYV";
  char * sequence = argv[1];

  int i = 0;
  while (sequence[i] != '\0'){
    if (sequence[i] == 'U') {
      cout << "This is a RNA sequence\n";
      return 0;
    }
    if (prot_alphabet.find(sequence[i]) < prot_alphabet.size() && dna_alphabet.find(sequence[i]) > dna_alphabet.size()) {
      cout << "This is a Protein sequence\n";
      return 0;
    }
      ++i;
  }
  cout << "This is a DNA sequence\n";
  return 0;
}

ADD COMMENT • link updated 7.2 years ago by Ram 45k • written 14.8 years ago by Bilouweb ★ 1.1k

1

Entering edit mode

Suppose you have a sequence "ACGAGCAGGC", is it RNA, DNA, or protein?

ADD REPLY • link 14.8 years ago by Farhat ★ 2.9k

0

Entering edit mode

i like the simple approach. just fyi, input " " gets called protein.

ADD REPLY • link 14.8 years ago by brentp 24k

0

Entering edit mode

Yep, it's a first draft ;)

ADD REPLY • link 14.8 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

using C++, (for a large alphabet ?) you can make it even faster by sorting your alphabet and using std::lower_bound

ADD REPLY • link 14.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Farhat: even a human cannot answer this question... so a computer ? who knows ;)

@Pierre: For large alphabet, using lower_bound suppose the alphabet is sorted. And if it is sorted, I think string::find have the same complexity as lower_bound (not really sure but it is logical).

ADD REPLY • link 14.8 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

@Pierre : lower_bound uses a dichotomy approach so it is better for large alphabets.

ADD REPLY • link 14.8 years ago by Bilouweb ★ 1.1k

score 1 · Answer 5 · 2011-01-21

1

Entering edit mode

14.8 years ago

Pierre Lindenbaum 166k

In C, see the standard function strspn: http://www.cplusplus.com/reference/clibrary/cstring/strspn/

ADD COMMENT • link 14.8 years ago by Pierre Lindenbaum 166k