List All The Tools Or Write A Script To Validate That A Sequence Only Contains Letters From A Given Alphabet
5
5
Entering edit mode
11.9 years ago
Biostar User ★ 1.0k

How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?

dna sequence • 8.9k views
ADD COMMENT
0
Entering edit mode

the idea of this discussion was to collect all the possible ways and tools to determine if a sequence is DNA or protein

ADD REPLY
0
Entering edit mode

If you give the programming language you want, we can answer more precisely ;)

ADD REPLY
0
Entering edit mode

@bilouweb: just write the solution in the programming language you prefer, other people will implement it in the other languages.

ADD REPLY
7
Entering edit mode
11.9 years ago
Biostar User ★ 1.0k

Here is an efficient function written in Python:

dna = set("ATGC")
def validate(seq, alphabet=dna):
    "Checks that a sequence only contains values from an alphabet"
    leftover = set(seq.upper()) - alphabet
    return not leftover

# typical usage below

# this will print True
print validate('AAAATGCCG')

# this will print False
print validate('AAANTGCCG')

# using it with other alphabets
prot = set('ACDEFGHIKLMNPQRSTVWY')
print validate("mglsdgewql", alphabet=prot)
ADD COMMENT
0
Entering edit mode

nice, but I would include the N in the dna set.

ADD REPLY
0
Entering edit mode

these are an old question and answer.. I would use regex

ADD REPLY
3
Entering edit mode
11.0 years ago

Use regular expressions. In python, you can do:

import re

def validate(seq, alphabet='dna'):
    """
    Check that a sequence only contains values from an alphabet

    >>> seq1 = 'acgatGAGGCATTtagcgatgcgatc'       # True for dna and protein
    >>> seq2 = 'FGFAGGAGFGAFFF'                   # False for dna, True for protein
    >>> seq3 = 'acacac '                          # should return False (a space is not a nucleotide)
    >>> validate(seq1, 'dna')
    True
    >>> validate(seq2, 'dna')
    False
    >>> validate(seq2, 'protein')
    True
    >>> validate(seq3, 'dna')
    False

    """
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I), 
             'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}


    if alphabets[alphabet].search(seq) is not None:
         return True
    else:
         return False

if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)
ADD COMMENT
3
Entering edit mode
11.0 years ago
Dave Lunt ★ 2.0k

In BioPerl SeqIO will guess the alphabet of a sequence. There are often warnings that these guesses can be imperfect! This is an issue with many of the home-brew answers here, they don't use the full IUPAC code. BioPerl is aware of the full IUPAC alphabet, and I would guess BioPython etc are too.

If you have a sequence containing only ACGT this can be correctly identified as either amino acids or nucleotides. Both are correct. A solution will therefore have to estimate a likelihood, perhaps given an observed frequency of amino acid residues in proteins of your data set?. This can be simple sometimes- if it contains an E residue it can't be a valid nucleotide sequence. Other times you will really need to specify biological estimates of likelihood of encountering certain residues in a sequence of a given length. I don't know of any software that does this, hence all are a bit error prone, at least with short or biased-composition sequences.

Take home message; its not quite so simple as sometimes assumed. If you want a good basic guess, one of the Bio* projects wil probably be as good as anything.

ADD COMMENT
2
Entering edit mode
11.0 years ago
Bilouweb ★ 1.1k

From a theoretical point of view, to make this quickly, you can base your code on the specificity of each alphabet. (Here I suppose you just want to know if a sequence is protein, DNA or RNA, based on Giovanni's comment)

RNA contains 'U', and not the others.

Proteins have much more letters than DNA and RNA.

So here is a piece of code in C++

#include <iostream>
#include <string>

/* suppose you give the raw sequence in capital *
 * letters in argument of your program          */

using namespace std;

int main(int argc, char **argv){
  string dna_alphabet  = "ATCG";
  string rna_alphabet  = "AUCG";
  string prot_alphabet = "ATCGRNDEQHILKMFPSWYV";
  char * sequence = argv[1];

  int i = 0;
  while (sequence[i] != '\0'){
    if (sequence[i] == 'U') {
      cout << "This is a RNA sequence\n";
      return 0;
    }
    if (prot_alphabet.find(sequence[i]) < prot_alphabet.size() && dna_alphabet.find(sequence[i]) > dna_alphabet.size()) {
      cout << "This is a Protein sequence\n";
      return 0;
    }
      ++i;
  }
  cout << "This is a DNA sequence\n";
  return 0;
}
ADD COMMENT
1
Entering edit mode

Suppose you have a sequence "ACGAGCAGGC", is it RNA, DNA, or protein?

ADD REPLY
0
Entering edit mode

i like the simple approach. just fyi, input " " gets called protein.

ADD REPLY
0
Entering edit mode

Yep, it's a first draft ;)

ADD REPLY
0
Entering edit mode

using C++, (for a large alphabet ?) you can make it even faster by sorting your alphabet and using std::lower_bound

ADD REPLY
0
Entering edit mode

@Farhat: even a human cannot answer this question... so a computer ? who knows ;)

@Pierre: For large alphabet, using lower_bound suppose the alphabet is sorted. And if it is sorted, I think string::find have the same complexity as lower_bound (not really sure but it is logical).

ADD REPLY
0
Entering edit mode

@Pierre : lower_bound uses a dichotomy approach so it is better for large alphabets.

ADD REPLY
1
Entering edit mode
11.0 years ago

In C, see the standard function strspn: http://www.cplusplus.com/reference/clibrary/cstring/strspn/

ADD COMMENT

Login before adding your answer.

Traffic: 2402 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6