List All The Tools Or Write A Script To Validate That A Sequence Only Contains Letters From A Given Alphabet
5
5
Entering edit mode
11.9 years ago
Biostar User ★ 1.0k

How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?

dna sequence • 8.9k views
0
Entering edit mode

the idea of this discussion was to collect all the possible ways and tools to determine if a sequence is DNA or protein

0
Entering edit mode

If you give the programming language you want, we can answer more precisely ;)

0
Entering edit mode

@bilouweb: just write the solution in the programming language you prefer, other people will implement it in the other languages.

7
Entering edit mode
11.9 years ago
Biostar User ★ 1.0k

Here is an efficient function written in Python:

dna = set("ATGC")
def validate(seq, alphabet=dna):
"Checks that a sequence only contains values from an alphabet"
leftover = set(seq.upper()) - alphabet
return not leftover

# typical usage below

# this will print True
print validate('AAAATGCCG')

# this will print False
print validate('AAANTGCCG')

# using it with other alphabets
prot = set('ACDEFGHIKLMNPQRSTVWY')
print validate("mglsdgewql", alphabet=prot)

0
Entering edit mode

nice, but I would include the N in the dna set.

0
Entering edit mode

these are an old question and answer.. I would use regex

3
Entering edit mode
11.0 years ago

Use regular expressions. In python, you can do:

import re

def validate(seq, alphabet='dna'):
"""
Check that a sequence only contains values from an alphabet

>>> seq1 = 'acgatGAGGCATTtagcgatgcgatc'       # True for dna and protein
>>> seq2 = 'FGFAGGAGFGAFFF'                   # False for dna, True for protein
>>> seq3 = 'acacac '                          # should return False (a space is not a nucleotide)
>>> validate(seq1, 'dna')
True
>>> validate(seq2, 'dna')
False
>>> validate(seq2, 'protein')
True
>>> validate(seq3, 'dna')
False

"""
alphabets = {'dna': re.compile('^[acgtn]*$', re.I), 'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}

if alphabets[alphabet].search(seq) is not None:
return True
else:
return False

if __name__ == "__main__":
import doctest
doctest.testmod(verbose=True)

3
Entering edit mode
11.0 years ago
Dave Lunt ★ 2.0k

In BioPerl SeqIO will guess the alphabet of a sequence. There are often warnings that these guesses can be imperfect! This is an issue with many of the home-brew answers here, they don't use the full IUPAC code. BioPerl is aware of the full IUPAC alphabet, and I would guess BioPython etc are too.

If you have a sequence containing only ACGT this can be correctly identified as either amino acids or nucleotides. Both are correct. A solution will therefore have to estimate a likelihood, perhaps given an observed frequency of amino acid residues in proteins of your data set?. This can be simple sometimes- if it contains an E residue it can't be a valid nucleotide sequence. Other times you will really need to specify biological estimates of likelihood of encountering certain residues in a sequence of a given length. I don't know of any software that does this, hence all are a bit error prone, at least with short or biased-composition sequences.

Take home message; its not quite so simple as sometimes assumed. If you want a good basic guess, one of the Bio* projects wil probably be as good as anything.

2
Entering edit mode
11.0 years ago
Bilouweb ★ 1.1k

From a theoretical point of view, to make this quickly, you can base your code on the specificity of each alphabet. (Here I suppose you just want to know if a sequence is protein, DNA or RNA, based on Giovanni's comment)

RNA contains 'U', and not the others.

Proteins have much more letters than DNA and RNA.

So here is a piece of code in C++

#include <iostream>
#include <string>

/* suppose you give the raw sequence in capital *
* letters in argument of your program          */

using namespace std;

int main(int argc, char **argv){
string dna_alphabet  = "ATCG";
string rna_alphabet  = "AUCG";
string prot_alphabet = "ATCGRNDEQHILKMFPSWYV";
char * sequence = argv[1];

int i = 0;
while (sequence[i] != '\0'){
if (sequence[i] == 'U') {
cout << "This is a RNA sequence\n";
return 0;
}
if (prot_alphabet.find(sequence[i]) < prot_alphabet.size() && dna_alphabet.find(sequence[i]) > dna_alphabet.size()) {
cout << "This is a Protein sequence\n";
return 0;
}
++i;
}
cout << "This is a DNA sequence\n";
return 0;
}

1
Entering edit mode

Suppose you have a sequence "ACGAGCAGGC", is it RNA, DNA, or protein?

0
Entering edit mode

i like the simple approach. just fyi, input " " gets called protein.

0
Entering edit mode

Yep, it's a first draft ;)

0
Entering edit mode

using C++, (for a large alphabet ?) you can make it even faster by sorting your alphabet and using std::lower_bound

0
Entering edit mode

@Farhat: even a human cannot answer this question... so a computer ? who knows ;)

@Pierre: For large alphabet, using lower_bound suppose the alphabet is sorted. And if it is sorted, I think string::find have the same complexity as lower_bound (not really sure but it is logical).

0
Entering edit mode

@Pierre : lower_bound uses a dichotomy approach so it is better for large alphabets.

1
Entering edit mode
11.0 years ago

In C, see the standard function strspn: http://www.cplusplus.com/reference/clibrary/cstring/strspn/