Question: Validation - Test What Type Of Data The Fasta Format Input File Contains (I.E. Whether Dna Or Protein Data)
gravatar for IsmailM
6.2 years ago by
UK/London/University College London
IsmailM110 wrote:


As part of a script I am writing, I am trying to make a validator to check what format the input file is - i.e whether the input fasta file contains protein data or nucleic data.

I have tried to start this, but it is not as simple as a simple match - e.g. if match(/[^a^t^c^g]/i) ... for the DNA since these characters will most probably be included in the sequence description.

Since the input for the script would be the either the full genome data or the full proteome data for a species, they will be quite large files. Thus, I was planning to just test the first 500 sequences (as otherwise, I suppose this could take quite some time with large files). But then again, any opinions.

At the moment I plan to use bioruby to take the 500 first sequences and then try matching the sequence - if match(/[^a^t^c^g]/i) for the dna and make a similar one for protein.

But Before I start on this properly, (and the reason behind this post), being a begginner I was wondering whether there are any tools (e.g. within bioruby) that would do this for me...

Many Thanks

Since this would be part of a ruby script, any answers need to be in ruby

sequence format • 3.4k views
ADD COMMENTlink modified 5.8 years ago • written 6.2 years ago by IsmailM110

Count the frequency of A, T, G, and C !! :) If if sum is not equal to ~~ 1 then its not Nucleotide !!

ADD REPLYlink written 6.2 years ago by always_learning1000

I am not sure exactly what you mean - isn't that differentiating between a single nucleotide and a sequence - I am trying to differentiate between protein sequence and DNA sequence not a sequence and a nucleotide...

ADD REPLYlink written 6.2 years ago by IsmailM110

Yes, I understand that !! First you will count occurrence of A and then occurrence of T and then occurrence of G and then occurence of C !! and then get the complete length also ..

if occurence of A + occurence of T + occurence of C + occurence of A = complete length, then its Nucletide sequence .

ADD REPLYlink written 6.2 years ago by always_learning1000

Else start reading file from start and moment you will get any word apart from these A,T, G and C, then break it and print that its Amino acids !! As its highly impossible for to NOT getting any words apart from these 4 in amino acids for a length of say 1000 sequences !!

ADD REPLYlink written 6.2 years ago by always_learning1000
gravatar for Damian Kao
6.2 years ago by
Damian Kao15k
Damian Kao15k wrote:

Here are the possible codes for nucleotides and amino acids:

As you can see, there are a lot of overlap between the two sets. The only letters that are not in protein and in nucleotide are 'B' and 'U'. You probably won't see 'U' that much.

The best way is to probably just count frequencies of ATGC as any good nucleotide sequence should be predominantly those 4 letters. Set the cut-off to something like 50%. I don't think there are any naturally occurring protein sequences that are 50% Ala, Gly, Cys, Thr. So if 50% of the sequence are ATGC, then it is nucleotide. This might not work for very short sequences though, so I would sample at least a few hundred letters.

It won't work in all cases, but it should work in most reasonable cases.

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Damian Kao15k

The IUPAC and IUBMB recommendation documents, which describe the one letter encodings for nucleotide and amino-acid sequences, can be found at:

From these you will notice that the amino-acid code 'B' is reserved for "aspartic acid or asparagine" due to uncertain result of hydrolysis during Edman sequencing, and the amino-acid code 'U' is used for selenocysteine. You may notice that 'J' and 'O' are missing from the amino-acid nomenclature described in these documents, these have been subsequently assigned as:

  • 'J': leucine or isoleucine, an uncertain result of mass-spec sequencing
  • 'O': pyrrolysine

A convenient summary of the current amino-acid one letter code can be found at:

At this point any sequence of characters from the English alphabet is a valid protein sequence. Unfortunately this also includes any valid DNA sequence, and most RNA sequences (excepting those containing pseudouridine).

While it is possible to say that a sequence is not a nucleotide sequence, without additional information it is not possible to definitively say that a sequence is not protein sequence. That said the proportion approach you suggest will work in the majority of cases (EMBOSS uses something similar to guess the sequence type). However there are always exceptions that cause trouble, so it is a good idea to provide users with some way of explicitly specifying the sequence type.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Hamish3.1k
gravatar for IsmailM
5.8 years ago by
UK/London/University College London
IsmailM110 wrote:

This is what I used in the end; Perhaps it will be helpful for someone else - There is a bioruby method Bio::Sequence#guess

So this is how you use it...

s ='atgc')
puts s.guess                            #=> Bio::Sequence::NA

You can also specify a threshold frequency - default is 90%

s ='atgcatgcqq')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.8)                       #=> Bio::Sequence::AA
puts s.guess(0.7)                       #=> Bio::Sequence::NA

You can even limit the guessing to the first 1000 or so characters...

puts s.guess(0.9, 1000)  # limit the guess to the first 1000 positions

Hope that helps...

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by IsmailM110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1866 users visited in the last hour