Question

Validation - Test What Type Of Data The Fasta Format Input File Contains (I.E. Whether Dna Or Protein Data)

0

Entering edit mode

10.6 years ago

IsmailM ▴ 110

Hi,

As part of a script I am writing, I am trying to make a validator to check what format the input file is - i.e whether the input fasta file contains protein data or nucleic data.

I have tried to start this, but it is not as simple as a simple match - e.g. if match(/[^a^t^c^g]/i) ... for the DNA since these characters will most probably be included in the sequence description.

Since the input for the script would be the either the full genome data or the full proteome data for a species, they will be quite large files. Thus, I was planning to just test the first 500 sequences (as otherwise, I suppose this could take quite some time with large files). But then again, any opinions.

At the moment I plan to use bioruby to take the 500 first sequences and then try matching the sequence - if match(/[^a^t^c^g]/i) for the dna and make a similar one for protein.

But Before I start on this properly, (and the reason behind this post), being a begginner I was wondering whether there are any tools (e.g. within bioruby) that would do this for me...

Many Thanks

Since this would be part of a ruby script, any answers need to be in ruby

sequence format • 5.2k views

ADD COMMENT • link 10.1 years ago by IsmailM ▴ 110

0

Entering edit mode

Count the frequency of A, T, G, and C !! :) If if sum is not equal to ~~ 1 then its not Nucleotide !!

ADD REPLY • link 10.6 years ago by always_learning ★ 1.1k

0

Entering edit mode

I am not sure exactly what you mean - isn't that differentiating between a single nucleotide and a sequence - I am trying to differentiate between protein sequence and DNA sequence not a sequence and a nucleotide...

ADD REPLY • link 10.6 years ago by IsmailM ▴ 110

0

Entering edit mode

Yes, I understand that !! First you will count occurrence of A and then occurrence of T and then occurrence of G and then occurence of C !! and then get the complete length also ..

if occurence of A + occurence of T + occurence of C + occurence of A = complete length, then its Nucletide sequence .

ADD REPLY • link 10.6 years ago by always_learning ★ 1.1k

0

Entering edit mode

Else start reading file from start and moment you will get any word apart from these A,T, G and C, then break it and print that its Amino acids !! As its highly impossible for to NOT getting any words apart from these 4 in amino acids for a length of say 1000 sequences !!

ADD REPLY • link 10.6 years ago by always_learning ★ 1.1k

score 1 · Answer 1 · 2013-10-01

1

Entering edit mode

10.6 years ago

Damian Kao 16k

Here are the possible codes for nucleotides and amino acids: http://www.bioinformatics.org/sms/iupac.html

As you can see, there are a lot of overlap between the two sets. The only letters that are not in protein and in nucleotide are 'B' and 'U'. You probably won't see 'U' that much.

The best way is to probably just count frequencies of ATGC as any good nucleotide sequence should be predominantly those 4 letters. Set the cut-off to something like 50%. I don't think there are any naturally occurring protein sequences that are 50% Ala, Gly, Cys, Thr. So if 50% of the sequence are ATGC, then it is nucleotide. This might not work for very short sequences though, so I would sample at least a few hundred letters.

It won't work in all cases, but it should work in most reasonable cases.

ADD COMMENT • link 10.6 years ago by Damian Kao 16k

1

Entering edit mode

The IUPAC and IUBMB recommendation documents, which describe the one letter encodings for nucleotide and amino-acid sequences, can be found at:

Nucleotides (1970): http://www.chem.qmul.ac.uk/iupac/misc/naabb.html#p3
Nucleotide ambiguity codes (1984): http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Amino-acids (1983): http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

From these you will notice that the amino-acid code 'B' is reserved for "aspartic acid or asparagine" due to uncertain result of hydrolysis during Edman sequencing, and the amino-acid code 'U' is used for selenocysteine. You may notice that 'J' and 'O' are missing from the amino-acid nomenclature described in these documents, these have been subsequently assigned as:

'J': leucine or isoleucine, an uncertain result of mass-spec sequencing
'O': pyrrolysine

A convenient summary of the current amino-acid one letter code can be found at: http://pir.georgetown.edu/resid/faq.shtml#q01

At this point any sequence of characters from the English alphabet is a valid protein sequence. Unfortunately this also includes any valid DNA sequence, and most RNA sequences (excepting those containing pseudouridine).

While it is possible to say that a sequence is not a nucleotide sequence, without additional information it is not possible to definitively say that a sequence is not protein sequence. That said the proportion approach you suggest will work in the majority of cases (EMBOSS uses something similar to guess the sequence type). However there are always exceptions that cause trouble, so it is a good idea to provide users with some way of explicitly specifying the sequence type.

ADD REPLY • link 10.6 years ago by Hamish ★ 3.2k

score 0 · Answer 2 · 2014-03-07

This is what I used in the end; Perhaps it will be helpful for someone else - There is a bioruby method Bio::Sequence#guess

So this is how you use it...

s = Bio::Sequence.new('atgc')
puts s.guess                            #=> Bio::Sequence::NA

You can also specify a threshold frequency - default is 90%

s = Bio::Sequence.new('atgcatgcqq')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.8)                       #=> Bio::Sequence::AA
puts s.guess(0.7)                       #=> Bio::Sequence::NA

You can even limit the guessing to the first 1000 or so characters...

s = Bio::Sequence.new(A VERY LONG SEQUENCE)
puts s.guess(0.9, 1000)  # limit the guess to the first 1000 positions

Hope that helps...