As part of a script I am writing, I am trying to build a validator that checks the format of the input file, i.e. whether the input FASTA file contains protein or nucleic acid data.
I have made a start on this, but it is not as simple as a single match - e.g.
if match(/[^atcg]/i) ... for the DNA, since the sequence description lines will almost certainly contain other characters.
Since the input to the script will be either the full genome or the full proteome of a species, the files will be quite large. I was therefore planning to test only the first 500 sequences, as checking every sequence could take quite some time. Any opinions on this approach are welcome.
At the moment I plan to use BioRuby to take the first 500 sequences and then try matching each sequence -
if match(/[^atcg]/i) for the DNA, and write a similar check for protein.
But before I start on this properly (and this is the reason for this post), being a beginner, I was wondering whether there are any tools (e.g. within BioRuby) that would do this for me...
Since this will be part of a Ruby script, any answers need to be in Ruby.
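For reference, the plan described above could be sketched roughly as follows. This is only an illustration in plain Ruby, not BioRuby. The method names `each_fasta_sequence` and `guess_fasta_type`, the `ACGTUN` alphabet and the 0.9 threshold are all assumptions made for the example. It skips the `>` description lines, samples at most 500 records, and calls the file nucleic if nearly all residues are nucleotide letters. The sketch measures a fraction rather than requiring a strict match because ambiguity codes can appear in DNA, and because A, C, G and T are also valid amino acid letters.

```ruby
require 'stringio'

# Hypothetical helper: yields the sequence text (description line removed)
# of each FASTA record read from an IO object, stopping after `limit` records.
def each_fasta_sequence(io, limit: 500)
  return enum_for(:each_fasta_sequence, io, limit: limit) unless block_given?

  count = 0
  current = nil
  io.each_line do |line|
    line = line.strip
    next if line.empty?

    if line.start_with?('>')
      # A new description line closes the previous record, if there was one.
      unless current.nil?
        yield current
        count += 1
        return if count >= limit
      end
      current = +''
    elsif current
      current << line
    end
  end
  yield current if current && count < limit
end

# Assumed nucleotide alphabet: the four DNA bases plus U (RNA) and N (unknown).
NUCLEOTIDE_CHARS = 'ACGTUN'

# Hypothetical classifier: returns :nucleic, :protein or :unknown based on the
# fraction of letters in the sampled records that are nucleotide characters.
# The 0.9 threshold is an arbitrary assumption, not a validated cut-off.
def guess_fasta_type(io, limit: 500, threshold: 0.9)
  total = 0
  nucleic = 0
  each_fasta_sequence(io, limit: limit) do |seq|
    residues = seq.upcase.delete('^A-Z') # keep letters only (drops gaps, '*', digits)
    total += residues.length
    nucleic += residues.count(NUCLEOTIDE_CHARS)
  end
  return :unknown if total.zero?

  nucleic.fdiv(total) >= threshold ? :nucleic : :protein
end
```

With a real file this would be called as `File.open(path) { |f| guess_fasta_type(f) }`, where `path` is whatever input file the script receives. Because only the first 500 records are read, the cost does not grow with the size of the genome or proteome file.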