Question: Check If A File Is In Fasta Format
0
7.0 years ago by
sana.atique.khan0 wrote:

I am trying to write a program that asks for a file name (if an invalid file name is given, it asks again, up to 5 times in total) and then checks whether the file is in FASTA format.

How do I code that? I have the following code so far.

#!/usr/bin/perl
# A program that asks for a file name (up to five times) and checks
# that the file exists. The FASTA format check is still missing.
use strict;
use warnings;

# get the name of an existing file from the user
my $filename = openfile();

# subroutines
sub openfile {
    my $max_tries = 5;

    for my $try (1 .. $max_tries) {
        print "\n\nPlease enter file name: ";
        my $name = <STDIN>;
        last unless defined $name;    # stop at end of input
        chomp $name;

        if (-e $name) {
            print "File found!\n\n";
            return $name;             # return the name instead of calling exit
        }
        elsif ($try < $max_tries) {
            print "Invalid file name!\n\n";
        }
        else {
            print "Five tries were unsuccessful! Please check and try again!\n\n";
        }
    }
    return;                           # no valid file name was given
}
fasta perl • 7.4k views
ADD COMMENTlink modified 7.0 years ago by ngsgene350 • written 7.0 years ago by sana.atique.khan0
12
7.0 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

First, there is no need to reinvent the wheel. Use the SeqIO module from BioPerl:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "myfile.fa", -format => "fasta");
while(my $seq = $seqio->next_seq) {
  # do stuff with sequences...
}

If the fasta file is invalid, this code will throw an exception, for example:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')

Second, don't waste time checking for multiple incorrect attempts. Once is enough :)

ADD COMMENTlink written 7.0 years ago by Neilfws48k

Thanks!!! I will try it out!

ADD REPLYlink written 7.0 years ago by sana.atique.khan0
2
7.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

You could create a simple grammar for FASTA using GNU Bison:

%{
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int yylex();
int yyerror( char* message);
%}
%error-verbose
%token LT OTHER SYMBOL CR
%start input
%%

input:   input  sequence | optspaces sequence;
sequence: head body optspaces;
head: LT anylist CR | LT CR;
anylist: anylist any | any;
any: LT | OTHER | SYMBOL;
body: symbols CR | body symbols CR ;
symbols: symbols symbol | symbol ;
symbol: SYMBOL;
optspaces: | crlist;
crlist: crlist CR | CR;

%%
int yyerror( char* message)
    {
    fprintf(stderr,"NOT A FASTA %s\n",message);
    exit(EXIT_FAILURE);
    return -1;
    }
int yylex()
    {
    int c=fgetc(stdin);
    switch(c)
        {
        case EOF: return c;
        case '>' : return LT;
        case '\n' : return CR;
        default: return isalpha(c)?SYMBOL:OTHER;
        }
    }

int main(int argc, char** argv)
    {
    return yyparse();
    }

and use it to test if a file is a fasta file:

#compile
bison fasta.y
gcc -Wall -O3 fasta.tab.c

#test
$ ./a.out < ~/file.xml
NOT A FASTA syntax error, unexpected OTHER, expecting LT

$ ./a.out < ~/rotavirus.fasta
$
ADD COMMENTlink written 7.0 years ago by Pierre Lindenbaum119k
2

Coming up with a set of these for popular formats would actually be a nice addition to any Makefile or similar pipeline: "BioValidators by Pierre"

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Jeremy Leipzig18k
1
7.0 years ago by
ngsgene350
United States
ngsgene350 wrote:

Unless I am reading this too straightforwardly, you simply need to add an if that tests whether the file is in FASTA format, with the condition if ($l =~ />/)

If the line ($l) contains ">" you're good to go.

# DAT is a filehandle that has already been opened on the input file
while (my $l = <DAT>) {
    chomp $l;
    if ($l =~ />/) {
        # line contains ">": do this
    }
    elsif ($l !~ />/) {
        # line does not contain ">": do this
    }
}
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by ngsgene350
1

This is a little simplistic. You should at least check whether ">" is at the start of the first line, using /^>/. Also, there should be a check for no space after ">". And then there is the problem of valid sequence lines.
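
The checks listed above could be sketched roughly as follows in plain Perl. This is only a sketch under stated assumptions, not a full validator: the helper name fasta_problem is made up for illustration, blank lines are tolerated, and a valid sequence line is assumed to contain only letters, "*" and "-". Adjust the character class to whatever alphabet you actually expect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA text from an open filehandle and returns
# undef if it passes the checks, or a message describing the first problem.
sub fasta_problem {
    my ($fh) = @_;
    my $line_no   = 0;
    my $in_record = 0;    # true once a header line has been seen
    my $have_seq  = 0;    # true once the current record has sequence data

    while (my $line = <$fh>) {
        $line_no++;
        $line =~ s/\r?\n\z//;        # strip Unix or Windows line ending
        next if $line =~ /^\s*$/;    # assumption: blank lines are allowed

        if ($line =~ /^>/) {
            return "line $line_no: previous record has no sequence"
                if $in_record && !$have_seq;
            return "line $line_no: no identifier directly after '>'"
                unless $line =~ /^>\S/;
            ($in_record, $have_seq) = (1, 0);
        }
        elsif (!$in_record) {
            return "line $line_no: first non-blank line does not start with '>'";
        }
        elsif ($line =~ /^[A-Za-z*-]+$/) {    # assumed sequence alphabet
            $have_seq = 1;
        }
        else {
            return "line $line_no: invalid character in sequence line";
        }
    }
    return "no FASTA records found"      unless $in_record;
    return "last record has no sequence" unless $have_seq;
    return undef;
}

# Small demonstration on an in-memory file.
my $text = ">seq1 test\nACGT\nNNAC\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $problem = fasta_problem($fh);
print defined($problem) ? "NOT FASTA: $problem\n" : "Looks like FASTA\n";
close $fh;
```

For a real file, open it with open my $fh, '<', $filename and pass $fh in the same way. For anything beyond a quick sanity check, the Bio::SeqIO approach is still the safer route.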

ADD REPLYlink written 7.0 years ago by Neilfws48k
1

True, it's more useful for parsing a FASTA file when you already know it's a FASTA file.

ADD REPLYlink written 7.0 years ago by ngsgene350