Question: perl count fasta sequences
0
gravatar for cabraham03
4.3 years ago by
cabraham0320
Mexico
cabraham0320 wrote:

 

 

I have the next code that extract the sequences from a FASTQ file and convert in FASTA file; and the end make a count of the sequences available in the fasta file; the problem is the count is not Correct, the result is always less than the sequences available in the fasta output file.  

#FIRST SCRIPT

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Term::ANSIColor; 

# Variables
#--------------------------------------------------------------------------------------------------------------------------
my ($imput, $output, $line, $usage, $help_desciption, $help, $options, $op_desciption);


GetOptions (
            'i=s' => \$imput,
            'o=s' => \$output,
            'h'   => \$help,
            'op'  => \$options
            );


# help 
#--------------------------------------------------------------------------------------------------------------------------
$help_desciption = (qq(
                    This Script extract the sequences from a FASTQ file and convert it to FASTA format
                    The sequences must be in the same file. It will generate a fasta file with the sequences
                    in the same order and name that are available in the imput file.
                    
                    Example:
                    fastxQA -i file.fastq -o file.fasta
                        or  for more options see
                    fastxQA -op     
                    ));

if ($help) {
    print "$help_desciption\n\n";
    exit;
}

# options
#--------------------------------------------------------------------------------------------------------------------------
$op_desciption = (qq(
           -i  = input  file in a .fastq format                     required
           -o  = output file in a .fa, .fas, .fasta or .txt format  required
           -h  = help                                               optional
           -op = options                                            optional
           ));


if ($options) {
    print "$op_desciption\n\n";
    exit;
}

# Error
#--------------------------------------------------------------------------------------------------------------------------

$usage = (qq(
          Error: 
             Wrong Arguments
             
             Usage:
             fastxQA -i infile.fastq -o utfile.fasta

));

if (!$imput or !$output) {
    print color("red"), "$usage", color("reset"),"\n\n";
    exit;
}

 

# Open Data
#--------------------------------------------------------------------------------------------------------------------------
open FASTQIN, '<', "$imput" or die (color("red"), "\n\t\tCan't open $imput file\n\t\t\t or\n\t\t Does not exist such file", color("reset"),"\n\n"); 
open FASTAOUT, '>', "$output" or die (color("red"), "Can't genenate $output file", color("reset"),"\n\n");

 

# Processing Message 
#--------------------------------------------------------------------------------------------------------------------------
print "\n";
$| = 1;
print colored("\t\t  This Process Will Take a While  ", "blue blink");
#sleep 2;
print "\r";


# Extract sequences 
#--------------------------------------------------------------------------------------------------------------------------
while(
defined(my $head = <FASTQIN>) &&     
defined(my $seq = <FASTQIN>) &&      
defined(my $qhead = <FASTQIN>) &&    # te
defined(my $quality = <FASTQIN>)     # cuarta linea
){
  substr($head, 0, 1, '>'); #0 elimina el primer caracter (@) de la primer linea; el 1 remplaza la cantidad de caracteres, este caso solo 1 caracter
  
  
  print FASTAOUT $head, $seq;

}

# Count sequences "HAY UN ERROR DA 10 SEQUENCIAS MENOS "
#--------------------------------------------------------------------------------------------------------------------------
open COUNTERSEQ, '<', "$output" or die (color("red"), "Can't open $output file", color("reset"),"\n\n");
my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g ) 
       {
             $count++; 
    }
}

 

 

# Close and Exit
#--------------------------------------------------------------------------------------------------------------------------
  close FASTQIN;
  close FASTAOUT;
  close COUNTERSEQ;
  print color("blue"),"                                      DONE                               \n\n", color("reset");
  print color("blue"),"              -  $count  sequences were converted to   $output File- \n\n", color("reset");
  exit;
  #--------------------------------------------------------------------------------------------------------------------------------------

 

 

However when I try the code for counting  separately it works well: 

#----------------------------------------------------------------------------------------------------------------------------------------

#SECOND SCRIPT

#!/usr/bin/perl -w
use strict;

my $infile =$ARGV[0];
open COUNTERSEQ, '<', $infile, or die;

my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g ) 
       {
             $count++; 
    }
}

print "$count\n";

 

 

Does anybody can help me to say what is the error in the connection in the first Script among the extracting section and the count section !!???

Thanks So Much !!

sequence fasta perl • 2.6k views
ADD COMMENTlink modified 4.3 years ago by thackl2.7k • written 4.3 years ago by cabraham0320
1

I didn't understand what you're trying to do, but if all you want is to count the number of sequences in a fasta file, you can just:

grep -c ">" input.fasta

ADD REPLYlink written 4.3 years ago by arnstrm1.7k

the first script extract the sequences from FASTQ to FASTA and the the final message just print the total number of the sequences that are available in the FASTA file !!..... Thanks 

ADD REPLYlink written 4.3 years ago by cabraham0320
2

You could just do it with:

#fastq to fasta
cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n'
#Count fasta
grep -c ">" input.fasta
ADD REPLYlink written 4.3 years ago by arnstrm1.7k

I think

grep -A 1 "@" | sed 's/^@/>/g'

would more or less do the trick, no?

ADD REPLYlink written 4.3 years ago by swbarnes26.7k
1

No, because "@" is 31 in Phred+33 scale (even "^@" wont work, there might be a base at the beginning with a "@" score)

ADD REPLYlink written 4.3 years ago by arnstrm1.7k
3
gravatar for thackl
4.3 years ago by
thackl2.7k
MIT
thackl2.7k wrote:

Try closing FASTAOUT before opening COUNTERSEQ. This might be a buffering issue.

ADD COMMENTlink written 4.3 years ago by thackl2.7k

Thanks So much; It works well CLOSING before !!! 

THANKS TO ALL OF YOU !! 

ADD REPLYlink written 4.3 years ago by cabraham0320
1
gravatar for Zev.Kronenberg
4.3 years ago by
United States
Zev.Kronenberg11k wrote:

If this is a programming learning experience:

How To Parse Fasta Files In Perl

otherwise:

grep ">" your.fasta | wc -l

ADD COMMENTlink written 4.3 years ago by Zev.Kronenberg11k
1

Do not miss the quotations marks or you will be sad.

ADD REPLYlink written 4.3 years ago by Zev.Kronenberg11k

or even:

grep -c '>' your.fasta

ADD REPLYlink written 4.2 years ago by Chris Cole720
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2062 users visited in the last hour