Question

perl count fasta sequences

0

Entering edit mode

8.8 years ago

cabraham03 ▴ 30

I have the next code that extract the sequences from a FASTQ file and convert in FASTA file; and the end make a count of the sequences available in the fasta file; the problem is the count is not Correct, the result is always less than the sequences available in the fasta output file.

FIRST SCRIPT

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Term::ANSIColor;

# Variables
#--------------------------------------------------------------------------------------------------------------------------
my ($imput, $output, $line, $usage, $help_desciption, $help, $options, $op_desciption);


GetOptions (
            'i=s' => \$imput,
            'o=s' => \$output,
            'h'   => \$help,
            'op'  => \$options
            );


# help
#--------------------------------------------------------------------------------------------------------------------------
$help_desciption = (qq(
                    This Script extract the sequences from a FASTQ file and convert it to FASTA format
                    The sequences must be in the same file. It will generate a fasta file with the sequences
                    in the same order and name that are available in the imput file.

                    Example:
                    fastxQA -i file.fastq -o file.fasta
                        or  for more options see
                    fastxQA -op
                    ));

if ($help) {
    print "$help_desciption\n\n";
    exit;
}

# options
#--------------------------------------------------------------------------------------------------------------------------
$op_desciption = (qq(
           -i  = input  file in a .fastq format                     required
           -o  = output file in a .fa, .fas, .fasta or .txt format  required
           -h  = help                                               optional
           -op = options                                            optional
           ));


if ($options) {
    print "$op_desciption\n\n";
    exit;
}

# Error
#--------------------------------------------------------------------------------------------------------------------------

$usage = (qq(
          Error:
             Wrong Arguments

             Usage:
             fastxQA -i infile.fastq -o utfile.fasta

));

if (!$imput or !$output) {
    print color("red"), "$usage", color("reset"),"\n\n";
    exit;
}

# Open Data
#--------------------------------------------------------------------------------------------------------------------------
open FASTQIN, '<', "$imput" or die (color("red"), "\n\t\tCan't open $imput file\n\t\t\t or\n\t\t Does not exist such file", color("reset"),"\n\n");
open FASTAOUT, '>', "$output" or die (color("red"), "Can't genenate $output file", color("reset"),"\n\n");

# Processing Message
#--------------------------------------------------------------------------------------------------------------------------
print "\n";
$| = 1;
print colored("\t\t  This Process Will Take a While  ", "blue blink");
#sleep 2;
print "\r";


# Extract sequences
#--------------------------------------------------------------------------------------------------------------------------
while(
defined(my $head = <FASTQIN>) &&
defined(my $seq = <FASTQIN>) &&
defined(my $qhead = <FASTQIN>) &&    # te
defined(my $quality = <FASTQIN>)     # cuarta linea
){
  substr($head, 0, 1, '>'); #0 elimina el primer caracter (@) de la primer linea; el 1 remplaza la cantidad de caracteres, este caso solo 1 caracter


  print FASTAOUT $head, $seq;

}

# Count sequences "HAY UN ERROR DA 10 SEQUENCIAS MENOS "
#--------------------------------------------------------------------------------------------------------------------------
open COUNTERSEQ, '<', "$output" or die (color("red"), "Can't open $output file", color("reset"),"\n\n");
my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g ) 
       {
             $count++; 
    }
}

# Close and Exit
#--------------------------------------------------------------------------------------------------------------------------
  close FASTQIN;
  close FASTAOUT;
  close COUNTERSEQ;
  print color("blue"),"                                      DONE                               \n\n", color("reset");
  print color("blue"),"              -  $count  sequences were converted to   $output File- \n\n", color("reset");
  exit;
  #--------------------------------------------------------------------------------------------------------------------------------------

However when I try the code for counting separately it works well:

SECOND SCRIPT

#!/usr/bin/perl -w
use strict;

my $infile =$ARGV[0];
open COUNTERSEQ, '<', $infile, or die;

my $count = 0;

while( <COUNTERSEQ> ){
    chomp $_;
       if($_=~ /^>/g )
       {
             $count++;
    }
}

print "$count\n";

Does anybody can help me to say what is the error in the connection in the first Script among the extracting section and the count section!?

Thanks so much!

fasta sequence perl • 4.2k views

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.8 years ago by cabraham03 ▴ 30

1

Entering edit mode

I didn't understand what you're trying to do, but if all you want is to count the number of sequences in a fasta file, you can just:

grep -c ">" input.fasta

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by arnstrm ★ 1.8k

0

Entering edit mode

The first script extract the sequences from FASTQ to FASTA and the the final message just print the total number of the sequences that are available in the FASTA file!

Thanks

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by cabraham03 ▴ 30

2

Entering edit mode

You could just do it with:

#fastq to fasta
cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n'
#Count fasta
grep -c ">" input.fasta

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by arnstrm ★ 1.8k

0

Entering edit mode

I think

grep -A 1 "@" | sed 's/^@/>/g'

would more or less do the trick, no?

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by swbarnes2 14k

1

Entering edit mode

No, because "@" is 31 in Phred+33 scale (even "^@" wont work, there might be a base at the beginning with a "@" score)

ADD REPLY • link 8.8 years ago by arnstrm ★ 1.8k

Ram · Accepted Answer · 2015-07-09

3

Entering edit mode

8.8 years ago

thackl ★ 3.0k

Try closing FASTAOUT before opening COUNTERSEQ. This might be a buffering issue.

ADD COMMENT • link 8.8 years ago by thackl ★ 3.0k

0

Entering edit mode

Thanks So much; It works well CLOSING before !!!

THANKS TO ALL OF YOU !!

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by cabraham03 ▴ 30

Ram · Accepted Answer · 2015-07-09

1

Entering edit mode

8.8 years ago

Zev.Kronenberg 12k

If this is a programming learning experience:

How To Parse Fasta Files In Perl

otherwise:

grep ">" your.fasta | wc -l

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by Zev.Kronenberg 12k

1

Entering edit mode

Do not miss the quotations marks or you will be sad.

ADD REPLY • link 8.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

or even:

grep -c '>' your.fasta

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by Chris Cole ▴ 800