Question: How To Write A Perl Script To Parse Fastq Files
0
gravatar for freddy
6.2 years ago by
freddy40
freddy40 wrote:

I am undergrad learning Perl programming for bioinformatics, and having problems writing a script. I wanted to know if any one has script that counts the number of sequences in a fastq but excludes everything else such as the line that begins with @ and + and the quality score. Thanks

perl fastq script • 6.4k views
ADD COMMENTlink modified 6.2 years ago by JacobS900 • written 6.2 years ago by freddy40
2

First try it yourself. If it doesnt work, try these posts. They are pretty relevant to what you are trying to do Parse fastq file - pad reads with N's

Fastq Quality Read and Score Length Check

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Ashutosh Pandey11k
  You can follow this approach:
  Step 1: use next if for skipping any line 
  Step 2: take a variable as string and concatenate every line to this strings then take the length of whole string.
  Always remember that perl always read file line by line
ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by always_learning1000
0
gravatar for Istvan Albert
6.2 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Due to the vagaries of the FastQ definition parsing it correctly is a little trickier than one might initially think. This is due to the fact that the symbol indicating the start of a new record @ is also a valid choice for quality measure. Thus one always needs to keep track of the lenght of the sequence and use that as guidance on how many quality measures to read in.

On the other hand most fastq files tend to be formatted with the entire sequence (and quality) on a single line. In these cases parsing is trivial as it turns into the problem of correctly identifying the 4 lines that form a fastq record. For example to find the sequences use a modulo division to find the remainder and print the line if the remainder is equal to 1 (assuming that the line numbering starts at 0).

All data produced by short read sequencers are in this latter format.

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Istvan Albert ♦♦ 81k
0
gravatar for Lee Katz
6.2 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

I made such a script here. You can see how I parsed it.

http://www.cbcb.umd.edu/software/PBcR/data/run_assembly_convertMultiFastqToStandard.pl

It converts multi-line fastq file entries to a four-line style.

ADD COMMENTlink written 6.2 years ago by Lee Katz3.0k
0
gravatar for JacobS
6.2 years ago by
JacobS900
Cleveland, Ohio
JacobS900 wrote:

@freddy Here is my super simple answer for a beginner using perl to count the number of reads in a fastq file. Of course there are perl and awk one-liners that get the job done, but the following script really spells things out for you. All you are doing is reading through every 4 lines and increasing the total read count by 1. The following code can be copy and pasted to make a complete perl script:

#!/usr/bin/perl -w ## This specifies the file as a perl script

$line_position = 0;

$count = 0;

open(INPUT,$ARGV[0]) || die("Can't open file"); ## This opens the first user-privded argument as the input file.

while(<INPUT>) ## This reads each line of the file in, one at a time

{

$line_position++;

if($line_position == 4) ## If the script has already seen 4 lines, then reset the line counter and add 1 to the read counter!

{

$line_position = 0;

$count++;

}

}

close(INPUT);

print"Number of reads: $count\n"; ## This pring out your read counts!

An easier method would be to just do a UNIX line count on the file and divide by 4. wc -l myFile.fastq. This is all assuming you have a standard .fastq that has 4 lines per read. Good luck!

ADD COMMENTlink written 6.2 years ago by JacobS900
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1111 users visited in the last hour