Question: How To Write A Perl Script To Parse Fastq Files
6.9 years ago by
freddy40 wrote:

I am an undergrad learning Perl programming for bioinformatics, and I am having problems writing a script. Does anyone have a script that counts the number of sequences in a fastq file but excludes everything else, such as the lines that begin with @ and + and the quality scores? Thanks

perl fastq script • 6.7k views
modified 6.9 years ago by JacobS920 • written 6.9 years ago by freddy40

First try it yourself. If it doesn't work, try these posts; they are pretty relevant to what you are trying to do:

Parse fastq file - pad reads with N's

Fastq Quality Read and Score Length Check

modified 9 months ago by RamRS30k • written 6.9 years ago by Ashutosh Pandey12k
  You can follow this approach:
  Step 1: use "next if" to skip any line you do not need.
  Step 2: declare a string variable, append every remaining line to it, then take the length of the whole string.
  Remember that Perl reads a file line by line when you loop over a filehandle.
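
A minimal sketch of the two steps above, assuming a standard fastq file with exactly 4 lines per record. It skips lines by their position rather than by their first character, because a quality line can itself begin with @ or +. The sub name tally_sequences and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the sequence lines from a four-line-per-record FASTQ filehandle.
# Returns (number of sequences, total number of bases).
sub tally_sequences {
    my ($fh) = @_;
    my $all_bases = '';   # Step 2: every sequence line is appended here
    my $n_seqs    = 0;
    my $line_no   = 0;    # 0-based position of the current line
    while (my $line = <$fh>) {
        # Step 1: skip the header, '+' separator and quality lines by position.
        next if $line_no++ % 4 != 1;
        chomp $line;
        $all_bases .= $line;
        $n_seqs++;
    }
    return ($n_seqs, length $all_bases);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    my ($n_seqs, $n_bases) = tally_sequences($in);
    close $in;
    print "Sequences: $n_seqs\nTotal bases: $n_bases\n";
}
```

Note that the length of the concatenated string is the total number of bases, so the sketch keeps a separate counter for the number of sequences.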
modified 6.9 years ago • written 6.9 years ago by always_learning1.0k
6.9 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

Due to the vagaries of the FastQ definition, parsing it correctly is a little trickier than one might initially think. The symbol that indicates the start of a new record, @, is also a valid quality character. Thus one always needs to keep track of the length of the sequence and use it to decide how many quality characters to read in.
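
As a rough illustration of that length-tracking idea, here is a sketch that counts records even when the sequence and quality wrap over several lines. It assumes every record has a header line starting with @, a separator line starting with +, and as many quality characters as bases. The sub name count_multiline_records is hypothetical, and this is not a full validator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count records in a FASTQ filehandle whose sequence and quality may
# wrap over several lines. Returns the number of records.
sub count_multiline_records {
    my ($fh) = @_;
    my $count = 0;
    while (my $header = <$fh>) {
        next if $header =~ /^\s*$/;    # tolerate blank lines between records
        die "Expected '\@' header at line $.\n" unless $header =~ /^\@/;

        # Sequence lines run until the '+' separator line.
        my $seq      = '';
        my $saw_plus = 0;
        while (my $line = <$fh>) {
            $line =~ s/\s+\z//;
            if ($line =~ /^\+/) { $saw_plus = 1; last; }
            $seq .= $line;
        }
        die "Missing '+' separator near line $.\n" unless $saw_plus;

        # Read quality lines until as many symbols as bases have been seen,
        # so a quality line starting with '@' is never taken for a header.
        my $qual = '';
        while (length($qual) < length($seq)) {
            my $line = <$fh>;
            die "Truncated quality string near line $.\n" unless defined $line;
            $line =~ s/\s+\z//;
            $qual .= $line;
        }
        die "Quality and sequence lengths differ near line $.\n"
            if length($qual) != length($seq);
        $count++;
    }
    return $count;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    print "Number of reads: ", count_multiline_records($in), "\n";
    close $in;
}
```

Simply counting the lines that start with @ would overcount any file in which a quality line happens to begin with that character, which is why the quality loop is driven by the sequence length instead.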

On the other hand, most fastq files tend to be formatted with the entire sequence (and quality) on a single line. In these cases parsing is trivial, as it reduces to correctly identifying the 4 lines that form a fastq record. For example, to find the sequences, divide the line number by 4 and print the line if the remainder is equal to 1 (assuming that the line numbering starts at 0).

All data produced by short read sequencers are in this latter format.
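
The modulo test described above might look like the following sketch, assuming 4 lines per record and line numbering that starts at 0. The sub name sequence_lines is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the sequence lines of a four-line-per-record FASTQ filehandle.
sub sequence_lines {
    my ($fh) = @_;
    my @seqs;
    my $index = 0;    # 0-based line number
    while (my $line = <$fh>) {
        chomp $line;
        # Lines 1, 5, 9, ... leave a remainder of 1 and hold the sequences.
        push @seqs, $line if $index % 4 == 1;
        $index++;
    }
    return @seqs;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    print "$_\n" for sequence_lines($in);
    close $in;
}
```

Counting the elements of the returned list gives the number of sequences. For very large files it would be lighter on memory to print or count inside the loop instead of building the list.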

modified 6.9 years ago • written 6.9 years ago by Istvan Albert ♦♦ 84k
6.9 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

I made such a script here. You can see how I parsed it.

It converts multi-line fastq file entries to a four-line style.

written 6.9 years ago by Lee Katz3.0k
6.9 years ago by
Cleveland, Ohio
JacobS920 wrote:

@freddy Here is my super simple answer for a beginner using perl to count the number of reads in a fastq file. Of course there are perl and awk one-liners that get the job done, but the following script really spells things out for you. All you are doing is reading through every 4 lines and increasing the total read count by 1. The following code can be copied and pasted to make a complete perl script:

#!/usr/bin/perl -w ## This specifies the file as a perl script

$line_position = 0; ## Number of lines seen so far in the current read
$count = 0; ## Total number of reads

open(INPUT, '<', $ARGV[0]) || die("Can't open file"); ## This opens the first user-provided argument as the input file.

while(<INPUT>) ## This reads each line of the file in, one at a time
{
    $line_position++; ## Count the line that was just read
    if($line_position == 4) ## If the script has already seen 4 lines, then reset the line counter and add 1 to the read counter!
    {
        $line_position = 0;
        $count++;
    }
}

close(INPUT);

print "Number of reads: $count\n"; ## This prints out your read counts!

An easier method would be to run a UNIX line count on the file, wc -l myFile.fastq, and divide the result by 4. This all assumes you have a standard .fastq file that has 4 lines per read. Good luck!
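
For completeness, the same line-count-and-divide idea can be sketched in Perl. It again assumes 4 lines per read and only warns when the line count is not a multiple of 4. The sub name reads_from_line_count is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate the number of reads as (number of lines) / 4.
sub reads_from_line_count {
    my ($fh) = @_;
    my $lines = 0;
    while (defined(my $line = <$fh>)) {
        $lines++;
    }
    # A remainder suggests a truncated file or multi-line records.
    warn "Line count $lines is not a multiple of 4\n" if $lines % 4;
    return int($lines / 4);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    print "Number of reads: ", reads_from_line_count($in), "\n";
    close $in;
}
```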

written 6.9 years ago by JacobS920