I am undergrad learning Perl programming for bioinformatics, and having problems writing a script. I wanted to know if any one has script that counts the number of sequences in a fastq but excludes everything else such as the line that begins with @ and + and the quality score. Thanks
Due to the vagaries of the FastQ definition parsing it correctly is a little trickier than one might initially think. This is due to the fact that the symbol indicating the start of a new record
@ is also a valid choice for quality measure. Thus one always needs to keep track of the lenght of the sequence and use that as guidance on how many quality measures to read in.
On the other hand most fastq files tend to be formatted with the entire sequence (and quality) on a single line. In these cases parsing is trivial as it turns into the problem of correctly identifying the 4 lines that form a fastq record. For example to find the sequences use a modulo division to find the remainder and print the line if the remainder is equal to 1 (assuming that the line numbering starts at 0).
All data produced by short read sequencers are in this latter format.
I made such a script here. You can see how I parsed it.
It converts multi-line fastq file entries to a four-line style.
@freddy Here is my super simple answer for a beginner using perl to count the number of reads in a fastq file. Of course there are perl and awk one-liners that get the job done, but the following script really spells things out for you. All you are doing is reading through every 4 lines and increasing the total read count by 1. The following code can be copy and pasted to make a complete perl script:
#!/usr/bin/perl -w ## This specifies the file as a perl script
$line_position = 0;
$count = 0;
open(INPUT,$ARGV) || die("Can't open file"); ## This opens the first user-privded argument as the input file.
while(<INPUT>) ## This reads each line of the file in, one at a time
if($line_position == 4) ## If the script has already seen 4 lines, then reset the line counter and add 1 to the read counter!
$line_position = 0;
print"Number of reads: $count\n"; ## This pring out your read counts!
An easier method would be to just do a UNIX line count on the file and divide by 4.
wc -l myFile.fastq. This is all assuming you have a standard .fastq that has 4 lines per read. Good luck!