I am sure there must be a tool out there that does this (count the total number of bases in a fastq file), and does it fast. Parsing each file with a custom script is an option, but I have big files and want something efficient. Too bad FastQC does not seem to provide this option.
zcat file.fq.gz | paste - - - - | cut -f2 | tr -d '\n' | wc -c
zcat: print gzipped file (if the file is not zipped, just use cat)
paste - - - -: print four consecutive lines in one row (tab delimited)
cut -f2: print only the second column (after paste this is the second line of the fastq-format, meaning the sequence)
tr -d '\n': delete the newline characters so they are not included in the count
wc -c: count the remaining characters, i.e. the bases
awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}' file.fq
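If the files are gzipped, as in the original question, the same one-liner can read from a decompression stream; a minimal sketch (the file name is just an example):
zcat file.fq.gz | awk 'NR%4==2{sum+=length($0)}END{print sum}'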
Thanks, but I am a little unsure whether this works 100% properly... As a test, I took a raw fastq file that I had run through FastQC. FastQC claims that all the reads are 100 bp, so I calculated the number of bases by simply multiplying the number of reads by 100. The number I get that way is 676,539 bp bigger than the one from running your awk command on the file (994,675,300 vs 993,998,761). Do you have any idea about the source of this discrepancy?
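One way to track down a discrepancy like this is to look at the actual read-length distribution instead of relying on the reported 100 bp; a quick sketch in the same awk style (file.fq is again just a placeholder name):
awk 'NR%4==2{print length($0)}' file.fq | sort -n | uniq -c
If every read really is 100 bp, this prints a single line; otherwise it shows how many reads have each length.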
I'd probably go with something simple like:
grep -E '^[ACTGN]+$' file.fq | perl -pe 's/[[:space:]]//g' | wc -c
The assumption here is that you want to count all characters on all lines that contain only one of ACTG or N.
You can also use fastx_quality_stats from the fastx toolkit. It reports the total number of bases, among other things.
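A minimal usage sketch, assuming the FASTX-Toolkit is installed; -i and -o name the input fastq and the output statistics table, and depending on your quality encoding you may also need the -Q33 option:
fastx_quality_stats -i file.fq -o file_stats.txt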
Cool, I never thought about the added cost of using extended regular expressions. However, your code is not as safe if you test only for the first character on the line. Fastq files vary, and some may even have sequence names starting with one of ACTGN, in which case you would end up counting the characters of unwanted lines. I tend to go with safer over faster; I find it actually saves a LOT of time down the line.
This assumption is not correct, because many quality lines start with letters like C, G, A, T or N, so you would also be adding to the count the characters of quality lines that begin with those values; remember that the ASCII quality encodings include all the letters of the alphabet! Samuel's awk script works perfectly; if it gives a different result, the problem is likely that your test data does not include the quality data, or that it is not on the correct line, because every fastq record has four lines, with the nucleotide sequence on the second one.
Correct; the only safe approach is to require 4-line, properly-formatted fastq. reformat.sh uses this approach, and will additionally verify certain properties, like the @ and + symbols being in the right places and the number of bases equaling the number of quality scores, to help ensure the input is valid.
$ reformat.sh in=reads.fq
Input: 4000000 reads 297852680 bases
Time: 2.339 seconds.
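Since the question mentions gzipped files: reformat.sh reads compressed input directly, so something like the following should work as well (a sketch, with the file name as a placeholder):
reformat.sh in=reads.fq.gz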
Please try this: https://github.com/billzt/readfq Smart and Fast!!!
Is the read length the same for all entries in the fastq file? In that case, you could just count the number of lines with a unix command like 'wc -l' which is probably as fast as it gets. Then divide the result by 4 and multiply by the read length.
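A minimal sketch of that approach, assuming an uncompressed file and a fixed read length of 100 bp (both are just example values):
echo $(( $(wc -l < file.fq) / 4 * 100 ))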