I am sure there must be a tool out there that does this, and does it fast. Parsing each file with a custom script is an option, but I have big files and want something efficient. Too bad FastQC does not seem to provide this option.
zcat file.fq.gz | paste - - - - | cut -f2 | tr -d '\n' | wc -c
zcat: decompress the gzipped file to standard output (if the file is not compressed, just use cat)
paste - - - -: join every four consecutive lines into one tab-delimited row
cut -f2: keep only the second column (after paste, this is the second line of each FASTQ record, i.e. the sequence)
tr -d '\n': delete the newlines, so that they are not counted as bases
wc -c: count the remaining characters
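The same position-based idea can also be written as a single awk pass. This is only a minimal sketch: count_bases is a hypothetical helper name, and it assumes every record occupies exactly four lines (no wrapped sequences).

```shell
# Hypothetical helper: sum the length of the 2nd line of every 4-line record.
# Reads the files given as arguments, or standard input if there are none.
count_bases() {
  awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }' "$@"
}

# Tiny two-record example (4 + 5 bases).
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTN\n+\nIIIII\n' | count_bases   # prints 9
```

For a gzipped file this would be used as zcat file.fq.gz | count_bases. Because length() does not include the line terminator, no extra step is needed to drop the newlines.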
I'd probably go with something simple like:
grep -E '^[ACTGN]+$' | perl -pe 's/[[:space:]]//g' | wc -c
The assumption here is that you want to count all characters on all lines that consist only of the letters A, C, T, G, or N.
You can also use fastx_quality_stats from the FASTX-Toolkit. It reports the total number of bases, among other things.
This assumption is not correct, because quality lines can also consist entirely of letters such as C, G, A, T, or N, and then the characters of those quality lines are added to the count. Remember that the ASCII characters used to encode quality scores include all the letters of the alphabet! Samuel's awk script works perfectly. The problem is probably that your test data does not include the quality lines, or that they are not on the correct line, because in every FASTQ file each record has 4 lines, with the nucleotide sequence on the second.
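The objection can be checked on a tiny made-up record whose quality line happens to contain only the letter G. This is just an illustrative sketch: make_example, by_pattern and by_position are hypothetical helper names, and the record is invented test data.

```shell
# One invented record; its quality line (GGGG) is made of letters from [ACTGN].
make_example() { printf '@r1\nACGT\n+\nGGGG\n'; }

# Pattern-based count: every line consisting only of A, C, T, G or N.
by_pattern() { grep -E '^[ACTGN]+$' | tr -d '\n' | wc -c | tr -d ' '; }

# Position-based count: only the 2nd line of each 4-line record.
by_position() { awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }'; }

make_example | by_pattern    # prints 8: the quality line is counted too
make_example | by_position   # prints 4: only the sequence line is counted
```

How often a whole quality line falls inside that letter set depends on the data and on the quality encoding, so the size of the error will vary from file to file.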