Question: Counting Number Of Bases In A Fastq File
2
gravatar for DoubleDecker
5.1 years ago by
DoubleDecker140
United Kingdom
DoubleDecker140 wrote:

I am sure there must be a tool out there that does it and does it fast? Parsing each file with a custom script is an option but I have big files and want something efficient. Too bad FastQC does not seem to provide this option.

fastq ngs • 16k views
ADD COMMENTlink modified 3.1 years ago by billzt20 • written 5.1 years ago by DoubleDecker140
1

Is the read length the same for all entries in the fastq file? In that case, you could just count the number of lines with a unix command like 'wc -l' which is probably as fast as it gets. Then divide the result by 4 and multiply by the read length.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by KCC3.9k
7
gravatar for Lars
3.7 years ago by
Lars560
Cologne
Lars560 wrote:
zcat file.fq.gz | paste - - - - | cut -f2 | wc -c

zcat: print gzipped file (if the file is not zipped, just use cat)

paste - - - -: print four consecutive lines in one row (tab delimited)

cut -f2: print only the second column (after paste this is the second line of the fastq-format, meaning the sequence)

wc -c: count the characters

ADD COMMENTlink written 3.7 years ago by Lars560
9

This will also count newline characters, remove those first.

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c 

 

ADD REPLYlink written 3.6 years ago by bv21390
3
gravatar for Gabriel R.
5.1 years ago by
Gabriel R.2.5k
Center for Geogenetik Københavns Universitet
Gabriel R.2.5k wrote:
awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}'  file.fq
ADD COMMENTlink modified 5.1 years ago by Eric Normandeau9.9k • written 5.1 years ago by Gabriel R.2.5k

Thanks but I am a little bit unsure if this works 100% properly... As a test, I had a look at a raw fastq file which I run with FastQC. FastQC claims that all the reads are 100 bp, so I calculated the number of bases by simply multiplying the number of reads by 100. The number I get from doing so is 676,539 bp (994,675,300 ver 993,998,761) bigger than the one from running your awk command on the file. Do you have any idea about the source of this discrepancy?

ADD REPLYlink written 5.1 years ago by DoubleDecker140

I only tested it on a small fastq, try awk '{ if(NR%4==2){print length($0);} }' file.fq

pipe that into uniq -c to check if indeed every seq is 100bp in length

ADD REPLYlink written 5.1 years ago by Gabriel R.2.5k

It is important to mention that this code assumes that each sequence in the fastq file uses EXACTLY 4 lines, otherwise it will not work.

ADD REPLYlink written 5.1 years ago by Eric Normandeau9.9k
1

Isn't this the specification for the FASTQ format?

ADD REPLYlink written 5.1 years ago by Alex Reynolds25k

This may be the reason for my missing bases - I ran the second awk command suggested by Samuel and all my reads are 100 bp long.

ADD REPLYlink written 5.1 years ago by DoubleDecker140

then you should not have any discrepency. Do you have white spaces or something ?

ADD REPLYlink written 5.1 years ago by Gabriel R.2.5k
2
gravatar for Eric Normandeau
5.1 years ago by
Eric Normandeau9.9k
Quebec, Canada
Eric Normandeau9.9k wrote:

I'd probably go with something simple like:

grep -E '^[ACTGN]+$' | perl -pe 's/[[:space:]]//g' | wc -c

The assumption here is that you want to count all characters on all lines that contain only one of ACTG or N.

You can also use fastx_quality_stats from the fastx toolkit. It reports the total number of bases, among other things.

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Eric Normandeau9.9k
3

Hi Eric, if you avoid the extended regex, it is twice faster (on my test file). If also think using the -m option of wc would be better (at least semantically), but it slows down the computation: grep "^[ACGTN]" test.fastq | tr -d "\n" | wc -m

ADD REPLYlink written 5.1 years ago by Frédéric Mahé2.8k

Cool, I never thought about the added cost of using extended regular expressions. However, your code is not as safe if you test only for the first character on the line. Fastq files vary and some may even have the name of the sequences starting with one of ACTGN, in which case you would end up counting the characters of unwanted lines. I tend to go with safer over faster. I find it actually saves a LOT of time down the line.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Eric Normandeau9.9k

You're right. The problem here is the lack of normalization for fastq files. Parsing fastq should be much more straighforward.

ADD REPLYlink written 5.1 years ago by Frédéric Mahé2.8k

Thanks it works. I'd like just add that - since most frequently fastq files are gzipped - zgrep is alternative in such instances.

ADD REPLYlink written 2.1 years ago by boczniak767610

Thanks, Eric - your snippet does the job well with my raw dataset .

ADD REPLYlink written 5.1 years ago by DoubleDecker140
1
gravatar for ramarquezo
3.7 years ago by
ramarquezo10
Australia
ramarquezo10 wrote:

This assumption is not correct because many quality lines start by letter like CGAT or N, then you are adding to the count the characters from the lines of the quality that start with this values, remember that the ASCII code include all the letters of the alphabet!!!! The wak Samuel's script works perfectly, sure that the problem is that maybe you are not including the quality data in your data test or are not included in the correct line, because all the fastq files have each 4 lines the nucleotide sequence....

ADD COMMENTlink written 3.7 years ago by ramarquezo10

Correct; the only safe approach is to require 4-line properly-formatted fastq.  reformat.sh uses this approach, and will additionally verify certain properties like "@" and "#" symbols being in the right places, and the number of bases equaling the number of quality scores, to help ensure the input is valid.

$ reformat.sh in=reads.fq
Input:                          4000000 reads           297852680 bases

Time:                           2.339 seconds.

ADD REPLYlink written 3.7 years ago by Brian Bushnell15k
0
gravatar for billzt
3.1 years ago by
billzt20
Australia
billzt20 wrote:

Please try this: https://github.com/billzt/readfq Smart and Fast!!!

ADD COMMENTlink written 3.1 years ago by billzt20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1056 users visited in the last hour