How to calculate total and average read counts from paired-end and singleton FASTQ files?
Asked 8 weeks ago by Sachin

I have a metagenomic dataset consisting of paired-end and singleton FASTQ files, generated after host removal and quality filtering. Specifically:

  • Forward reads: *_R1.fastq or *_final_clean.1.fastq
  • Reverse reads: *_R2.fastq or *_final_clean.2.fastq
  • Singleton reads: *_single.fastq or *_final_clean_single.fastq

I would like to calculate:

  • Total number of reads
  • Average number of reads per sample

My questions are:

  1. Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?
  2. How should singleton files be handled in this calculation?
  3. Is there a recommended way or tool (e.g., seqkit) to do this accurately while skipping empty or corrupted files?
  4. Also, some .fastq files might be 0 bytes or corrupted — how can I avoid including those in the calculation?

If you have code or any example please share with me.

Thanks

Tags: Metagenomics
Comment:

Please stop adding Bioinformatics and Computational Biology as tags - every post on this site belongs to those categories.

Answer by GenoMax, 8 weeks ago

You can use BBTools or seqkit for these types of tasks.

Total number of reads

$ reformat.sh in=ERR072246_1.fastq

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input:                          573106 reads            75076886 bases
Output:                         573106 reads (100.00%)  75076886 bases (100.00%)

OR

$ seqkit stats ERR072246_1.fastq
file               format  type  num_seqs     sum_len  min_len  avg_len  max_len
ERR072246_1.fastq  FASTQ   DNA    573,106  75,076,886      131      131      131
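If neither tool is at hand, plain shell works too, since an uncompressed FASTQ file has exactly 4 lines per read. A minimal sketch (the demo file below is created just for illustration):

```shell
# Create a tiny 2-read demo FASTQ file (illustration only).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo.fastq

# Reads = lines / 4 for uncompressed FASTQ.
reads=$(( $(wc -l < demo.fastq) / 4 ))
echo "$reads"    # prints 2
```

For gzipped files, replace `wc -l < file` with `zcat file | wc -l`.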

Average number of reads per sample

This does not make sense as requested. Is there more than one data file per sample?


Now for the questions

1. Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?

If you have singleton reads left over, then your data are no longer purely paired-end. In that case you should only count reads as paired if both mates survived AND actually match up in BOTH files.

2. How should singleton files be handled in this calculation?

That is up to you, and depends on whether the singletons were used in your downstream analysis.

3. Is there a recommended way or tool (e.g., seqkit) to do this accurately while skipping empty or corrupted files?

You have empty/corrupted files? That sounds like something went wrong during processing (at least for the corrupted files).

4. Also, some .fastq files might be 0 bytes or corrupted — how can I avoid including those in the calculation?

You can find files that are 0 bytes (in the current directory) with something like:

$ find . -type f -size 0

Make a list and exclude those files (or remove them from the data directory).
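Putting that together, here is a minimal shell sketch (awk only, no extra tools; the file names below are made up for the demo) that skips 0-byte files and sums read counts:

```shell
# Demo files: two non-empty FASTQ files and one 0-byte file to skip.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > sampleA_R1.fastq
printf '@r1\nACGT\n+\nIIII\n' > sampleA_single.fastq
: > sampleB_R1.fastq            # 0 bytes -- must be excluded

total=0
nfiles=0
# Note: this word-splits on whitespace, so it assumes no spaces in file names.
for f in $(find . -maxdepth 1 -type f -name '*.fastq' ! -size 0); do
    n=$(awk 'END { print NR/4 }' "$f")   # 4 lines per read
    total=$((total + n))
    nfiles=$((nfiles + 1))
done

echo "total reads across $nfiles non-empty files: $total"
```

The `! -size 0` test is what excludes the empty files; truly corrupted (but non-empty) files would still need a validity check, e.g. `seqkit stats` exiting non-zero.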

Comment:

Thank you for your reply. I have 540 FASTQ files plus 270 singleton FASTQ files. That's why I need to calculate the average per reads.

Reply:

That's why I need to calculate the average per reads.

What does "average per reads" mean? It is not clear what you are averaging over (the entire set of files?), but you could simply use a for loop with something simple like the answer in this thread to get the read number for each file, and then do whatever calculation you want after that --> How to count fastq reads
