How to calculate total and average read counts from paired-end and singleton FASTQ files?
Asked 8 weeks ago by Sachin

I have a metagenomic dataset consisting of paired-end and singleton FASTQ files, generated after host removal and quality filtering. Specifically:

  • Forward reads: *_R1.fastq or *_final_clean.1.fastq
  • Reverse reads: *_R2.fastq or *_final_clean.2.fastq
  • Singleton reads: *_single.fastq or *_final_clean_single.fastq

I would like to calculate:

  • Total number of reads
  • Average number of reads per sample

My questions are:

  1. Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?
  2. How should singleton files be handled in this calculation?
  3. Is there a recommended way or tool (e.g., seqkit) to do this accurately while skipping empty or corrupted files?
  4. Also, some .fastq files might be 0 bytes or corrupted — how can I avoid including those in the calculation?

If you have code or any example please share with me.

Thanks

Tags: Metagenomics
Comment:

Please stop adding Bioinformatics and Computational Biology as tags - every post on this site belongs to those categories.

Answer by GenoMax, 8 weeks ago

You can use BBTools or seqkit for these types of tasks.

Total number of reads

$ reformat.sh in=ERR072246_1.fastq

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input:                          573106 reads            75076886 bases
Output:                         573106 reads (100.00%)  75076886 bases (100.00%)

OR

$ seqkit stats ERR072246_1.fastq
file               format  type  num_seqs     sum_len  min_len  avg_len  max_len
ERR072246_1.fastq  FASTQ   DNA    573,106  75,076,886      131      131      131
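If neither tool is at hand, plain shell works too, since an uncompressed FASTQ file has exactly 4 lines per read. A minimal sketch (the demo file below is created just for illustration):

```shell
# Create a tiny 2-read demo FASTQ file (illustration only).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo.fastq

# Reads = lines / 4 for uncompressed FASTQ.
reads=$(( $(wc -l < demo.fastq) / 4 ))
echo "$reads"    # prints 2
```

For gzipped files, replace `wc -l < file` with `zcat file | wc -l`.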

Average number of reads per sample

This does not make sense as requested. Is there more than one data file per sample?


Now for the questions

1. Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?

If you have singleton reads left over, then your data are no longer purely paired-end. In that case you should only count reads as paired if both mates survived AND actually match up in BOTH files.

2. How should singleton files be handled in this calculation?

That is up to you, and depends on whether the singletons were used in your downstream analysis.

3. Is there a recommended way or tool (e.g., seqkit) to do this accurately while skipping empty or corrupted files?

You have empty/corrupted files? That sounds like something went wrong during processing (at least for the corrupted files).

4. Also, some .fastq files might be 0 bytes or corrupted — how can I avoid including those in the calculation?

You can find files that are 0 bytes (in the current directory) with something like:

$ find . -type f -size 0

Make a list and exclude those files (or remove them from the data directory).
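Putting that together, here is a minimal shell sketch (awk only, no extra tools; the file names below are made up for the demo) that skips 0-byte files and sums read counts:

```shell
# Demo files: two non-empty FASTQ files and one 0-byte file to skip.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > sampleA_R1.fastq
printf '@r1\nACGT\n+\nIIII\n' > sampleA_single.fastq
: > sampleB_R1.fastq            # 0 bytes -- must be excluded

total=0
nfiles=0
# Note: this word-splits on whitespace, so it assumes no spaces in file names.
for f in $(find . -maxdepth 1 -type f -name '*.fastq' ! -size 0); do
    n=$(awk 'END { print NR/4 }' "$f")   # 4 lines per read
    total=$((total + n))
    nfiles=$((nfiles + 1))
done

echo "total reads across $nfiles non-empty files: $total"
```

The `! -size 0` test is what excludes the empty files; truly corrupted (but non-empty) files would still need a validity check, e.g. `seqkit stats` exiting non-zero.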

Comment:

Thank you for your reply. I have 540 FASTQ files plus 270 singleton FASTQ files. That's why I need to calculate the average per reads.

Reply:

That's why I need to calculate the average per reads.

What does "average per reads" mean? It is not clear what you are averaging over (the entire set of files?), but you could simply use a for loop with something simple like the answer in this thread to get the read number for each file, and then do whatever calculation you want after that --> How to count fastq reads
