I have a metagenomic dataset consisting of paired-end and singleton FASTQ files, generated after host removal and quality filtering. Specifically:
- Forward reads: `*_R1.fastq` or `*_final_clean.1.fastq`
- Reverse reads: `*_R2.fastq` or `*_final_clean.2.fastq`
- Singleton reads: `*_single.fastq` or `*_final_clean_single.fastq`
I would like to calculate:
- Total number of reads
- Average number of reads per sample
My questions are:
- Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?
- How should singleton files be handled in this calculation?
- Is there a recommended tool or approach (e.g., seqkit) for doing this accurately?
- Some .fastq files might be 0 bytes or corrupted. How can I exclude those from the calculation?
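
For reference, here is a minimal Python sketch of one way such a count could be done. It rests on several assumptions that are not established above: the files are uncompressed, strictly 4-line FASTQ; a file counts as "corrupted" if it is unreadable, has a line count that is not a multiple of 4, or has header/separator lines that do not start with `@`/`+`; and a sample's total is taken as R1 + R2 + singletons, with pairs and singletons also reported separately. That last convention is only one possible answer to the R1-versus-R2 question. The function names and the `samples` mapping are hypothetical.

```python
import os


def count_fastq_reads(path):
    """Return the number of records in a plain-text 4-line FASTQ file.

    Returns None if the file is missing, 0 bytes, unreadable, or does not
    look like well-formed 4-line FASTQ (assumed definition of "corrupted").
    """
    n_lines = 0
    try:
        if os.path.getsize(path) == 0:
            return None
        with open(path, "rt") as handle:
            for n_lines, line in enumerate(handle, start=1):
                # Line 1 of each record is the header, line 3 the separator.
                if n_lines % 4 == 1 and not line.startswith("@"):
                    return None
                if n_lines % 4 == 3 and not line.startswith("+"):
                    return None
    except (OSError, UnicodeDecodeError):
        return None
    if n_lines == 0 or n_lines % 4 != 0:
        return None  # truncated or otherwise incomplete file
    return n_lines // 4


def summarise(samples):
    """samples: dict of sample name -> (r1_path, r2_path, single_path).

    Samples whose R1/R2 files are empty, corrupted, or unequal in length
    are skipped; an unusable singleton file counts as 0 singletons.
    """
    per_sample = {}
    for name, (r1, r2, single) in samples.items():
        n1, n2 = count_fastq_reads(r1), count_fastq_reads(r2)
        if n1 is None or n2 is None or n1 != n2:
            continue
        ns = count_fastq_reads(single) or 0
        per_sample[name] = {"pairs": n1, "singletons": ns,
                            "reads": n1 + n2 + ns}
    return per_sample


def total_and_mean(per_sample):
    """Return (total reads, mean reads per retained sample)."""
    total = sum(s["reads"] for s in per_sample.values())
    mean = total / len(per_sample) if per_sample else 0.0
    return total, mean
```

Calling `total_and_mean(summarise(samples))` would then give the two numbers asked for, computed only over samples that passed the checks. Whether to report pairs (the R1 count) or individual reads (R1 + R2) is a reporting choice, so the sketch keeps both values. For a command-line alternative, `seqkit stats` reports a `num_seqs` value per file.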
If you have code or an example, please share it.
Thanks
Please stop adding Bioinformatics and Computational Biology as tags - every post on this site belongs to those categories.