I have a metagenomic dataset consisting of paired-end and singleton FASTQ files, generated after host removal and quality filtering. Specifically:
Forward reads: final_clean.1.fastq
Reverse reads: _R2.fastq or _final_clean.2.fastq
Singleton reads: final_clean_single.fastq
I would like to calculate the total number of reads and the average number of reads per sample.
My questions are:
1) Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?
2) How should singleton files be handled in this calculation?
3) Is there a recommended tool or method (e.g., seqkit) for counting reads accurately?
4) Some .fastq files might be 0 bytes or corrupted. How can I exclude those from the calculation?
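To make questions 3 and 4 concrete, here is a minimal Python sketch of the kind of counting and filtering being asked about. It is an illustration under stated assumptions, not a recommended method: it assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequences), and it treats a file as corrupted if it is missing, 0 bytes, not ASCII, has a line count that is not a multiple of four, or has a header or separator line that does not start with `@` or `+`. The names `count_fastq_reads` and `summarise` are hypothetical. Which files to pass for each sample (R1 only, R1 plus R2, with or without singletons) is deliberately left to the caller, since that is the open question in 1) and 2).

```python
import os


def count_fastq_reads(path):
    """Return the number of records in an uncompressed 4-line FASTQ file.

    Returns None if the file is missing, 0 bytes, or looks malformed.
    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return None  # missing or empty file
    n_lines = 0
    try:
        with open(path, "r", encoding="ascii") as handle:
            for n_lines, line in enumerate(handle, start=1):
                position = n_lines % 4
                if position == 1 and not line.startswith("@"):
                    return None  # header line must start with '@'
                if position == 3 and not line.startswith("+"):
                    return None  # separator line must start with '+'
    except (OSError, UnicodeDecodeError):
        return None  # unreadable or non-ASCII content
    if n_lines == 0 or n_lines % 4 != 0:
        return None  # truncated record
    return n_lines // 4


def summarise(samples):
    """Compute total and mean read counts over samples.

    samples: dict mapping a sample name to the list of FASTQ paths that
    should be counted for that sample. Files that fail validation are
    skipped; samples with no valid file are left out of the mean.
    Returns (total_reads, mean_reads_per_sample, per_sample_counts).
    """
    per_sample = {}
    for name, paths in samples.items():
        counts = [count_fastq_reads(p) for p in paths]
        valid = [c for c in counts if c is not None]
        if valid:
            per_sample[name] = sum(valid)
    total = sum(per_sample.values())
    mean = total / len(per_sample) if per_sample else 0.0
    return total, mean, per_sample
```

For example, `summarise({"sampleA": ["sampleA_final_clean.1.fastq", "sampleA_final_clean_single.fastq"]})` would count R1 plus singletons for a hypothetical sample named `sampleA`. Note that a sample whose files are all invalid is dropped from the average rather than counted as zero, which may or may not be the right choice.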
If you have code or an example, please share it.
Thanks.