I have a metagenomic dataset consisting of paired-end and singleton FASTQ files, generated after host removal and quality filtering. Specifically:

Forward reads: final_clean.1.fastq

Reverse reads: _R2.fastq or _final_clean.2.fastq

Singleton reads: final_clean_single.fastq

I would like to calculate the total number of reads and the average number of reads per sample.

My questions are:

1) Should I count only the forward reads (R1) to represent paired-end read counts, or do I need to include both R1 and R2?

2) How should singleton files be handled in this calculation?

3) Is there a recommended way or tool (e.g., seqkit) to do this accurately while skipping empty or corrupted files?

4) Some .fastq files might be 0 bytes or corrupted. How can I exclude those from the calculation?
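To make the question concrete, here is a minimal Python sketch of one possible approach, not necessarily the right one. It assumes uncompressed, strictly 4-line FASTQ records and a hypothetical layout with one directory per sample containing final_clean.1.fastq, final_clean.2.fastq and final_clean_single.fastq. The helper names (count_fastq_records, summarize) are made up for illustration. It follows the convention that every read counts once, so a pair contributes two reads (R1 + R2) and each singleton contributes one. Counting pairs once (R1 only) would be a different but equally defensible convention, which is part of what I am asking.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional, Tuple


def count_fastq_records(path: Path) -> Optional[int]:
    """Count records in an uncompressed 4-line FASTQ file.

    Returns None if the file is missing, is 0 bytes, or fails a basic
    structural check (header lines must start with '@', separator lines
    with '+', and the line count must be a multiple of 4).
    """
    if not path.is_file() or path.stat().st_size == 0:
        return None
    n_lines = 0
    try:
        with path.open("rb") as handle:
            for i, line in enumerate(handle):
                if i % 4 == 0 and not line.startswith(b"@"):
                    return None  # header line is malformed
                if i % 4 == 2 and not line.startswith(b"+"):
                    return None  # separator line is malformed
                n_lines = i + 1
    except OSError:
        return None  # unreadable file is treated as corrupted
    if n_lines == 0 or n_lines % 4 != 0:
        return None  # truncated record
    return n_lines // 4


def summarize(sample_dirs: Iterable[Path]) -> Tuple[Dict[str, int], int, float]:
    """Return (reads per sample, total reads, mean reads per sample).

    Assumed convention: reads = R1 + R2 + singletons. Samples whose paired
    files are unusable or disagree in record count are skipped entirely.
    """
    per_sample: Dict[str, int] = {}
    for sample_dir in sample_dirs:
        sample_dir = Path(sample_dir)
        r1 = count_fastq_records(sample_dir / "final_clean.1.fastq")
        r2 = count_fastq_records(sample_dir / "final_clean.2.fastq")
        single = count_fastq_records(sample_dir / "final_clean_single.fastq")
        if r1 is None or r2 is None or r1 != r2:
            continue  # paired files unusable or out of sync
        # A missing or empty singleton file simply contributes zero reads.
        per_sample[sample_dir.name] = r1 + r2 + (single or 0)
    total = sum(per_sample.values())
    mean = total / len(per_sample) if per_sample else 0.0
    return per_sample, total, mean
```

Calling summarize on a list of sample directories would give the per-sample counts, the total, and the mean over the samples that passed the checks. The structural check is deliberately shallow: it does not verify that sequence and quality lines have equal length, and it would need adapting for gzip-compressed files.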

If you have code or an example, please share it.

Thanks.
