After aligning paired-end 100 bp reads to a reference genome, I am getting a very low properly paired percentage. This is the samtools flagstat output for one affected sample:
369208441 0 total (QC-passed reads + QC-failed reads)
8985531 0 secondary
289733341 0 mapped
78.47% N/A mapped %
360222910 0 paired in sequencing
180111455 0 read1
180111455 0 read2
1393338 0 properly paired
0.39% N/A properly paired %
280747810 0 with itself and mate mapped
0 0 singletons
0.00% N/A singletons %
39590468 0 with mate mapped to a different chr
0 0 with mate mapped to a different chr (mapQ>=5)
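To put the counts above in proportion, here is a minimal Python sketch that recomputes the key ratios from the flagstat lines. The helper names (`parse_flagstat`, `pct`) are made up for illustration, and the parser assumes the three-column layout shown above (QC-passed count, QC-failed count, label). It is not part of samtools.

```python
# Flagstat lines copied from the output above (QC-passed, QC-failed, label).
FLAGSTAT = """\
369208441 0 total (QC-passed reads + QC-failed reads)
8985531 0 secondary
289733341 0 mapped
78.47% N/A mapped %
360222910 0 paired in sequencing
180111455 0 read1
180111455 0 read2
1393338 0 properly paired
0.39% N/A properly paired %
280747810 0 with itself and mate mapped
0 0 singletons
0.00% N/A singletons %
39590468 0 with mate mapped to a different chr
0 0 with mate mapped to a different chr (mapQ>=5)
"""


def parse_flagstat(text):
    """Return {label: QC-passed count}, skipping the percentage rows."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(None, 2)  # split on tabs or spaces, keep label whole
        if len(fields) != 3:
            continue
        passed, _failed, label = fields
        if passed.endswith("%"):
            continue  # percentage rows are derived values, not counts
        counts[label] = int(passed)
    return counts


def pct(numerator, denominator):
    """Percentage, or NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


counts = parse_flagstat(FLAGSTAT)
paired = counts["paired in sequencing"]
both_mapped = counts["with itself and mate mapped"]

print(f"properly paired / paired:       {pct(counts['properly paired'], paired):.2f}%")
print(f"both mates mapped / paired:     {pct(both_mapped, paired):.2f}%")
print(f"mate on different chr / mapped: "
      f"{pct(counts['with mate mapped to a different chr'], both_mapped):.2f}%")
```

Laid out this way, the numbers show that most pairs have both mates mapped, yet almost none carry the properly paired flag. The aligner sets that flag based on mate orientation and insert size, so the mates are being placed but not where a pair would be expected.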
I followed the GATK Best Practices to align paired-end short-read data to a reference genome. I downloaded the short-read data from NCBI SRA as FASTQ files using the SRA Toolkit's fastq-dump, converted the FASTQ files into an unmapped BAM using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq into bwa mem, and the bwa mem output into Picard MergeBamAlignment. To get stats on the alignment, I used samtools flagstat. For several of my samples, the alignment went well (90% mapped, 80% properly paired). However, for a couple of my samples, the properly paired percentage was well below 1%. I'm wondering how a sample can have a normal fraction of reads mapping (~78%) while only 0.39% of its reads are properly paired.
I have double-checked that my fastq files from fastq-dump have identical read counts, and that they are properly interleaved after Picard FastqToSam. I additionally ran Picard ValidateSamFile to troubleshoot the file output by MergeBamAlignment and found no errors.
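Identical read counts do not by themselves show that record N in one file is the mate of record N in the other, so a name-by-name comparison is a useful extra check. The sketch below is one way to do it in Python. The function names are hypothetical, and it assumes 4-line FASTQ records whose names match once an optional trailing `/1` or `/2` and anything after the first whitespace are removed, which may not hold for every fastq-dump header style.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the normalized read name of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue  # only the first line of each record holds the name
        fields = line.split()
        name = fields[0] if fields else ""
        if name.startswith("@"):
            name = name[1:]
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # assumed mate suffix convention
        yield name


def first_mismatch(r1_handle, r2_handle):
    """Return (record_index, name1, name2) for the first unmatched pair, else None."""
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None


# Tiny in-memory demonstration with made-up reads.
r1 = "@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n"
r2_in_order = "@r1/2\nTGCA\n+\nIIII\n@r2/2\nTGCA\n+\nIIII\n"
r2_shuffled = "@r2/2\nTGCA\n+\nIIII\n@r1/2\nTGCA\n+\nIIII\n"

print(first_mismatch(io.StringIO(r1), io.StringIO(r2_in_order)))
print(first_mismatch(io.StringIO(r1), io.StringIO(r2_shuffled)))
```

For real data, the two `io.StringIO` objects would be replaced with open file handles (for example via `gzip.open(path, "rt")` for compressed files).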