I am using htseq-count on BAM files from a bacterial species. We are comparing WT strains as well as two strains with genes knocked out. The knockouts have been verified with whole genome sequencing, but do retain the first 15 and last 18 bp of the ORF. I used paired-end Illumina data to align. When using htseq-count, it records counts for these deleted genes. One gene has counts of ~95 (compared to ~6000 for WT) while the other gene has counts of ~28 (compared to ~160 for WT).
Viewing the alignments with BigWig files I see the majority of the gene is lacking all coverage, but the ends of the genes seem have reads that overlap slightly with the gene. I believe they are overlapping with the small amount of the ORF remaining and down/upstream sequence still present in the deletion strains. Previously, when using single-end data, htseq-count did not record any reads for the same deletion strains.
Because of the sequencing analysis and the RNA-seq alignment analysis, I know these genes are not present in the strains, and so the count results are misleading. We can manually adjust the count data to 0 but I am curious to know why we are seeing these count values for deleted genes.
Does anyone have a suggestion as to why we are seeing misleading gene counts with htseq-count for these two deleted genes using paired-end RNA-seq data? It seems misleading and any help on how to fix it would be appreciated.
htseq-count -f bam -s yes <bamfile> <gff>
HTSeq - version 2.0.3;
Python - version 3.10.12; operating system -
CentOS 7; PE reads aligned with
BWA version 0.7.17-r1188