I'm analyzing paired-end expression libraries of E. coli and wanted to get some advise about how to count the number of reads for each gene. After mapping the reads to the genome I use htseq-count to generate a list of reads per gene. So far so good. The problem is that when using paired-end reads the entire fragment between the reads is taken into consideration as should be, but when a fragment overlaps two genes it's tagged as ambiguous and not counted for either of the genes. This could be a major problem in the case of small genes which are a part of a larger operon, and in adittion throws away about 5%-10% of the mapped reads.
I changed the script of htseq-count and added the option to count fragments that overlap more than one gene, for each of these genes. This significantly increased the number of reads per gene, especially for the reads that previously had a low number of reads.
My question is, is this okay? should I take it into consideration when analyzing DE genes? Obviously normalizing by total counts won't be applicable but I use DESeq so this shouldn't be an issue. Has anyone encountered this problem? have you had a better solution?