Hi,
I'm analyzing paired-end expression libraries of E. coli and wanted to get some advise about how to count the number of reads for each gene. After mapping the reads to the genome I use htseq-count to generate a list of reads per gene. So far so good. The problem is that when using paired-end reads the entire fragment between the reads is taken into consideration as should be, but when a fragment overlaps two genes it's tagged as ambiguous and not counted for either of the genes. This could be a major problem in the case of small genes which are a part of a larger operon, and in adittion throws away about 5%-10% of the mapped reads.
I changed the script of htseq-count and added the option to count fragments that overlap more than one gene, for each of these genes. This significantly increased the number of reads per gene, especially for the reads that previously had a low number of reads.
My question is, is this okay? should I take it into consideration when analyzing DE genes? Obviously normalizing by total counts won't be applicable but I use DESeq so this shouldn't be an issue. Has anyone encountered this problem? have you had a better solution?
Thanks
Thanks Divon,
I used the default union mode, the dataset is strand-specific.
I took a look at featureCounts, it does have the -O option for counting reads for more than one feature but the fragment between the two reads is not counted so in the rare cases where there is a small gene surrounded by two other genes in the same transcript this gene will usually not be counted.
About eXpress, it seems that the main focus of this tool is to distinguish between alternative transcripts of the same gene. The problem is that the transcripts of E. coli are unannotated, only the ORFs. I'm afraid that it won't be able to deal with fragments that overlap some of the feature or several features, I didn't find any reference for this issue in the website or paper.
Thanks again
Do the bacteria you're working on use splicing? If so, I wouldn't count the fragment between the paired-ends. If not, then there's probably not much else you can do other than lose some data.
No splicing, I'm confident using the fragment between the reads. The major problem is not loosing reads but underestimating short genes.
I guess it becomes a question of whether things like this even really constitute different genes then, or whether it makes more sense to think of them as different transcripts of the same gene.
After looking at the code of htseq-count, it only treats the reads, not the entire fragment. I'll need to write my own script.