I am performing the reads counting using the function
summarizeOverlaps (GenomicAlignments_1.20.1 and Rsamtools_2.0.3). My RNA-seq libraries are stranded and paired-end. To count the number of reads assigned to the genes I used the following command:
se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)
I also performed the same analysis with
HTSeq-count with the option
-s reverse and
When I checked the counts of some genes I found that the number of counts are different between htseq-count and summarizeOverlaps. For example for two overlapping genes but that are on different strand I obtained the following values:
Gene_name Strand SummarizeOverlaps HTseq-count Gene_1 - 3929 80 Gene_2 + 575 3583
After that I performed again
fragments=FALSE, and I obtained different counts but now the number of counts are similar to HTSeq-count.
se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=FALSE, ignore.strand=FALSE, preprocess.reads=invertStrand) Warning message: In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 28471 alignments with ambiguous pairing were dumped. Use 'getDumpedAlignments()' to retrieve them from the dump environment. #######counts obtained Gene_name Strand SummarizeOverlaps Gene_1 - 70 Gene_2 + 3231
Subsequently, I visualize the data on IGV and I observed few reads were mapped on the genes on reverse strand (gene 1) and many reads on the genes on forward strand (+). So for each gene, I extracted the reads mapped on gene 1 and gene 2 selecting reads with XS=- and XS=+ respectively. The number of reads that I obtained for each gene is in agreement with the number of counts obtained with HTSeq-count and summarizeOverlaps with fragment=FALSE: many reads on gene 2 and few reads on gene 1.
I was wondering if the function does not take into account the strand of the reads when I set fragment=TRUE.
In addition, do you suggest to use htseq-count or summarizeOverlaps?