Question: Counting reads over genes with summarizeOverlaps
gravatar for concetta
10 weeks ago by
concetta0 wrote:


I am performing the reads counting using the function summarizeOverlaps (GenomicAlignments_1.20.1 and Rsamtools_2.0.3). My RNA-seq libraries are stranded and paired-end. To count the number of reads assigned to the genes I used the following command:

se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)

I also performed the same analysis with HTSeq-count with the option -s reverse and mode unique.

When I checked the counts of some genes I found that the number of counts are different between htseq-count and summarizeOverlaps. For example for two overlapping genes but that are on different strand I obtained the following values:

Gene_name    Strand    SummarizeOverlaps       HTseq-count
Gene_1            -          3929                 80
Gene_2           +          575                   3583

After that I performed again summarizeOverlaps with fragments=FALSE, and I obtained different counts but now the number of counts are similar to HTSeq-count.

 se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=FALSE, ignore.strand=FALSE, preprocess.reads=invertStrand)
    Warning message:
    In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
        28471 alignments with ambiguous pairing were dumped.
        Use 'getDumpedAlignments()' to retrieve them from the dump environment.

   #######counts obtained

    Gene_name    Strand    SummarizeOverlaps
        Gene_1            -          70 
        Gene_2           +          3231

Subsequently, I visualize the data on IGV and I observed few reads were mapped on the genes on reverse strand (gene 1) and many reads on the genes on forward strand (+). So for each gene, I extracted the reads mapped on gene 1 and gene 2 selecting reads with XS=- and XS=+ respectively. The number of reads that I obtained for each gene is in agreement with the number of counts obtained with HTSeq-count and summarizeOverlaps with fragment=FALSE: many reads on gene 2 and few reads on gene 1.

I was wondering if the function does not take into account the strand of the reads when I set fragment=TRUE.

In addition, do you suggest to use htseq-count or summarizeOverlaps?



ADD COMMENTlink written 10 weeks ago by concetta0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1518 users visited in the last hour