Question: Counting reads over genes with summarizeOverlaps
gravatar for concetta
7 months ago by
concetta0 wrote:


I am performing the reads counting using the function summarizeOverlaps (GenomicAlignments_1.20.1 and Rsamtools_2.0.3). My RNA-seq libraries are stranded and paired-end. To count the number of reads assigned to the genes I used the following command:

se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)

I also performed the same analysis with HTSeq-count with the option -s reverse and mode unique.

When I checked the counts of some genes I found that the number of counts are different between htseq-count and summarizeOverlaps. For example for two overlapping genes but that are on different strand I obtained the following values:

Gene_name    Strand    SummarizeOverlaps       HTseq-count
Gene_1            -          3929                 80
Gene_2           +          575                   3583

After that I performed again summarizeOverlaps with fragments=FALSE, and I obtained different counts but now the number of counts are similar to HTSeq-count.

 se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=FALSE, ignore.strand=FALSE, preprocess.reads=invertStrand)
    Warning message:
    In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
        28471 alignments with ambiguous pairing were dumped.
        Use 'getDumpedAlignments()' to retrieve them from the dump environment.

   #######counts obtained

    Gene_name    Strand    SummarizeOverlaps
        Gene_1            -          70 
        Gene_2           +          3231

Subsequently, I visualize the data on IGV and I observed few reads were mapped on the genes on reverse strand (gene 1) and many reads on the genes on forward strand (+). So for each gene, I extracted the reads mapped on gene 1 and gene 2 selecting reads with XS=- and XS=+ respectively. The number of reads that I obtained for each gene is in agreement with the number of counts obtained with HTSeq-count and summarizeOverlaps with fragment=FALSE: many reads on gene 2 and few reads on gene 1.

I was wondering if the function does not take into account the strand of the reads when I set fragment=TRUE.

In addition, do you suggest to use htseq-count or summarizeOverlaps?



ADD COMMENTlink written 7 months ago by concetta0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1703 users visited in the last hour