Question

Counting reads over genes with summarizeOverlaps

0

Entering edit mode

3.8 years ago

concetta ▴ 10

Hi!

I am performing the reads counting using the function summarizeOverlaps (GenomicAlignments_1.20.1 and Rsamtools_2.0.3). My RNA-seq libraries are stranded and paired-end. To count the number of reads assigned to the genes I used the following command:

se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)

I also performed the same analysis with HTSeq-count with the option -s reverse and mode unique.

When I checked the counts of some genes I found that the number of counts are different between htseq-count and summarizeOverlaps. For example for two overlapping genes but that are on different strand I obtained the following values:

Gene_name    Strand    SummarizeOverlaps       HTseq-count
Gene_1            -          3929                 80
Gene_2           +          575                   3583

After that I performed again summarizeOverlaps with fragments=FALSE, and I obtained different counts but now the number of counts are similar to HTSeq-count.

 se <- summarizeOverlaps(features=exonsByGene, reads=bamfiles, mode="Union", singleEnd=FALSE, fragments=FALSE, ignore.strand=FALSE, preprocess.reads=invertStrand)
    Warning message:
    In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
        28471 alignments with ambiguous pairing were dumped.
        Use 'getDumpedAlignments()' to retrieve them from the dump environment.

   #######counts obtained

    Gene_name    Strand    SummarizeOverlaps
        Gene_1            -          70 
        Gene_2           +          3231

Subsequently, I visualize the data on IGV and I observed few reads were mapped on the genes on reverse strand (gene 1) and many reads on the genes on forward strand (+). So for each gene, I extracted the reads mapped on gene 1 and gene 2 selecting reads with XS=- and XS=+ respectively. The number of reads that I obtained for each gene is in agreement with the number of counts obtained with HTSeq-count and summarizeOverlaps with fragment=FALSE: many reads on gene 2 and few reads on gene 1.

I was wondering if the function does not take into account the strand of the reads when I set fragment=TRUE.

In addition, do you suggest to use htseq-count or summarizeOverlaps?

Thanks!

Concetta

RNA-Seq summarizeOverlaps HTSeq-count • 1.3k views

ADD COMMENT • link 3.8 years ago by concetta ▴ 10