Question

What Situations Fit The Different Overlap Resolution Modes Of Htseq-Count/Summarizeoverlaps?

3


12.0 years ago

Jeremy Leipzig 22k

Assigning reads to features is an important part of RNA-Seq. Because Simon Anders's work in this area has now been implemented in both Python and R, it might be time to get a better understanding of this chart.

Can someone describe situations in which someone might choose a certain overlap resolution mode (or even roll their own) and why?

[Image: chart of the htseq-count overlap resolution modes]

rna-seq • 3.1k views

updated 12.0 years ago by Ryan Dale 5.0k • written 12.0 years ago by Jeremy Leipzig 22k

score 5 · Answer 1 · 2012-05-07

I suppose it depends on how correct you assume your gene models to be. I tend to assume the gene models I use are not completely correct, so I use union mode for these reasons:

Accepting some "slop" around the gene (row 2 in the table you posted) allows for things like mis-annotated TSSs
Accepting cases like row 3 allows detection of unannotated isoforms or unspliced transcripts
The same assumption that the gene models are not totally accurate means that in the second-to-last row, it's possible that gene_B extends further into the read, which would make it a truly ambiguous read

That said, if you are interested in detecting isoform-specific expression of annotated isoforms, then intersection_strict would probably be needed over union, and maybe intersection_nonempty if you don't care about the last point in the list above.
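To make the difference between the three modes concrete, here is a minimal pure-Python sketch of the resolution rule as I understand it from the HTSeq documentation: for each base of the read, collect the set of features covering it, then combine those per-base sets by union, by intersection, or by intersection of only the non-empty sets. The function names (`feature_sets`, `resolve`), the toy annotation and the coordinates are made up for illustration, and this is not HTSeq's actual code. Note that the htseq-count command line spells the modes with hyphens (`intersection-strict`, `intersection-nonempty`), while summarizeOverlaps calls them `IntersectionStrict` and `IntersectionNotEmpty`.

```python
def feature_sets(read, annotation):
    """Return one set of gene names per base of the read.

    read: (start, end) half-open interval (hypothetical, ungapped read).
    annotation: dict of gene name -> list of (start, end) half-open exons.
    """
    start, end = read
    return [
        {gene for gene, exons in annotation.items()
         if any(s <= pos < e for s, e in exons)}
        for pos in range(start, end)
    ]


def resolve(read, annotation, mode="union"):
    """Assign a read to a gene, 'no_feature' or 'ambiguous'."""
    per_base = feature_sets(read, annotation)
    if not per_base:  # zero-length read: nothing to assign
        return "no_feature"
    if mode == "union":
        candidates = set().union(*per_base)
    elif mode == "intersection_strict":
        candidates = set.intersection(*per_base)
    elif mode == "intersection_nonempty":
        nonempty = [s for s in per_base if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# Toy annotation: gene_A has two exons, gene_B overlaps the end of gene_A.
annotation = {
    "gene_A": [(100, 200), (300, 400)],
    "gene_B": [(380, 500)],
}
overhang_read = (190, 210)  # hangs off the end of an exon ("slop" case)
nested_read = (350, 390)    # fully in gene_A, partly in gene_B

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, resolve(overhang_read, annotation, mode),
          resolve(nested_read, annotation, mode))
```

With this toy data, union keeps the overhanging read but calls the nested read ambiguous, intersection_strict does the opposite (it drops the overhanging read and assigns the nested one to gene_A), and intersection_nonempty assigns both to gene_A.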

If you suspect there may be substantial DNA contamination in your RNA-seq data, it's possible that cases like row 3 will erroneously assign DNA reads to the gene. The easy fix would be to switch to intersection_strict mode. If you wanted to keep the rest of the union mode logic though, I think you'd have to roll your own mode that keeps track of which bases in the read overlap a gene.
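As a rough idea of what such a roll-your-own mode could look like, here is a small pure-Python sketch. It keeps the union rule (any overlap with two genes is ambiguous) but counts how many bases of the read actually fall inside annotated exons, and rejects reads below a threshold. Everything here is an assumption for illustration: the function name `resolve_union_min_overlap`, the `min_fraction` parameter and its default of 0.9, and the simplification that a read is a single ungapped interval (a real implementation would need to handle spliced alignments).

```python
def resolve_union_min_overlap(read, annotation, min_fraction=0.9):
    """Union-style assignment that also requires most of the read to be exonic.

    read: (start, end) half-open interval (hypothetical, ungapped read).
    annotation: dict of gene name -> list of (start, end) half-open exons.
    min_fraction: hypothetical cutoff on the fraction of read bases
        that must lie inside exons of the assigned gene.
    """
    start, end = read
    length = end - start
    if length <= 0:
        return "no_feature"

    # Count, per gene, how many bases of the read fall inside its exons.
    covered = {}
    for pos in range(start, end):
        for gene, exons in annotation.items():
            if any(s <= pos < e for s, e in exons):
                covered[gene] = covered.get(gene, 0) + 1

    if not covered:
        return "no_feature"
    if len(covered) > 1:  # same rule as union mode
        return "ambiguous"

    gene, n_bases = next(iter(covered.items()))
    if n_bases / length < min_fraction:
        return "no_feature"  # mostly intronic/intergenic: possible DNA read
    return gene


annotation = {"gene_A": [(100, 200), (300, 400)]}
print(resolve_union_min_overlap((150, 205), annotation))  # small overhang
print(resolve_union_min_overlap((195, 245), annotation))  # mostly intronic
```

The first read has 50 of its 55 bases inside an exon and is still assigned to gene_A, so a little slop around the gene model is tolerated. The second has only 5 of 50 exonic bases and is reported as no_feature, which is the kind of read that plain union mode would have counted.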