Question

HTSeq not counting all the fragments it should count on a gene

2

Entering edit mode

7.3 years ago

Matteo Schiavinato ★ 3.6k

Hi guys,

I'm having troubles understanding something which I though I knew well (hard times, these). I always knew that, while Cufflinks returns FPKM, expected counts and other things, HTSeq returns raw counts. This should mean that in the output of HTSeq I should have the counts of the fragments that were mapping on my particular gene. Right?

I am graphically and command-linely assessing the fact that I have indeed around 30 fragments mapping on a gene in the wild type condition and 0 in the treatment, but htseq returns only 1 in the wild type and 0 in the treatment. This gene is knocked-down, so I know it's down-regulated also from wet lab experiments.

EDIT: Three questions:

How does HTSeq handle reads that exceed borders of the considered regions but overlap them? My guess is that it doesn't count them.
I always thought that I understood correctly this: the count should just be a count of the fragments (in case of PE reads), without any normalization, right?
What the hell is going on?

The command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i "ID" alignment.bam gene-set.gff3 > stdout 2> stderr

Thank you in advance for the help!

RNA-Seq HTSeq gene expression counts • 3.6k views

ADD COMMENT • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

1

Entering edit mode

Are these observed fragments uniquely mapping ones? HtSeq-count only uses uniquely mapped reads (with a certain mapping quality). AFAIK, you can check both with IGV.

ADD REPLY • link 7.3 years ago by michael.ante ★ 3.8k

0

Entering edit mode

Checking it RN. Thanks!

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

@michael.ante, something strange happens.

I filter the bam files keeping only primary alignments; samtools flagstat reveals, for one example file:

184518 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates

However, the output of htseq-count for that same read group, in the "__" lines is:

__no_feature 158109
__ambiguous 5
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 24730

Note: 158109+5+24730 = 182844 (99.09% of my reads!!!)

Now I have even more doubts. My htseq command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i gene_id file.bam gene-set.gtf > counts

The gtf file has CDS and exon lines and proper gene_id field (already checked). I don't know what to think.

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I can add further information:

using stranded

__no_feature 158109
__ambiguous 5
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 24730

using unstranded

__no_feature 54856
__ambiguous 909
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 24730

There is a sensible difference between with and without the stranded option. However, my sequencing library is stranded and I was expecting to keep that information alongside the position. I'm disappointed. Any of you knows why this might happen? Am I missing something basic?

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

1

Entering edit mode

The "__alignment_not_unique" are the counts of your multiple alignments (same holds true for samtools flagstat) - don't mix it up with the reads.

Try to use -s reverse e.g. TruSeq stranded generates reads of this flavour.

ADD REPLY • link 7.3 years ago by michael.ante ★ 3.8k

0

Entering edit mode

With the -s reverse option I got no significant difference (just a very little one, but not in the direction of an improvement). This, to me, means that I should have ran the analysis with reverse from the beginning, and that also running it without, in this case, doesn't alter much the results.

The strandedness was indeed the problem, and removing it worked just fine. Therefore, thanks for all the help!

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

score 4 · Answer 1 · 2016-12-16

4

Entering edit mode

7.3 years ago

igor 13k

Have you seen the image on the htseq-count documenation page? http://www-huber.embl.de/HTSeq/doc/count.html

enter image description here

There are three overlap resolution modes. They handle partial overlaps differenly. You may also have reads that are overlapping multiple genes, in which case they may be ignored.

ADD COMMENT • link 7.3 years ago by igor 13k

0

Entering edit mode

Yes, I did see this picture, I chose to use union as a method because it was representing better the data I have. I edited the main question adding a sub-question. Perhaps you know something about it?

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k

score 2 · Answer 2 · 2016-12-16

2

Entering edit mode

7.3 years ago

GouthamAtla 12k

whats the output of 'head gene-set.gff3' ?

HTSeq assigns the reads to features even if they partially overlaps with annotations.

You should also check,

If the reads overlap multiple features. See the "expanded" view of annotations in IGV.
Check if the one of the read in PE reads do not overlap the annotations or the mapping qualities might be effecting.
Strandedness of read alignment.

I would suggest you to cross check by running featureCounts program which is better than HTSeq-count in handling PE reads and ambiguous overlaps.

ADD COMMENT • link 7.3 years ago by GouthamAtla 12k

0

Entering edit mode

The gff3 file is correct (it's my gene set and passed all validations, I know it doesn't have biases or at least not that I'm aware of). It could be the strandedness! Eventhough the library is stranded, there might be some problems I will try without. Thanks!

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.6k