HTSeq not counting all the fragments it should count on a gene
2
2
Entering edit mode
4.8 years ago
Macspider ★ 3.4k

Hi guys,

I'm having troubles understanding something which I though I knew well (hard times, these). I always knew that, while Cufflinks returns FPKM, expected counts and other things, HTSeq returns raw counts. This should mean that in the output of HTSeq I should have the counts of the fragments that were mapping on my particular gene. Right?

I am graphically and command-linely assessing the fact that I have indeed around 30 fragments mapping on a gene in the wild type condition and 0 in the treatment, but htseq returns only 1 in the wild type and 0 in the treatment. This gene is knocked-down, so I know it's down-regulated also from wet lab experiments.

EDIT: Three questions:

1. How does HTSeq handle reads that exceed borders of the considered regions but overlap them? My guess is that it doesn't count them.

2. I always thought that I understood correctly this: the count should just be a count of the fragments (in case of PE reads), without any normalization, right?

3. What the hell is going on?

The command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i "ID" alignment.bam gene-set.gff3 > stdout 2> stderr


Thank you in advance for the help!

RNA-Seq HTSeq gene expression counts • 2.3k views
1
Entering edit mode

Are these observed fragments uniquely mapping ones? HtSeq-count only uses uniquely mapped reads (with a certain mapping quality). AFAIK, you can check both with IGV.

0
Entering edit mode

Checking it RN. Thanks!

0
Entering edit mode

@michael.ante, something strange happens.

I filter the bam files keeping only primary alignments; samtools flagstat reveals, for one example file:

• 0 + 0 secondary
• 0 + 0 supplementary
• 0 + 0 duplicates

However, the output of htseq-count for that same read group, in the "__" lines is:

• __no_feature 158109
• __ambiguous 5
• __too_low_aQual 0
• __not_aligned 0
• __alignment_not_unique 24730

Note: 158109+5+24730 = 182844 (99.09% of my reads!!!)

Now I have even more doubts. My htseq command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i gene_id file.bam gene-set.gtf > counts


The gtf file has CDS and exon lines and proper gene_id field (already checked). I don't know what to think.

0
Entering edit mode

using stranded

• __no_feature 158109
• __ambiguous 5
• __too_low_aQual 0
• __not_aligned 0
• __alignment_not_unique 24730

using unstranded

• __no_feature 54856
• __ambiguous 909
• __too_low_aQual 0
• __not_aligned 0
• __alignment_not_unique 24730

There is a sensible difference between with and without the stranded option. However, my sequencing library is stranded and I was expecting to keep that information alongside the position. I'm disappointed. Any of you knows why this might happen? Am I missing something basic?

1
Entering edit mode

The "__alignment_not_unique" are the counts of your multiple alignments (same holds true for samtools flagstat) - don't mix it up with the reads.

Try to use -s reverse e.g. TruSeq stranded generates reads of this flavour.

0
Entering edit mode

With the -s reverse option I got no significant difference (just a very little one, but not in the direction of an improvement). This, to me, means that I should have ran the analysis with reverse from the beginning, and that also running it without, in this case, doesn't alter much the results.

The strandedness was indeed the problem, and removing it worked just fine. Therefore, thanks for all the help!

4
Entering edit mode
4.8 years ago
igor 12k

Have you seen the image on the htseq-count documenation page? http://www-huber.embl.de/HTSeq/doc/count.html

There are three overlap resolution modes. They handle partial overlaps differenly. You may also have reads that are overlapping multiple genes, in which case they may be ignored.

0
Entering edit mode

Yes, I did see this picture, I chose to use union as a method because it was representing better the data I have. I edited the main question adding a sub-question. Perhaps you know something about it?

2
Entering edit mode
4.8 years ago

whats the output of 'head gene-set.gff3' ?

HTSeq assigns the reads to features even if they partially overlaps with annotations.

You should also check,

1. If the reads overlap multiple features. See the "expanded" view of annotations in IGV.
2. Check if the one of the read in PE reads do not overlap the annotations or the mapping qualities might be effecting.

I would suggest you to cross check by running featureCounts program which is better than HTSeq-count in handling PE reads and ambiguous overlaps.

0
Entering edit mode

The gff3 file is correct (it's my gene set and passed all validations, I know it doesn't have biases or at least not that I'm aware of). It could be the strandedness! Eventhough the library is stranded, there might be some problems I will try without. Thanks!