Question: HTSeq not counting all the fragments it should count on a gene
2
gravatar for Macspider
3.0 years ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

Hi guys,

I'm having troubles understanding something which I though I knew well (hard times, these). I always knew that, while Cufflinks returns FPKM, expected counts and other things, HTSeq returns raw counts. This should mean that in the output of HTSeq I should have the counts of the fragments that were mapping on my particular gene. Right?

I am graphically and command-linely assessing the fact that I have indeed around 30 fragments mapping on a gene in the wild type condition and 0 in the treatment, but htseq returns only 1 in the wild type and 0 in the treatment. This gene is knocked-down, so I know it's down-regulated also from wet lab experiments.

EDIT: Three questions:

  1. How does HTSeq handle reads that exceed borders of the considered regions but overlap them? My guess is that it doesn't count them.

  2. I always thought that I understood correctly this: the count should just be a count of the fragments (in case of PE reads), without any normalization, right?

  3. What the hell is going on?

The command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i "ID" alignment.bam gene-set.gff3 > stdout 2> stderr

Thank you in advance for the help!

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Macspider3.0k
1

Are these observed fragments uniquely mapping ones? HtSeq-count only uses uniquely mapped reads (with a certain mapping quality). AFAIK, you can check both with IGV.

ADD REPLYlink written 3.0 years ago by michael.ante3.5k

Checking it RN. Thanks!

ADD REPLYlink written 3.0 years ago by Macspider3.0k

@michael.ante, something strange happens.

I filter the bam files keeping only primary alignments; samtools flagstat reveals, for one example file:

  • 184518 + 0 in total (QC-passed reads + QC-failed reads)
  • 0 + 0 secondary
  • 0 + 0 supplementary
  • 0 + 0 duplicates

However, the output of htseq-count for that same read group, in the "__" lines is:

  • __no_feature 158109
  • __ambiguous 5
  • __too_low_aQual 0
  • __not_aligned 0
  • __alignment_not_unique 24730

Note: 158109+5+24730 = 182844 (99.09% of my reads!!!)

Now I have even more doubts. My htseq command is:

python /usr/bin/htseq-count -f bam -r pos -t exon -i gene_id file.bam gene-set.gtf > counts

The gtf file has CDS and exon lines and proper gene_id field (already checked). I don't know what to think.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Macspider3.0k

I can add further information:

using stranded

  • __no_feature 158109
  • __ambiguous 5
  • __too_low_aQual 0
  • __not_aligned 0
  • __alignment_not_unique 24730

using unstranded

  • __no_feature 54856
  • __ambiguous 909
  • __too_low_aQual 0
  • __not_aligned 0
  • __alignment_not_unique 24730

There is a sensible difference between with and without the stranded option. However, my sequencing library is stranded and I was expecting to keep that information alongside the position. I'm disappointed. Any of you knows why this might happen? Am I missing something basic?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Macspider3.0k
1

The "__alignment_not_unique" are the counts of your multiple alignments (same holds true for samtools flagstat) - don't mix it up with the reads.

Try to use -s reverse e.g. TruSeq stranded generates reads of this flavour.

ADD REPLYlink written 3.0 years ago by michael.ante3.5k

With the -s reverse option I got no significant difference (just a very little one, but not in the direction of an improvement). This, to me, means that I should have ran the analysis with reverse from the beginning, and that also running it without, in this case, doesn't alter much the results.

The strandedness was indeed the problem, and removing it worked just fine. Therefore, thanks for all the help!

ADD REPLYlink written 3.0 years ago by Macspider3.0k
4
gravatar for igor
3.0 years ago by
igor8.9k
United States
igor8.9k wrote:

Have you seen the image on the htseq-count documenation page? http://www-huber.embl.de/HTSeq/doc/count.html

enter image description here

There are three overlap resolution modes. They handle partial overlaps differenly. You may also have reads that are overlapping multiple genes, in which case they may be ignored.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by igor8.9k

Yes, I did see this picture, I chose to use union as a method because it was representing better the data I have. I edited the main question adding a sub-question. Perhaps you know something about it?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Macspider3.0k
2
gravatar for geek_y
3.0 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

whats the output of 'head gene-set.gff3' ?

HTSeq assigns the reads to features even if they partially overlaps with annotations.

You should also check,

  1. If the reads overlap multiple features. See the "expanded" view of annotations in IGV.
  2. Check if the one of the read in PE reads do not overlap the annotations or the mapping qualities might be effecting.
  3. Strandedness of read alignment.

I would suggest you to cross check by running featureCounts program which is better than HTSeq-count in handling PE reads and ambiguous overlaps.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by geek_y10k

The gff3 file is correct (it's my gene set and passed all validations, I know it doesn't have biases or at least not that I'm aware of). It could be the strandedness! Eventhough the library is stranded, there might be some problems I will try without. Thanks!

ADD REPLYlink written 3.0 years ago by Macspider3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2062 users visited in the last hour