Question: HTSeq-count high number of no_feature
1
gravatar for cafelumiere12
4.4 years ago by
United States
cafelumiere1270 wrote:

Hi all,

I'm trying to figure out if there's anything wrong with my htseq-count result. I started with a sam file converted from a name-sorted bam file, then use the following options:

python -m HTSeq.scripts.count -m intersection-strict -s no -r name $nsorted_aligned_sam $refGtf

Htseq-count ran fine and showed that there were:

32268229 SAM alignment pairs processed.

When I looked into the count result:

__no_feature    11327775
__ambiguous     7618389
__too_low_aQual 0
__not_aligned   844654
__alignment_not_unique  4974959

 

Is this too high of a number of no_feature? I checked that the gtf file has the same chr formatting as the sam file, and that the stranded option is correct. Am I missing something here?

Any idea is appreciated. Thanks in advance!

 

rna-seq htseq-count • 7.8k views
ADD COMMENTlink written 4.4 years ago by cafelumiere1270

did you check whether your sam file is sorted ?

ADD REPLYlink written 4.4 years ago by Manvendra Singh2.1k

Yes - I name sorted the bam file then converted to sam

ADD REPLYlink written 4.4 years ago by cafelumiere1270

what you get if try with method=union 

ADD REPLYlink written 4.4 years ago by Manvendra Singh2.1k

That's not normal, you're only able to get counts from around a quarter of the alignments. Have a look at the BAM file in IGV. See if you have lots of alignments to intergenic regions and then look at those regions in the UCSC genome browser. It could be that you just have a large amount of tRNA or repeats being expressed. Alternatively, perhaps you just have a bunch of DNA contamination. In any case, looking at the actual alignments would be a good place to start.

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Got it. Thanks a lot!

ADD REPLYlink written 4.4 years ago by cafelumiere1270

So I re-ran HTSeq-count with the following modifications: (1) I used another gtf file (2) -m union. Now I'm getting about 53% of the alignments mapped to feature:

 

__no_feature    8716076
__ambiguous     534206
__too_low_aQual 0
__not_aligned   844654
__alignment_not_unique  4974959

Is this more in the "acceptable " range?

A quick look at the counting result did show a lot less zero counts. 

I still have to look at the bam file in IGV - for some reason it's not showing me anything on IGV- probably something really simple I have to change on the genome browser.

Thanks! 

ADD REPLYlink written 4.4 years ago by cafelumiere1270

That's certainly better. Realistically, the percentage of alignments that are to features will depend on the experiment and organism. If you're not using something like mouse/human/rat then the annotation may not be that great, so you'd expect a considerable number of __no_feature alignments. Given the large change in numbers just by using a different GTF, that may be the case here. In any case, it's always good to have a quick look at things in IGV anyway, just to ensure that there's nothing obvious you missed.

ADD REPLYlink written 4.3 years ago by Devon Ryan90k

Thanks a lot! So I checked the data on IGV. I am not sure if there is any suggested efficient way of checking the alignments. I did some point checking with the counts file I have and it seems to make sense.

The other thing that made me feel better was that I also got an alignment summary which calculates the percentage of reads mapped to intergenic and coding region. This set of samples have an average of about 30% reads mapped to intergenic region.

 

ADD REPLYlink written 4.3 years ago by cafelumiere1270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1243 users visited in the last hour