Entering edit mode
8.3 years ago
cafelumiere12 ▴ 80
I'm trying to figure out if there's anything wrong with my htseq-count result. I started with a sam file converted from a name-sorted bam file, then use the following options:
python -m HTSeq.scripts.count -m intersection-strict -s no -r name $nsorted_aligned_sam $refGtf
Htseq-count ran fine and showed that there were:
32268229 SAM alignment pairs processed.
When I looked into the count result:
__no_feature 11327775 __ambiguous 7618389 __too_low_aQual 0 __not_aligned 844654 __alignment_not_unique 4974959
Is this too high of a number of
no_feature? I checked that the gtf file has the same chr formatting as the sam file, and that the stranded option is correct. Am I missing something here?
Any idea is appreciated. Thanks in advance!
did you check whether your sam file is sorted ?
Yes - I name sorted the bam file then converted to sam
what you get if try with method=union
That's not normal, you're only able to get counts from around a quarter of the alignments. Have a look at the BAM file in IGV. See if you have lots of alignments to intergenic regions and then look at those regions in the UCSC genome browser. It could be that you just have a large amount of tRNA or repeats being expressed. Alternatively, perhaps you just have a bunch of DNA contamination. In any case, looking at the actual alignments would be a good place to start.
Got it. Thanks a lot!
So I re-ran HTSeq-count with the following modifications: (1) I used another gtf file (2)
-m union. Now I'm getting about 53% of the alignments mapped to feature:
Is this more in the "acceptable " range?
A quick look at the counting result did show a lot less zero counts.
I still have to look at the bam file in IGV - for some reason it's not showing me anything on IGV- probably something really simple I have to change on the genome browser.
That's certainly better. Realistically, the percentage of alignments that are to features will depend on the experiment and organism. If you're not using something like mouse/human/rat then the annotation may not be that great, so you'd expect a considerable number of
__no_featurealignments. Given the large change in numbers just by using a different GTF, that may be the case here. In any case, it's always good to have a quick look at things in IGV anyway, just to ensure that there's nothing obvious you missed.
Thanks a lot! So I checked the data on IGV. I am not sure if there is any suggested efficient way of checking the alignments. I did some point checking with the counts file I have and it seems to make sense.
The other thing that made me feel better was that I also got an alignment summary which calculates the percentage of reads mapped to intergenic and coding region. This set of samples have an average of about 30% reads mapped to intergenic region.