Entering edit mode
7.5 years ago
gudraephouto
▴
10
After counting reads, among 20 million reads for each sample, about only 10 million of reads are counted. Is it logical? This is my pipeline:
* 1-Fastq to bam (HiSat, hg38):
20027937 reads; of these:
20027937 (100.00%) were unpaired; of these:
3671767 (18.33%) aligned 0 times
11149914 (55.67%) aligned exactly 1 time
5206256 (25.99%) aligned >1 times
81.67% overall alignment rate
[bam_sort_core] merging from 12
* 2-Bam to read counts (Htseq, ensemble V25)
__no_feature 1817050
__ambiguous 1350078
__too_low_aQual 1378909
__not_aligned 3671767
__alignment_not_unique 3308485
I'm confused, all your reads are accounted for no?
3671767 + 11149914 + 5206256 = 20027937
Do you mean why do only 11149914 (55%) align exactly 1 time?
Not accounted but counted as in HTseq count?
HTseq-count result = 11526289
No htseq count result is 10522274
No, I mean why htseq could count only half of my reads: total number of read counts=10522274
There is no an "assigned" category in your HTseqCount output (no_feature, ambiguous, too_low_aQual, not_aligned alignment_not_unique).
So the difference is made of the reads that are actually assigned to genes.
Exactly... My question is if it is alright that only 10522274 of 20027937 are assigned to genes. Isn't that low for further analysis?
Not necessarily. Since you have a number of reads (25%) that are multimapping they are not being counted.