How to interpret the differences of read counts using htseq and feature counts?
1
1
Entering edit mode
15 months ago
2822462298 ▴ 60

Hi all,

To make it simple. I tried both htseq and featurecounts to generate read counts for differential expressed gene analyisis using deseq2. Here are example outputs for one of my samples:

1. Htseq (input bam file sorted by name using sortmeRNA) code: htseq-count -f bam -i Name ${i}_final_byname.bam B_anynana_v2.gff >${i}_count_byname.txt

.

result (first 10 rows):
BANY.1.2.t00001 74
BANY.1.2.t00002 0
BANY.1.2.t00003 4
BANY.1.2.t00004 1
BANY.1.2.t00005 0
BANY.1.2.t00007 4
BANY.1.2.t00009 25
BANY.1.2.t00010 302
BANY.1.2.t00011 2
BANY.1.2.t00012 32


.

summary:
__no_feature    22888655
__ambiguous 31338
__too_low_aQual 2391179
__not_aligned   3147296
__alignment_not_unique  739779

1. featurecounts (input bam file sorted by position using sortmeRNA) code: featureCounts -T 12 -p -g Name -a B_anynana_v2.gff -o ${i}_featurecounts_position.txt${i}_final.bam

.

result (first 10 rows, only gene names and counts are shown here)
BANY.1.2.t00001 846
BANY.1.2.t00002 0
BANY.1.2.t00003 65
BANY.1.2.t00004 3
BANY.1.2.t00005 0
BANY.1.2.t00007 185
BANY.1.2.t00009 446
BANY.1.2.t00010 2459
BANY.1.2.t00011 31
BANY.1.2.t00012 568


.

summary:
Assigned    18226666
Unassigned_Unmapped 3147296
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 1971565
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   8481719
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    130189


As I can see they both have the same number of reads that are not aligned so I assume both went fine. However, I still have some questions about this...

1. Why feature counts gave me significant higher counts for each gene, is it because of the multi-threaded method I used?
2. During DE gene analysis using deseq2, I got a lot of genes being filtered out due to the relatively low counts in htseq counted files. And eventually, I got totally different lists of differential expressed genes for htseq and featurecount...which result should I use and trust?

I'm new to RNA-seq analysis. Valuable suggestions are highly appreciated.

RNA-Seq rna-seq next-gen htseq featurecounts • 757 views
1
Entering edit mode

can you try changing parameters for s option with htseq ? Default for s (strandedness) is yes. @ 2822462298

0
Entering edit mode

See Comparison Htseq And Feature Count. The differences you report are surprisingly large and suggest a difference in the basic settings.

The featureCount results will be unchanged by the number of threads used, so multi-threading is not an issue.

0
Entering edit mode

I have read that post but I still did not find an explanation for the big gap between the two programs, I know it has something to do with my basic setting but I really don't know what's wrong with it since I did read the manuals very carefully...

0
Entering edit mode

Just to be sure, please show the output of samtools view -H your.bam | head -n 1). This should confirm if indeed name-sorted.

0
Entering edit mode

Just checked for all my unsorted, name-sorted, and position-sorted files, it shows

@HD VN:1.0  SO:unsorted
@HD VN:1.0  SO:queryname
@HD VN:1.0  SO:coordinate


respectively. I guess there is nothing wrong with the sorting.

0
Entering edit mode

Any ideas guys? :(((

0
Entering edit mode

Biostars, like most open science forums, is powered by volunteers who offer their time and expertise by volition and not by obligation. So, we cannot guarantee you an answer within a time period. We respond quickly, especially to questions posted on weekdays in the day.

0
Entering edit mode

Hi genomax, I understand and I really appreciate all the suggestions provided. I apologize if I may sound a little bit impatient.

0
Entering edit mode

This looks odd indeed. Could you try to run featurecounts again with the -B option to include only complete pairs? Pragmatically, I would still say, go with featureCounts then if it works for you.

0
Entering edit mode

Thanks Michael, I tried including the -B option but the counts only reduced a little, they are still much higher than htseq.

0
Entering edit mode
15 months ago
michael.ante ★ 3.6k

Hi ,

You are using the default values for quality filtering. Leading to different levels of stringency:

__too_low_aQual 2391179

vs

Unassigned_MappingQuality 0

The default mapping quality's minimum is 10 for htseq-count and 0 for featureCounts.

Using the same quality filter value should make your results more comparable. Other than that, please look at Gordon's link.

Regarding your second question, DESeq2 performs an "independent filtering step" searching for that "low count" threshold maximizing the amount of DE-reported genes. Having two different read sets will lead to two different "low counts" filtering in DESeq2.

Cheers,

Michael

0
Entering edit mode

Thanks Michael, I tried including -Q 10 in featurecounts and got "Unassigned_MappingQuality 27689", and the counts remain almost unchanged. Apparently, htseq somehow filtered out more reads other than quality filter.