Question: HTSeq-count [Python] is not properly counting reads like Genomic Alignments [R] ?
gravatar for 乙
2.3 years ago by
160 wrote:


I am performing RNA-seq using DESeq2 that requires un-normalized raw reads data as input from the BAM files. In order to simplify my problem, I extracted the ERG gene from the BAM file (sampleA.bam sorted by position) as follow:

  • Find the position ERG on the genome from GENCODE:

chr21 HAVANA gene 39751949 40033704 . - . gene_id "ENSG00000157554.18_3"; gene_type "protein_coding"; gene_name "ERG"; level 2; havana_gene "OTTHUMG00000090767.3_3"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";

  • Extract the region from the BAM:

    samtools view -h sampleA.bam "chr21:39751949-40033704" > ERG_sampleA.sam

  • Convert SAM to BAM

    samtools view -Sb ERG_sampleA.sam > ERG_sampleA.bam

  • Display on IGV with the counts to verify:


Zoom in:


So, we can expect to obtain reads when we count using HTSeq-count [Python] and GenomicAlignements [R]. PS: as per definition, in the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons.

Counting reads with HTSeq-count with Python

The following command is issued to count the raw reads of sampleA:

htseq-count -f bam -m union -i gene_id -r pos -s yes ERG_sampleA.bam gencode.v28lift37.annotation.gtf > erg.txt

and this is the following output (removed all IDs that had 0 reads):

ENSG00000157554.18_3    28              <- ERG gene
ENSG00000231480.1_3 275                 <- SNRPGP13 gene in ERG region
__no_feature    48474
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  105

Counting with GenomicAlignements with R:

genomeAnnot <- "gencode.v28lift37.annotation.gtf"
txdb <- makeTxDbFromGFF(genomeAnnot, format = "gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
sampleA <- "ERG_sampleA.bam"
params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isPaired = T, isProperPair = T, isUnmappedQuery = F, hasUnmappedMate = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F), mapqFilter = 10)
countingReads <- summarizeOverlaps(features=ebg, reads=sampleA, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=F, param = params)

and this is the following output:


One expects that we obtain similar number of counted reads between the two methods. So, I would highly appreciate if someone can give me an explanation on why there are such differences. Thank you in advance.

Other things I tried:

  • Downloaded from TCGA, for the same patient, the BAM file and the associated HTSeq-count file. I re-did the counting using the same files and using the same command provided by TCGA, I obtained exactly same result.
  • Changed the annotation file and used the ENSEMBL but was having 0 reads
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by 160

try adding --stranded=no to your htseq command and also in R, you are filtering reads by q-value (10). You can also try removing it when you are comparing.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011214k

I was trying to reproduce the your observations with commands you have furnished with a sample file I have. I made following changes:

  1. I looked for RPL3 gene instead of gene mentioned in OP
  2. I added --stranded=no to htseq function.
  3. Removed few params:

    params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isUnmappedQuery = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F))

Results are below. Please note that results exactly do not match. With few more param optimizations one can get matching numbers:

delete your account

how to delete account

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011214k

Thanks for the reply !

Your comment has solved my problem and as you have said, turning off stranded mode, I will obtain more reads counted on the gene. So, I would like to ask you why does this option outputs very few reads when my samples are sequenced in paired-end and stranded ?

Thanks a lot @cpad0112

ADD REPLYlink written 2.3 years ago by 160

copy/pasted from what is difference between strand specific in rnaseq analysis from htseq-count (upvote OP: Devon Ryan ) :

Which setting to use depends on how the sequencing libraries were prepared, though likely one of no or reverse are correct for anything sequenced in the past few years. no is appropriate for datasets that are not strand-specific. reverse is correct of libraries made with a dUTP-based method (this is the most common method). yes is appropriate for older datasets where read #1 indicates the strand of he original fragment sequenced. When in doubt, you can run all 3 on one sample. If the counts from yes and reverse are roughly equal for most genes then the dataset is unstranded (i.e., no is the correct setting). If either yes or reverse produces much higher counts than the other then the appropriate setting is the one giving the higher counts. This will pretty much always be reverse these days. You can also just ask the people who did the library prep. what kit they used or whether it was dUTP-based.

ADD REPLYlink written 2.3 years ago by cpad011214k

Another method is to use . This will give you a likelihood how the strandedness of the transcripts correlate with the strandedness of the reads.

ADD REPLYlink written 2.3 years ago by michael.ante3.6k

Thanks for replying.

But my samples are stranded so why should I put --stranded = no but I will give it a try.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by 160
gravatar for 乙
2.3 years ago by
160 wrote:

Sorry for the bump, someone can answer please ?

ADD COMMENTlink written 2.3 years ago by 160
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1682 users visited in the last hour