HTSeq-count [Python] is not properly counting reads like Genomic Alignments [R] ?
1
0
Entering edit mode
2.9 years ago
▴ 180

Hello,

I am performing RNA-seq using DESeq2 that requires un-normalized raw reads data as input from the BAM files. In order to simplify my problem, I extracted the ERG gene from the BAM file (sampleA.bam sorted by position) as follow:

• Find the position ERG on the genome from GENCODE:

chr21 HAVANA gene 39751949 40033704 . - . gene_id "ENSG00000157554.18_3"; gene_type "protein_coding"; gene_name "ERG"; level 2; havana_gene "OTTHUMG00000090767.3_3"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";

• Extract the region from the BAM:

samtools view -h sampleA.bam "chr21:39751949-40033704" > ERG_sampleA.sam

• Convert SAM to BAM

samtools view -Sb ERG_sampleA.sam > ERG_sampleA.bam

• Display on IGV with the counts to verify:

Zoom in:

So, we can expect to obtain reads when we count using HTSeq-count [Python] and GenomicAlignements [R]. PS: as per definition, in the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons.

Counting reads with HTSeq-count with Python

The following command is issued to count the raw reads of sampleA:

htseq-count -f bam -m union -i gene_id -r pos -s yes ERG_sampleA.bam gencode.v28lift37.annotation.gtf > erg.txt


and this is the following output (removed all IDs that had 0 reads):

ENSG00000157554.18_3    28              <- ERG gene
ENSG00000231480.1_3 275                 <- SNRPGP13 gene in ERG region
__no_feature    48474
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  105


Counting with GenomicAlignements with R:

genomeAnnot <- "gencode.v28lift37.annotation.gtf"
txdb <- makeTxDbFromGFF(genomeAnnot, format = "gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
sampleA <- "ERG_sampleA.bam"
params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isPaired = T, isProperPair = T, isUnmappedQuery = F, hasUnmappedMate = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F), mapqFilter = 10)


and this is the following output:

One expects that we obtain similar number of counted reads between the two methods. So, I would highly appreciate if someone can give me an explanation on why there are such differences. Thank you in advance.

Other things I tried:

• Downloaded from TCGA, for the same patient, the BAM file and the associated HTSeq-count file. I re-did the counting using the same files and using the same command provided by TCGA, I obtained exactly same result.
• Changed the annotation file and used the ENSEMBL but was having 0 reads
RNA-Seq HTSeq-count GenomicAlignments • 1.5k views
0
Entering edit mode

try adding --stranded=no to your htseq command and also in R, you are filtering reads by q-value (10). You can also try removing it when you are comparing.

1
Entering edit mode

I was trying to reproduce the your observations with commands you have furnished with a sample file I have. I made following changes:

1. I looked for RPL3 gene instead of gene mentioned in OP
2. I added --stranded=no to htseq function.
3. Removed few params:

params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isUnmappedQuery = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F))

Results are below. Please note that results exactly do not match. With few more param optimizations one can get matching numbers:

0
Entering edit mode

Your comment has solved my problem and as you have said, turning off stranded mode, I will obtain more reads counted on the gene. So, I would like to ask you why does this option outputs very few reads when my samples are sequenced in paired-end and stranded ?

2
Entering edit mode

copy/pasted from what is difference between strand specific in rnaseq analysis from htseq-count (upvote OP: Devon Ryan ) :

Which setting to use depends on how the sequencing libraries were prepared, though likely one of no or reverse are correct for anything sequenced in the past few years. no is appropriate for datasets that are not strand-specific. reverse is correct of libraries made with a dUTP-based method (this is the most common method). yes is appropriate for older datasets where read #1 indicates the strand of he original fragment sequenced. When in doubt, you can run all 3 on one sample. If the counts from yes and reverse are roughly equal for most genes then the dataset is unstranded (i.e., no is the correct setting). If either yes or reverse produces much higher counts than the other then the appropriate setting is the one giving the higher counts. This will pretty much always be reverse these days. You can also just ask the people who did the library prep. what kit they used or whether it was dUTP-based.

1
Entering edit mode

Another method is to use http://rseqc.sourceforge.net/#infer-experiment-py . This will give you a likelihood how the strandedness of the transcripts correlate with the strandedness of the reads.

0
Entering edit mode

But my samples are stranded so why should I put --stranded = no but I will give it a try.

0
Entering edit mode
2.9 years ago
▴ 180