Question: How to identify RNA-seq mapped reads to reference genome but not identified by featurecounts of Subread package?
gravatar for bk11
11 weeks ago by
bk1130 wrote:


We have performed RNA-seq of 60 sorted T cells from human and aligned to the reference genome (GRCh38) using HISAT2. About 80% of reads were mapped to the reference genome. However, when featureCounts of Subread package was used to count mapped reads, only 20-25% of mapped reads were counted to be associated with feature provided (exon and gene_id) of the reference gtf file (Homo_sapiens.GRCh38.87.gtf). I do not know what is happening here. I have two questions:

  1. First what can be possible reason for this observation?
  2. Is it possible that we can identify to which region of the reference genome the rest of the reads mapped?

I will appreciate your help.

hisat2 rna-seq featurecounts • 222 views
ADD COMMENTlink modified 11 weeks ago by kristoffer.vittingseerup1.4k • written 11 weeks ago by bk1130

Another recommendation, check if you used the proper strand selection in feature counts. Try what happens with different strand selectors -s0 -s1 and -s2 in featureCount.

ADD REPLYlink written 11 weeks ago by Michael Dondrup45k

When I changed the strand selector from -s2 to -s0 in featureCount, the assigned read count were doubled i.e. from 10-12% to 20-25%. It is still very low exonic reads.

ADD REPLYlink written 10 weeks ago by bk1130

That sounds suspiciously low.

  • Did you include ribosomal RNA genes in the annotation?
  • double check the annotation version matches the genome
  • was the gtf file truncated?
  • check what happens when you count whole transcripts instead of exon, including intronic regions
  • Indentify intergenic regions of high coverage and inspect them in IGV, etc. what is there?
  • to do this quickly you could for example 'invert' your gtf file to contain all gaps between exons and genes and run feature counts again
ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Michael Dondrup45k
gravatar for kristoffer.vittingseerup
11 weeks ago by
European Union
kristoffer.vittingseerup1.4k wrote:

I agree with you - im my experience 20-25% exonic reads are quite low.

Answer 1: There are multiple reasons you can observe a low exonic mapping rate. The main once are DNA contamination and a large number of pre-mRNA, but these are just a few. Another (more technical) is if you quantified your data as strand specific while it is not actually strand specific.

Answer 2: Yes - I would simply use featureCount with the Rsubread R package. Via this R package it is easy to define which regions featureCounts should quantify yourself. I would suggest distinguishing between exonic, intronic and intergenic reads.

Cheers Kristoffer

ADD COMMENTlink written 11 weeks ago by kristoffer.vittingseerup1.4k

Hi, Earlier I defined exon and gene_id features of reference gtf file in the featureCount. It looks like other regions that can be defined are: CDS, five_prime_utr, gene, Selenocysteine, start_codon, stop_codon, three_prime_utr and transcript.

How can I define intronic and intergenic reads?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by bk1130

The simplest way is to define introns as the regions between exon and the intergenic regions as those between genes. If you are actually looking for code to do this you should make a new question.

ADD REPLYlink written 10 weeks ago by kristoffer.vittingseerup1.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2044 users visited in the last hour