Question: counting reads with featureCounts - uniquely assign reads to transcripts
0
gravatar for Pfs
3.8 years ago by
Pfs490
United States
Pfs490 wrote:

Hello, 

I have a question regarding featureCounts behavior.

Assume my GTF file has 1 gene, comprising 2 transcripts (G1T1, G1T2), with a total of 3 exons (E1,E2,E3): 
G1T1 consists of exons E1 and E2, while G1T2 consists of exons E3 and E2. 

Assume my SAM file has 4 reads, as shown below.
R1 overlaps E1;
R2 is spliced between E1 and E3;
R3 is spliced between E1 and E2;
R4 is spliced between E3 and E2.

|____E1____|                   |____E2____|   Transcript G1T1
               |____E3____|    |____E2____|   transcript G1T2

R1------>
R2---------....--->
R3---------....................---->
                   R4-----.....---->


When using featureCounts with option "-g transcript_id", only R1 is assigned and the counts for G1T1 is 1.
When using featureCoutns with option "-g transcript_id -O", all reads are assigned and the counts for G1T1 is 4 (R1,R2,R3,R4), whereas the count for G1T2 is 3 (R2,R3,R4).

I understand the logic behind allowing multiply overlapping features ("-O"), however I would expect that at the transcript level R3 is uniquely assigned to G1T1 and R4 uniquely assigned to G1T2. In other words, I would expect that even though the parts of a spliced read are mapped to multiple features, when the reads are counted at a level where a distinction between metafeatures is possible, this distinction is used to uniquely assign the reads.


I feel that without "-O", the counts are unnecessarily low and with the option "-O" the counts are unnecessarily inflated. What am I missing here?

 

Thanks in advance!

rna-seq counting reads • 2.7k views
ADD COMMENTlink modified 3.8 years ago by Devon Ryan89k • written 3.8 years ago by Pfs490
1
gravatar for Devon Ryan
3.8 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

One should really not use featureCounts (even with -O) to get transcript-level metrics. If you really want these numbers, use something like eXpress or another tool that has an expectation-maximization step.

BTW, what you're missing is that exon E2 is shared between both transcripts, so any alignment to it will necessarily by non-unique (aka ambiguous).

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Devon Ryan89k
0
gravatar for mark.ziemann
3.8 years ago by
mark.ziemann1.1k
Australia/Mebourne/Geelong/Deakin
mark.ziemann1.1k wrote:

Indeed, the counts can be quite low from featureCounts, but if you use the default gene-wise counting with stranded seq it is a bit better. The transcript-wise counting will be very strict without "-O", especially for genes with many splice types, so you might have better luck with counting based on exons first followed by DEXSeq differential analysis. In featureCounts, you can use the "-R" option to see the read assignment result for each read to help you fine-tune further. 

ADD COMMENTlink written 3.8 years ago by mark.ziemann1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 755 users visited in the last hour