featureCounts isoform vs gene summarization
3
1
Entering edit mode
7.0 years ago
lkmklsmn ▴ 950

I am somewhat new to RNAseq data and I have been using featureCounts from the subread package to summarize reads/fragments across genomic features (genes, transcripts).

In particular I am curious about what parameter choice you guys use regarding overlap. The default setting ignores reads which overlap more than one feature. However, when summarizing at the isoform level (e.g. UCSCid), this choice will ignore all reads mapping into exons shared between isoforms and lead to very low reads. At the isoform level it seems to be the better choice to use the non-default setting -O counting reads overlapping features for each feature. In a subsequent step one could choose the highest expressed isoform to represent a given gene.

At the gene level I think the default setting makes more sense. Here, however, you will sum reads across all isoforms, inflating the count of reads of any "true" single isoform or RNA species.

So what option do you usually use?

1) Summarize reads across isoforms -> choose highest expressed to represent gene

2) Summarize reads across genes

I just wanted to get a feeling for what others are doing regarding this choice.

RNAseq featureCounts • 6.5k views
0
Entering edit mode

The only way to get meaningful counts for isoforms is with an expectation-maximization method (e.g., Express or Flux Capacitor). There's no way around that.

1
Entering edit mode
7.0 years ago

I think of counting at two levels:

1. Gene level
2. Exon level

I do not think that simple counting at the isoform level using tools like featureCounts is likely to be practically very useful in human genomes for the reasons noted in the question.  Isoform deconvolution/expression estimation is not an easy problem and recent literature suggests that, while there are some pretty good tools, there is much need for improvement.

0
Entering edit mode
7.0 years ago
EagleEye 7.1k
I do not think using featurecount for isoform level is an good idea. You can try the tools discussed in this post: A: How to determine alternative splicing read counts
0
Entering edit mode
5.9 years ago

In featureCounts you can select "the range" to check against. For example you can set "the range" to be transcript coordinates. You can then you group by option to bin your reads into transcripts. You'd need to use -t transcript and -g transcript_id. I think (not too sure) you will also need to specify -f option to tell featureCounts to summarise you reads based on transcript level..

p.s I haven't actually tried that.. I did use those option to counts reads per exons.. but in theory I don't see why that wouldn't work on the transcript level..

3
Entering edit mode

I would be very hesitant to recommend that to anyone. What featureCounts will end up doing is either (A) ignoring alignments because they overlap multiple isoforms, thereby killing your statistical power or (B) counting a given alignment toward multiple isoforms, thereby screwing up the statistics. This is actually a great use-case for Salmon, which didn't yet exist when this question was posted.

0
Entering edit mode

Yes, this is exactly what I thought afterwards :) I did think I should count read towards both isoforms, but didn't know how to proceed from that.. I'll have look at Salmon sounds interesting.