Featurecount is 0 while TPM is 6289.54
0
1
Entering edit mode
2.4 years ago
zhangmeng21 ▴ 10

Recently, I'm reading a paper and checked the attachment(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159919). I found that read counts for gene ENSMUSG00000064356 in file GSE1599_PolyA_read_counts is 0 in sample mES_Chr_rep_1_polyA. While the TPM of this gene in file GSE159919_polyA_gene_expresion is 6289.54. That's confusing. Then I asked the author of this paper. He/She answered, "Both are correct due to the fundamental difference between feature count and Kallisto.". Can someone tell me is that possible? The feature count is 0 while TPM is more than 5,000. I checked the calculation method of TPM, and if the feature count is zero then the TPM should also be 0.

TPM Count Feature • 1.5k views
ADD COMMENT
2
Entering edit mode

yeah, sounds a doggy explanation of them to me as well

(on the other hand, I'm a salmon kinda guy, so .... ;) )

ADD REPLY
1
Entering edit mode

It is a gene on chrM that overlaps another gene in the same orientation. That is probably one of those edge cases where the featureCount defaults skip that gene (hence count 0) because of the ambiguity, and the pseudoaligners use their EM (or whatever stats magic) algorithm to still assign those reads to any (or both) of the features.

ENSMUSG00000064356 is mt-Atp8

enter image description here

ADD REPLY
1
Entering edit mode

would featurecount not 'count' the reads in the non-overlap part (if remember well that's how htseq-count will do it)? or will it skip it completely (apparently)?

ADD REPLY
1
Entering edit mode

Reads could also be multi-mapped which would cause featureCounts to omit them.

ADD REPLY
1
Entering edit mode

The screenshot is only 500bp, so that left gene is probably like 250bp. Depending on read length and library method maybe there is simply not left in terms of "truely unique" reads for that gene. Just thinking aloud.

ADD REPLY
1
Entering edit mode

There could also be two other possibilities. The reads being quantified to those genes by kallisto could actually map better to an intergenic region (in which case Salmon + selective alignment would have quantification closer to featureCounts). The second possibility is that there are duplications or repeats of this gene causing multimapping, in which case featureCounts ignores multimappers by default, but kallisto would build a model and provide quantification. Salmon would probably give similar results to kallisto if this were the case.

ADD REPLY
0
Entering edit mode

Thanks for your reply, ATpoint. Does 'skip the gene' mean discarding reads aligned to the overlapped region? If it is, it seems that the overlapped region is just a small part and the high TPM value and 0 read count means that all the reads belonging to mt-Atp8 are all mapped to the overlapped region. But I think it is quite rare... That's my thought...

ADD REPLY
0
Entering edit mode

Does 'skip the gene' mean discarding reads aligned to the overlapped region?

Yes, could be but this is just a guess.

After all, in the end you have to decide for one method and go along with it. There will always be edge cases where one or the other method performs better. I prefer salmon for RNA-seq and only use featureCounts for DNA-seq such as ChIP-seq to get a count matrix.

ADD REPLY
0
Entering edit mode

You also have the -O (as in capital letter o) for allowMultiOverlap where a read is attributed to each feature it overlaps. To not massively increase the count, you can specify −−fraction so each read counts its relative fractional part towards the respective features. It might not work ideal, as I for example end up with count discrepancies between what the featureCount summary reports to have assigned and I sum up from the count matrix.

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6