Recently, I'm reading a paper and checked the attachment(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159919). I found that read counts for gene ENSMUSG00000064356 in file GSE1599_PolyA_read_counts is 0 in sample mES_Chr_rep_1_polyA. While the TPM of this gene in file GSE159919_polyA_gene_expresion is 6289.54. That's confusing. Then I asked the author of this paper. He/She answered, "Both are correct due to the fundamental difference between feature count and Kallisto.". Can someone tell me is that possible? The feature count is 0 while TPM is more than 5,000. I checked the calculation method of TPM, and if the feature count is zero then the TPM should also be 0.
yeah, sounds a doggy explanation of them to me as well
(on the other hand, I'm a salmon kinda guy, so .... ;) )
It is a gene on chrM that overlaps another gene in the same orientation. That is probably one of those edge cases where the
featureCountdefaults skip that gene (hence count 0) because of the ambiguity, and the pseudoaligners use their EM (or whatever stats magic) algorithm to still assign those reads to any (or both) of the features.
ENSMUSG00000064356 is mt-Atp8
would featurecount not 'count' the reads in the non-overlap part (if remember well that's how htseq-count will do it)? or will it skip it completely (apparently)?
Reads could also be multi-mapped which would cause
featureCountsto omit them.
The screenshot is only 500bp, so that left gene is probably like 250bp. Depending on read length and library method maybe there is simply not left in terms of "truely unique" reads for that gene. Just thinking aloud.
There could also be two other possibilities. The reads being quantified to those genes by kallisto could actually map better to an intergenic region (in which case Salmon + selective alignment would have quantification closer to featureCounts). The second possibility is that there are duplications or repeats of this gene causing multimapping, in which case featureCounts ignores multimappers by default, but kallisto would build a model and provide quantification. Salmon would probably give similar results to kallisto if this were the case.
Thanks for your reply, ATpoint. Does 'skip the gene' mean discarding reads aligned to the overlapped region? If it is, it seems that the overlapped region is just a small part and the high TPM value and 0 read count means that all the reads belonging to mt-Atp8 are all mapped to the overlapped region. But I think it is quite rare... That's my thought...
Yes, could be but this is just a guess.
After all, in the end you have to decide for one method and go along with it. There will always be edge cases where one or the other method performs better. I prefer
salmonfor RNA-seq and only use featureCounts for DNA-seq such as ChIP-seq to get a count matrix.
You also have the
-O(as in capital letter o) for allowMultiOverlap where a read is attributed to each feature it overlaps. To not massively increase the count, you can specify
−−fractionso each read counts its relative fractional part towards the respective features. It might not work ideal, as I for example end up with count discrepancies between what the featureCount summary reports to have assigned and I sum up from the count matrix.