Expression cutoff for UMI
2.9 years ago
JulianC ▴ 30

Hi!

It is generally accepted that a gene is considered expressed if its expression value, measured as FPKM or TPM, is greater than 0.5 (https://www.ebi.ac.uk/gxa/help/index.html, in the paragraph 'Baseline expression results'). I'm wondering what this cutoff is when gene expression is measured in UMI (unique molecular identifier) counts. I could not find this information, and I am working with a dataset produced by Cell Ranger (10x Genomics). I would like to know whether there is an expression cutoff for UMI counts, or whether this measure can be converted into TPM. Thank you in advance!

UMI Gene expression • 2.8k views
2.9 years ago

You can easily convert UMI counts into TPM by summing the UMI counts for all transcripts to get total_umi, then dividing the UMI count for each transcript by this and multiplying by 1,000,000:

    total_umi = sum(umi_counts)
    TPM = umi_counts * 1000000 / total_umi
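
For anyone who wants to try this on a single cell, here is a minimal Python sketch of the same scaling. The function name `umi_to_per_million` and the toy counts are made up for illustration; it assumes the input is the list of UMI counts for one cell, one entry per gene.

```python
def umi_to_per_million(umi_counts):
    """Scale one cell's UMI counts so that they sum to 1,000,000."""
    total_umi = sum(umi_counts)
    if total_umi == 0:
        # A cell with no UMIs has nothing to rescale.
        return [0.0 for _ in umi_counts]
    return [count * 1_000_000 / total_umi for count in umi_counts]


# Toy cell with four genes (made-up counts, not real data).
cell = [6, 0, 3, 1]
scaled = umi_to_per_million(cell)
print(scaled)
```

The scaling is applied per cell, so each cell gets its own total_umi.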


However, you should be aware that UMI counts are closer than any other measure in RNA-seq to being an absolute measure, rather than a compositional measure (which TPM is).

BTW, I'd question the idea that 0.5 FPKM/TPM is a good cutoff for expression, if for no other reason than that 0.5 FPKM is very different from 0.5 TPM. In many samples I've looked at, 0.5 FPKM = 2.5 TPM, but this relationship varies from sample to sample; that's kind of the point. It's also why it's so hard to establish a cutoff for expression.

If you have one UMI from a transcript, and that UMI did come from that transcript, then that transcript is expressed! (Of course, you can't know for certain that the read did come from the transcript.) Indeed, the sensitivity of scRNA-seq is probably low enough that plenty of expressed genes have 0 UMI counts.


Thank you @i.sudbery for the explanation. I suppose that the 'total umi' is calculated per sample (for a single-cell dataset, this means that each cell has its own 'total umi'). I think I have a bad dataset, since the sum of all UMIs in one of the cells (the others are not very different) is 1249, calculated over a total of 23977 genes. This means that a gene in that cell with UMI = 6 will have TPM = 6*10^6/1249 = 4803.8, which seems unrealistic.
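
To make that arithmetic concrete, here is a small Python sketch using only the total quoted above (1249 UMIs in the cell). The variable names are illustrative. It also shows the per-million value of a single UMI, which is the smallest non-zero value any gene in that cell can take.

```python
total_umi = 1249  # total UMIs in the cell, as quoted in the comment above
scale = 1_000_000 / total_umi

# Per-million value for a gene with 1 UMI and for a gene with 6 UMIs.
per_million = {umi: round(umi * scale, 1) for umi in (1, 6)}
for umi, value in per_million.items():
    print(f"UMI = {umi} -> {value} per million")
```

With so few UMIs per cell, every detected gene lands far above a 0.5 cutoff, so a threshold defined on that scale cannot separate anything here.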


Yes, I don't think the TPM measure was ever really intended for single-cell sequencing. Often in single-cell work the measure people use is "genes detected", and a single UMI count is enough to say a gene has been detected. I don't know what a normal number is, but I think it is usually in the low thousands.
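
As a rough illustration of "genes detected", the Python sketch below counts the genes in one cell with at least one UMI. The function name `genes_detected`, the `min_umi` parameter, and the toy counts are assumptions for the example, not part of any particular package.

```python
def genes_detected(umi_counts, min_umi=1):
    """Count genes in one cell whose UMI count is at least min_umi."""
    return sum(1 for count in umi_counts if count >= min_umi)


# Toy cell with five genes (made-up counts, not real data).
cell = [6, 0, 3, 1, 0]
n_detected = genes_detected(cell)
print(n_detected)
```

Raising `min_umi` gives a stricter definition of detection, but the default of 1 matches the single-UMI convention described above.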