Question: Obtaining a TPM matrix from counts
tadasbar wrote (16 months ago):

Hello. I'm working with mouse RNA-seq data from public GEO datasets. Most of the datasets I've encountered are read count matrices with MGI gene symbols. Is there a way to convert counts to TPM values using only the gene symbols and the count values? What I'm missing is the transcript length, which I can obtain using biomaRt, but most genes have multiple transcripts, so I'm not sure which one to pick.

Tags: rna-seq, R
ATpoint wrote (16 months ago):

It is not that trivial. To properly normalize for length you would need to know which isoforms are expressed in each sample. Say in sample1 only isoform1 was active, which is 5 kb long, but in sample2 only isoform2 was active, which is 7.5 kb long. Given equal expression, the counts in sample2 would be 50% higher (7.5 / 5 = 1.5), which one needs to account for. From a simple count matrix you cannot do this.
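The length effect described above can be checked with quick arithmetic. This is only a sketch under a simplifying assumption (reads are sampled uniformly along each transcript, so expected counts scale with length); the function name and the `reads_per_kb` value are made up for illustration.

```python
def expected_counts(n_molecules, length_kb, reads_per_kb=10):
    """Expected reads under a toy model: every kilobase of every
    transcript molecule yields the same number of reads (assumption)."""
    return n_molecules * length_kb * reads_per_kb

# Same number of transcript molecules in both samples,
# but a different isoform (and therefore length) is active.
counts_sample1 = expected_counts(n_molecules=100, length_kb=5.0)   # isoform1
counts_sample2 = expected_counts(n_molecules=100, length_kb=7.5)   # isoform2

# Raw counts differ even though expression is identical.
count_ratio = counts_sample2 / counts_sample1

# Dividing by the length of the isoform that was actually active
# removes the difference ...
rate_sample1 = counts_sample1 / 5.0
rate_sample2 = counts_sample2 / 7.5

# ... whereas dividing both samples by one fixed per-gene length does not.
fixed_length_kb = 5.0
naive_ratio = (counts_sample2 / fixed_length_kb) / (counts_sample1 / fixed_length_kb)

print(count_ratio, rate_sample1 == rate_sample2, naive_ratio)
```

A single length per gene, whichever transcript it comes from, leaves this difference in place, which is why the per-sample isoform information matters.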

The most proper way, imho, would be to download the raw fastq files (see the post "Fast download of FASTQ files from the European Nucleotide Archive (ENA)"), quantify against the reference transcriptome with salmon, and then use tximport to aggregate counts to the gene level while correcting for the length of the isoforms that are active in the respective samples. Normalized counts could then be obtained with DESeq2 (see the DESeq2 manual, e.g. the vst or rlog transformation) or edgeR. Any of these choices is better than TPM, as TPM fails to correct for library composition differences between samples.
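To illustrate the library composition point, here is a toy comparison of per-million scaling with a simplified median-of-ratios size factor, the idea behind DESeq2's size factor estimation. It is a sketch, not the DESeq2 implementation, and the counts are invented: four genes are unchanged between the two samples and one gene is ten times higher in sample B.

```python
import math
from statistics import median

def cpm(counts):
    """Counts per million: scale by the sample's total count only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def median_of_ratios_size_factors(samples):
    """Simplified median-of-ratios size factors.
    samples: list of per-sample count lists, genes in the same order."""
    log_ratios = [[] for _ in samples]
    for g in range(len(samples[0])):
        gene_counts = [s[g] for s in samples]
        if min(gene_counts) <= 0:
            continue  # genes with a zero in any sample are skipped
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in gene_counts) / len(gene_counts)
        for i, c in enumerate(gene_counts):
            log_ratios[i].append(math.log(c) - log_geo_mean)
    # per sample: median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]

# Invented counts: genes 1-4 unchanged, gene 5 is 10x higher in sample B.
sample_a = [100, 200, 300, 400, 100]
sample_b = [100, 200, 300, 400, 1000]

cpm_a, cpm_b = cpm(sample_a), cpm(sample_b)
size_factors = median_of_ratios_size_factors([sample_a, sample_b])
norm_a = [c / size_factors[0] for c in sample_a]
norm_b = [c / size_factors[1] for c in sample_b]

print("CPM gene 1:", cpm_a[0], cpm_b[0])          # differs between samples
print("size factors:", size_factors)
print("normalized gene 1:", norm_a[0], norm_b[0])  # stays comparable
```

In this toy case the single highly expressed gene inflates the total of sample B, so per-million scaling pushes the unchanged genes down, while the median-based factor is driven by the majority of genes and leaves them comparable. TPM shares this behaviour because it also rescales each sample to a fixed sum.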

What do you want to do, differential analysis or any clustering/ML application?


Thanks for the answer. I'm applying ML algorithms to scRNA-seq data and have been getting decent results by simply using the 'transcript_length' values from biomaRt to get the TPM values; I was just wondering whether there might be a more correct approach. I'm somewhat hesitant to deal with raw data, as it seems to take an obnoxious amount of time to process.

(reply written 16 months ago by tadasbar)
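For reference, the counts-to-TPM calculation mentioned in this reply can be sketched as follows. It assumes exactly one length per gene, which is the approximation discussed in the answer above; the gene names, counts, and lengths are hypothetical, and how a single length is picked per gene (longest transcript, median, and so on) is left open.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for ONE sample to TPM.

    counts:     dict of gene -> read count
    lengths_bp: dict of gene -> length in base pairs (one value per gene,
                an approximation when several isoforms exist)
    """
    # Step 1: reads per kilobase, which corrects for gene length.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: rescale so the values of the sample sum to one million.
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / scale for gene, value in rpk.items()}

# Hypothetical single-sample input.
counts = {"GeneA": 500, "GeneB": 1000, "GeneC": 0}
lengths_bp = {"GeneA": 1000, "GeneB": 4000, "GeneC": 2000}

tpm = counts_to_tpm(counts, lengths_bp)
for gene, value in tpm.items():
    print(gene, round(value, 1))
```

GeneB has twice the counts of GeneA here but ends up with half its TPM because it is four times longer, so the chosen length directly drives the result.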

I am not too familiar with the single-cell world, but there are dedicated normalization methods for single-cell data that respect the characteristics of these data, such as zero inflation. Don't use TPM, as simple per-million methods have been shown multiple times to perform poorly for inter-sample (or here inter-cell) comparison. Check out zinbwave, for example, and search the web for recommendations on how to transform raw counts from single-cell data.

(reply written 16 months ago by ATpoint)