Obtaining a TPM matrix from counts
5.0 years ago
tadasbar ▴ 10

Hello. I'm working with mouse RNA-seq data and am using public datasets from GEO. Most datasets I've encountered are read count matrices with MGI gene symbols. Is there a way to convert counts to TPM values using only the gene symbols and the count values? The thing I'm missing is the transcript length, which I can obtain using biomaRt, but most genes have multiple transcripts, so I'm not sure which one to pick.

RNA-Seq R • 3.1k views
5.0 years ago
ATpoint 82k

It is not that trivial. To properly normalize for length you would need to know which isoforms are expressed in each sample. Say in sample1 only isoform1 was active, which is 5 kb long, but in sample2 only isoform2 was active, which is 7.5 kb long. Given equal expression, the counts in sample2 would be 50% higher (7.5 kb / 5 kb = 1.5), which one needs to account for. From a simple count matrix you cannot do this.
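To make the arithmetic concrete, here is a small sketch of that two-sample example. The number of molecules and the reads-per-kb yield are made-up values chosen only for illustration; the one assumption that matters is that expected read counts scale with transcript length.

```python
# Hypothetical numbers: the same number of transcript molecules in both
# samples, but a different isoform of the gene is expressed in each one.
molecules = 100        # assumed identical expression in both samples
reads_per_kb = 2.0     # assumed read yield per kb of transcript per molecule

isoform_length_kb = {"sample1": 5.0, "sample2": 7.5}

# Expected read counts scale with the length of the expressed isoform.
counts = {s: molecules * reads_per_kb * kb for s, kb in isoform_length_kb.items()}
ratio = counts["sample2"] / counts["sample1"]

# Dividing by the length of the isoform that is actually expressed
# recovers equal expression in both samples ...
per_kb_correct = {s: counts[s] / isoform_length_kb[s] for s in counts}

# ... whereas dividing by one fixed length per gene (here isoform1's 5 kb)
# leaves the length artefact in sample2.
fixed_length_kb = 5.0
per_kb_fixed = {s: counts[s] / fixed_length_kb for s in counts}

print(counts, ratio, per_kb_correct, per_kb_fixed)
```

With a plain count matrix you only have `counts`, not `isoform_length_kb` per sample, which is why the correct normalization cannot be recovered from it.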

The most proper way, in my opinion, would be to download the raw fastq files (see the Biostars post "Fast download of FASTQ files from the European Nucleotide Archive (ENA)"), quantify against the reference transcriptome with salmon, and then use tximport to aggregate counts to the gene level while correcting for the length of the isoforms that are active in the respective samples. Normalized counts could then be obtained with DESeq2 (see the DESeq2 manual, e.g. the vst or rlog transformation) or edgeR (see https://support.bioconductor.org/p/121087/). Any of these choices is better than TPM, as TPM fails to correct for library composition differences between samples.
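The idea behind the length correction can be sketched as an abundance-weighted average of transcript lengths per gene and per sample. This is only an illustration of the principle, not tximport's actual code; the function name, the input dictionaries, and the fallback for unexpressed genes are all assumptions of mine.

```python
def weighted_gene_length(transcript_abundance, transcript_length):
    """Average transcript length of one gene in one sample, weighted by
    the abundance (e.g. TPM) of each isoform in that sample."""
    total = sum(transcript_abundance.values())
    if total == 0:
        # Assumed fallback for an unexpressed gene: unweighted mean length.
        return sum(transcript_length.values()) / len(transcript_length)
    weighted = sum(
        abundance * transcript_length[tx]
        for tx, abundance in transcript_abundance.items()
    )
    return weighted / total


# Hypothetical gene with two isoforms of 5 kb and 7.5 kb.
lengths = {"isoform1": 5000, "isoform2": 7500}

len_sample1 = weighted_gene_length({"isoform1": 10.0, "isoform2": 0.0}, lengths)
len_sample2 = weighted_gene_length({"isoform1": 0.0, "isoform2": 10.0}, lengths)
len_mixed = weighted_gene_length({"isoform1": 5.0, "isoform2": 5.0}, lengths)

print(len_sample1, len_sample2, len_mixed)
```

The per-sample gene length changes with isoform usage, and that is exactly the information a gene-level count matrix no longer carries.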

What do you want to do, differential analysis or any clustering/ML application?


Thanks for the answer. I'm applying ML algorithms to scRNA-seq data and have been getting decent results by simply using the 'transcript_length' values from biomaRt to get TPM values; I'm just wondering whether there might be a more correct approach. I'm somewhat hesitant to deal with raw data, as it seems to take an obnoxious amount of time to process.
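For reference, the calculation I mean is roughly the following, assuming one fixed length per gene (for example one 'transcript_length' picked per gene). The function name and the toy values are made up for illustration.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for one sample to TPM, given one length per gene.

    counts:     {gene: read count}
    lengths_bp: {gene: length in base pairs}, assumed fixed per gene
    """
    # Step 1: reads per kilobase, which removes the gene-length effect.
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(rate.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Step 2: rescale so that the values of the sample sum to one million.
    return {g: r / total * 1e6 for g, r in rate.items()}


# Toy sample with three hypothetical genes.
toy_counts = {"GeneA": 100, "GeneB": 200, "GeneC": 0}
toy_lengths = {"GeneA": 1000, "GeneB": 2000, "GeneC": 500}

tpm = counts_to_tpm(toy_counts, toy_lengths)
print(tpm)
```

GeneB has twice the counts of GeneA but is twice as long, so both end up with the same TPM. The weak point is the fixed length per gene, which ignores which isoform is actually expressed in each sample.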


I am not too familiar with the single-cell world but there are dedicated normalization methods for single-cell data that respect the characteristics of these data such as the zero inflation. Don't use TPM, as simple per-million methods have been shown multiple times to perform poorly for inter-sample (or here inter-cell) comparison. Check out zinbwave for example and search the web for recommendations on how to transform raw counts from single-cell data.
