Hello. I'm working with mouse RNA-seq data and am using public datasets from GEO. Most datasets I've encountered are read count matrices with MGI gene symbols. Is there a way to convert counts to TPM values using only the gene symbols and the count values? The thing I'm missing is the transcript length, which I can obtain using biomaRt, but most genes have multiple transcripts, so I'm not sure which one to pick.
Thanks for the answer. I'm applying ML algorithms to scRNA-seq data and have been getting decent results by simply using the 'transcript_length' values from biomaRt to compute TPM. I'm just wondering whether there might be a more correct approach. I'm somewhat hesitant to deal with raw data, as it seems to take an obnoxious amount of time to process.
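For reference, the conversion being described can be sketched roughly as follows. This is only a minimal illustration, not a recommended pipeline: it assumes each gene's transcript lengths are collapsed to their median (an arbitrary choice, since the thread does not settle which transcript to pick), and the gene names, lengths, and counts are made-up toy values. The helper names `collapse_lengths` and `counts_to_tpm` are hypothetical.

```python
import numpy as np

def collapse_lengths(transcript_lengths):
    """Reduce each gene's transcript lengths (bp) to a single value.

    Assumption: the median is used here; max or mean are equally arbitrary.
    """
    return {gene: float(np.median(lengths))
            for gene, lengths in transcript_lengths.items()}

def counts_to_tpm(counts, lengths_bp):
    """Convert a genes x samples count matrix to TPM.

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    rate = counts / lengths_kb[:, None]           # reads per kilobase
    totals = rate.sum(axis=0, keepdims=True)      # per-sample scaling factor
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid dividing by zero
    return rate / safe_totals * 1e6

# Toy input: these are NOT real transcript lengths or counts.
genes = ["GeneA", "GeneB", "GeneC"]
transcript_lengths = {
    "GeneA": [1000, 3000],
    "GeneB": [1000],
    "GeneC": [500, 1500, 4000],
}
counts = np.array([
    [10, 0],   # GeneA in sample 1, sample 2
    [20, 5],   # GeneB
    [30, 0],   # GeneC
])

gene_length = collapse_lengths(transcript_lengths)
lengths_bp = [gene_length[g] for g in genes]
tpm = counts_to_tpm(counts, lengths_bp)
```

Each column of `tpm` sums to one million by construction, so the values are relative within a sample. The choice of length per gene changes the result, which is exactly the ambiguity raised in the question.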
I am not too familiar with the single-cell world, but there are dedicated normalization methods for single-cell data that account for the characteristics of these data, such as zero inflation. Don't use TPM: simple per-million methods have repeatedly been shown to perform poorly for inter-sample (or here, inter-cell) comparison. Check out zinbwave, for example, and search the web for recommendations on how to transform raw counts from single-cell data.