Hello. I'm working with mouse RNA-seq data and am using public datasets from GEO. Most datasets I've encountered are read count matrices with MGI gene symbols. Is there a way to convert counts to TPM values using only the gene symbols and the count values? The thing I'm missing is the transcript length, which I can obtain using biomaRt, but most genes have multiple transcripts, so I'm not sure which one to pick.
It is not that trivial. To properly normalize for length you would need to know which isoforms are expressed in each sample. Say in sample1 only isoform1 was active, which is 5 kb long, but in sample2 only isoform2 was active, which is 7.5 kb long. Given equal expression, the counts in sample2 would be 50% higher (7.5 kb / 5 kb = 1.5), which one needs to account for. From a simple count matrix you cannot do this.
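To make the isoform problem concrete, here is a minimal sketch in Python/NumPy. All numbers are made up for illustration: `geneA` uses the 5 kb isoform in sample1 and the 7.5 kb isoform in sample2, and a hypothetical `geneB` has a 2 kb transcript in both. The function name `counts_to_tpm` is my own, not from any package. It compares TPM computed with the true per-sample lengths against TPM computed with one fixed length per gene, which is all you could do from a plain count matrix.

```python
import numpy as np

def counts_to_tpm(counts, lengths_kb):
    """TPM from a genes x samples count matrix and matching lengths in kb."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_kb, dtype=float)
    rate = counts / lengths_kb                 # reads per kilobase
    return rate / rate.sum(axis=0) * 1e6       # scale each sample to 1e6

# Hypothetical toy data: rows = geneA, geneB; columns = sample1, sample2.
# Both genes are equally expressed in both samples; geneA only switches
# from a 5 kb isoform (sample1) to a 7.5 kb isoform (sample2).
counts = np.array([[500.0, 750.0],
                   [200.0, 200.0]])

# Lengths that are actually correct per sample (unknown from counts alone).
true_lengths = np.array([[5.0, 7.5],
                         [2.0, 2.0]])

# One fixed length per gene, e.g. picked arbitrarily from an annotation.
fixed_lengths = np.array([[5.0, 5.0],
                          [2.0, 2.0]])

tpm_true = counts_to_tpm(counts, true_lengths)
tpm_fixed = counts_to_tpm(counts, fixed_lengths)

print("TPM with per-sample isoform lengths:\n", tpm_true)
print("TPM with one fixed length per gene:\n", tpm_fixed)
```

With the true lengths both genes come out identical across samples, whereas the fixed length makes geneA look upregulated and geneB downregulated in sample2, although nothing changed except the isoform.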
IMHO the most proper way would be to download the raw FASTQ files (see "Fast download of FASTQ files from the European Nucleotide Archive (ENA)"), quantify them against the reference transcriptome with salmon, and then use tximport to aggregate the counts to the gene level while correcting for the length of the isoforms that are active in the respective samples. Normalized counts can then be obtained with DESeq2 (see the DESeq2 manual, e.g. the rlog transformation) or edgeR (see https://support.bioconductor.org/p/121087/). Either of these choices is better than TPM, as TPM fails to correct for library composition differences between samples.
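To illustrate the library composition point, here is a rough Python/NumPy sketch with invented numbers. Four genes are unchanged between two samples, one gene is 10x higher in sample2, and sample2 is sequenced at half the depth. The function `median_of_ratios_size_factors` is my own simplified re-implementation of the median-of-ratios idea behind DESeq2's size factors, not DESeq2's actual code, and the gene lengths are assumptions.

```python
import numpy as np

def tpm(counts, lengths_kb):
    """TPM from a genes x samples count matrix and one length per gene (kb)."""
    rate = counts / lengths_kb[:, None]
    return rate / rate.sum(axis=0) * 1e6

def median_of_ratios_size_factors(counts):
    """Simplified median-of-ratios size factors (one per sample)."""
    keep = (counts > 0).all(axis=1)            # skip genes with any zero
    log_counts = np.log(counts[keep])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical counts: rows = 5 genes, columns = sample1, sample2.
# Genes 0-3 are truly unchanged; gene 4 is 10x higher in sample2.
# Sample2 has half the sequencing depth, so all its counts are halved.
counts = np.array([[100.0,   50.0],
                   [200.0,  100.0],
                   [300.0,  150.0],
                   [400.0,  200.0],
                   [1000.0, 5000.0]])
lengths_kb = np.array([1.0, 2.0, 1.5, 2.0, 1.0])   # assumed gene lengths

tpm_values = tpm(counts, lengths_kb)
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors

print("TPM:\n", tpm_values)
print("size factors:", size_factors)
print("size-factor normalized counts:\n", normalized)
```

In this toy case the TPM values of the four unchanged genes drop several-fold in sample2, simply because the one highly expressed gene takes up most of the "million". The size-factor normalized counts of those genes are equal across samples, which is the behaviour you want for between-sample comparisons.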
What do you want to do, differential analysis or any clustering/ML application?