Question

Using Seurat to take in a Normalised TPM Gene to Cell matrix as a Seurat Object

0

3.1 years ago

saad.yousuf • 0

Hello everyone, I am trying to import the following dataset into RStudio with Seurat: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75688 (filename: GSE75688_GEO_processed_Breast_Cancer_raw_TPM_matrix.txt.g)

This matrix contains normalized TPM values; it has not been log-transformed or otherwise processed.

What's the recommended way to import the txt matrix file into RStudio and into a Seurat object? I tried converting it to CSV in Excel, which imported easily, but I was thinking there could be something simpler.

Next, I was wondering what the most efficient route is to log-transform the matrix, filter it further (e.g. remove genes (rows) or cells (columns) with insufficient TPM values or too little expression across cells), select the highly variable genes (HVGs), and eventually perform dimensionality reduction (PCA) and cluster analysis. Are there any commands I need to keep in mind?

Are there any useful threads or tutorials on importing and using TPM files directly in Seurat, as in the case above?

Regards

format sequencing rna • 7.9k views

ADD COMMENT • link updated 2.9 years ago by jared.andrews07 ★ 16k • written 3.1 years ago by saad.yousuf • 0

score 2 · Answer 1 · 2021-03-26

2

3.1 years ago

jared.andrews07 ★ 16k

In short, don't. Seurat (and most other sound packages/analyses where differential expression is the end goal) requires raw counts. TPM values are generally unsuitable for between-sample comparisons. This has been explained many times in many places; you can search for any of the dozens of questions on this site relating to this.

If there is absolutely no way for you to get the raw data, you can try using limma-trend for DE.

ADD COMMENT • link 3.1 years ago by jared.andrews07 ★ 16k

0

I have a follow-up question to this. I did find several sources saying you should not use TPM values for DE analysis. However, I got confused when reading Seurat's tutorial, because it seems to me that they're using log(TPM+1) values there for DE.

First, FindAllMarkers() is used to calculate differentially expressed genes for each cluster. The reference for this function shows that the "data" slot is used by default to pull data from. Second, the "data" slot contains the normalized data, which is generated from raw counts with NormalizeData. By default this function uses the "LogNormalize" method. My understanding after reading the reference is that the method:

1. calculates the total number of counts in each cell
2. divides all feature counts within the cell by this total count
3. multiplies the values by the scaling factor 10,000
4. natural-log transforms the values

But aren't steps 1, 2 and 3 just creating TP10K values, a slightly different version of TPM? Moreover, the developers have stated that TPM values should be OK (https://bioinformatics.stackexchange.com/questions/5115/seurat-with-normalized-count-matrix). Have I misunderstood something here?
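To make the four steps concrete, here is a minimal sketch of the arithmetic for a single cell. It is plain Python for illustration only, not Seurat's actual code; the function name `log_normalize` and the toy gene names and counts are made up. The log step uses log1p, i.e. log(x + 1), which is how Seurat's documentation describes LogNormalize.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Illustrative LogNormalize-style arithmetic for ONE cell.

    counts: dict mapping gene name -> raw count (hypothetical toy input).
    """
    total = sum(counts.values())                # step 1: total counts in the cell
    return {
        gene: math.log1p(c / total * scale_factor)  # steps 2-4: divide, scale, log(x + 1)
        for gene, c in counts.items()
    }

# Toy cell with three made-up genes and 100 counts in total.
cell = {"GeneA": 5, "GeneB": 15, "GeneC": 80}
normalized = log_normalize(cell)

# Undoing the log recovers the "counts per 10,000" values from steps 1-3.
tp10k = {gene: math.expm1(v) for gene, v in normalized.items()}
```

Note that gene length never appears in this calculation, so the intermediate values are per-10,000 scaled counts rather than length-corrected values.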

ADD REPLY • link 2.9 years ago by resa • 0

0

TPMs usually account for gene length, which the normalized values derived from Seurat do not. If you haven't length-scaled the values, then I suppose you could still use them; otherwise, I imagine you'd be introducing noise. If you're comparing the relative expression of a gene between samples, then the length of said gene doesn't matter. The length of the gene only matters when trying to compare to other genes within the same sample, as longer genes will naturally have higher counts, thus making TPMs useful for within-sample comparisons.

Seurat's normalized values are more akin to CPM (or CP10k, I guess) than TPM. If you wanted to use CPM values, I imagine that's probably fine.
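A small worked example of the difference, as a sketch in plain Python with made-up numbers (the function names `cpm` and `tpm`, the gene names, counts, and lengths are all hypothetical). It assumes full-length read counts, where longer genes collect more reads:

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

def tpm(counts, lengths_kb):
    """Transcripts per million: divide by gene length first, then scale."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: r / total * 1e6 for gene, r in rates.items()}

# Two made-up genes with identical read counts but different lengths.
counts = {"short_gene": 100, "long_gene": 100}
lengths_kb = {"short_gene": 1.0, "long_gene": 4.0}

cpm_values = cpm(counts)              # both genes: 500,000
tpm_values = tpm(counts, lengths_kb)  # short_gene: 800,000; long_gene: 200,000
```

With equal counts, CPM treats the two genes as equally expressed, while TPM ranks the shorter gene higher within the sample. Between samples, a given gene's length is the same constant on both sides, which is why it drops out of the comparison.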

ADD REPLY • link 2.9 years ago by jared.andrews07 ★ 16k

0

Thanks for the good and prompt answer! I'm relatively new to RNA sequencing and wasn't aware of CPM and how it differs from TPM. It seems to me that TPM is sometimes used as a synonym for CPM (like here: https://github.com/satijalab/seurat/issues/1498) when the normalized data comes from UMI-based protocols. I could imagine that the reported lack of gene length bias with UMIs (https://europepmc.org/article/med/28529717) has led some to skip the gene length correction but still call the normalized values TPM.

ADD REPLY • link 2.9 years ago by resa • 0

1

I've been too hung up on bulk stuff lately - I rather forgot about the UMI aspect. TPM is probably as accurate a descriptor as any then, though a bit confusing.

ADD REPLY • link 2.9 years ago by jared.andrews07 ★ 16k