Hi everyone,
I'm working on bulk RNA-seq data from The Cancer Genome Atlas (TCGA), which includes both tumor and matched normal tissue samples. My goal is to profile immune cell-type abundances using CIBERSORTx.
Here’s what I’ve done so far:
Input data: TPM-normalized gene expression matrix from TCGA (human samples)
Gene identifiers: HUGO gene symbols (e.g., CD3D, NKG7)
Signature matrix: LM22 (default immune signature)
Deconvolution parameters: absolute mode enabled, batch correction disabled
Unfortunately, the output isn't very promising:
All P-values are > 0.05 (range: 0.13 to 0.91)
RMSE values range between ~1.06 and 1.21
Correlation values are mostly near zero or negative (median ~ -0.001)
I’ve double-checked that the input format matches expectations, and that gene symbols are clean. But I’m not sure where the issue lies.
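For reference, the input check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact code I ran. It assumes a genes x samples CSV with gene symbols in the first column; the function name `check_tpm_matrix`, the toy matrix, the three-gene signature list, and the "max below 50 suggests log scale" cutoff are all illustrative assumptions.

```python
import csv
import io
from collections import Counter


def check_tpm_matrix(csv_text, signature_genes):
    """Sanity-check a genes x samples TPM matrix supplied as CSV text.

    Assumed layout (hypothetical): header row with sample names,
    first column with gene symbols, one numeric column per sample.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    samples = rows[0][1:]
    genes = [r[0] for r in rows[1:]]
    values = [[float(x) for x in r[1:]] for r in rows[1:]]

    # In linear space, each TPM column should sum to roughly 1e6.
    col_sums = {s: sum(v[j] for v in values) for j, s in enumerate(samples)}
    max_value = max(max(v) for v in values)

    # Repeated gene symbols can be rejected or collapsed unpredictably.
    counts = Counter(genes)
    duplicated = sorted(g for g, n in counts.items() if n > 1)

    # Fraction of signature genes that are present in the mixture matrix.
    sig = set(signature_genes)
    overlap = len(sig & set(genes)) / len(sig)

    return {
        "column_sums": col_sums,
        "max_value": max_value,
        # Heuristic assumption: a small maximum hints at log-scaled input.
        "looks_log_scaled": max_value < 50,
        "duplicated_symbols": duplicated,
        "signature_overlap": overlap,
    }


# Toy data, made up purely to exercise the function.
demo = "gene,S1,S2\nCD3D,400000,10\nNKG7,600000,999990\nCD3D,0,0\n"
report = check_tpm_matrix(demo, ["CD3D", "NKG7", "MS4A1"])
print(report)
```

If the column sums are far from 1e6, the values look log-scaled, or the overlap with the signature genes is low, the problem is likely in the input rather than in the choice of signature matrix.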
Here’s what I’m wondering:
Is LM22 appropriate for TCGA tumor and normal tissues, especially for immune-rich samples like lymph nodes or solid tumors?
Should I consider building a custom signature matrix using single-cell data from similar tissue types?
Are TPM values directly usable or should I apply further transformations (e.g., log2, quantile normalization)?
What RMSE and correlation thresholds are acceptable for considering a sample's deconvolution reliable?
Could the tumor microenvironment complexity or stromal content be affecting CIBERSORTx's performance?
Files I can attach if helpful:
A subset of my input expression matrix (.csv, TPM values)
The CIBERSORTx results table (P-values, RMSE, correlation, proportions)
A brief README with sample type descriptions (tumor vs. normal)
Any suggestions or guidance would be hugely appreciated. I'd also be open to sharing code or steps if helpful for debugging.
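As a starting point for debugging, here is a small sketch of how the per-sample fit metrics could be summarized from the results table. The column names "P-value", "Correlation", and "RMSE" are assumptions about the export and may need adjusting; the function name `summarize_results` and the three demo rows are made up for illustration and are not my actual results.

```python
import csv
import io
import statistics


def summarize_results(csv_text, alpha=0.05):
    """Summarize per-sample fit metrics from a results table given as CSV text.

    The column names used below are assumed; rename them to match the
    actual header of the exported table.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    p = [float(r["P-value"]) for r in rows]
    corr = [float(r["Correlation"]) for r in rows]
    rmse = [float(r["RMSE"]) for r in rows]
    return {
        "n_samples": len(rows),
        # Samples whose deconvolution P-value falls below the cutoff.
        "n_significant": sum(x < alpha for x in p),
        "p_range": (min(p), max(p)),
        "median_correlation": statistics.median(corr),
        "rmse_range": (min(rmse), max(rmse)),
    }


# Toy rows, invented only to show the expected table shape.
demo = (
    "Mixture,P-value,Correlation,RMSE\n"
    "S1,0.20,0.1,1.1\n"
    "S2,0.60,-0.2,1.2\n"
    "S3,0.03,0.5,0.9\n"
)
summary = summarize_results(demo)
print(summary)
```

Running the same summary separately on the tumor and normal samples would show whether the poor fit is specific to one group.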
Thanks in advance!
Best regards,