I was wondering if any of you could give me some advice on the following issues:
- All results differ when performing "absolute" deconvolution with the legacy CIBERSORT R source code (no.sumto1 & sig.score), the legacy CIBERSORT online tool, and the online CIBERSORTx tool (see the example copied below). Has the code changed between versions?
- I have very large matrices to deconvolute (up to 780 MB). Is it possible to split up the matrices for deconvolution of the samples I want to compare?
- How does CIBERSORT exclude non-hematopoietic genes during signature matrix generation? Is there an internal database of "hematopoietic" and "non-hematopoietic" genes? If so, which gene annotation tool, and which snapshot of it, was used to define the "non-hematopoietic" genes? I am asking because I am concerned that signature matrix generation will be affected by which annotation tool and version I use to annotate the reference file, since tool-specific differences in gene names (alternative gene names, novel transcripts, etc.) may not match the CIBERSORT internal record.
- For the imputation of cell fractions, how important is it to match the gene annotation of the signature and mixture files when annotation differences are caused by different annotation tools/versions? Should I assume the results are always penalised when the mixture and signature/reference files were annotated in different labs (with a different annotation tool/version)? How would an annotation difference affecting more than 15% of genes between signature and mixture change deconvolution performance?
- Error during the imputation of cell fractions using CIBERSORTx: "WARNING: reaching max number of iterations". What is the cause and how can I solve it?
Hi, I'm using CIBERSORTx to check cell types in RNA-seq data, and I'm not sure which count normalization is better for the input: CPM or TPM. I'm using LM22 as the signature matrix. The tutorial recommends normalizing the mixture file the same way as the signature matrix; however, LM22 is microarray-based and I don't know how it was normalized. I also only have access to the count matrix, and using CPM would be easier, but I'm not sure it's the best way. What signature matrix did you use? How did you normalize your counts?
Hi, I'm pretty sure LM22 is RMA-normalised; I read that somewhere in the CIBERSORTx paper, possibly in the supplementary information. CIBERSORTx has a batch-correction option that removes technical differences between a signature matrix and a mixture file derived from different platforms (like your case: a signature matrix derived from microarray [LM22] and a mixture file from RNA-seq data), so I think either CPM or TPM is fine. However, I would personally go for TPM because the authors of CIBERSORTx mainly used TPM files for their own analyses, as reported in their paper... As you can see, I am only following the protocols set by the CIBERSORTx team; if anyone has a better explanation of which normalisation method to use, I would love to hear it!
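For what it's worth, the practical difference between the two is easy to see on a toy count matrix. This is just an illustrative sketch, not anything CIBERSORTx does internally, and the gene lengths are made-up placeholder values: CPM needs only the counts, while TPM additionally needs gene lengths, which is often the deciding factor when you only have a count matrix.

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples.
counts = np.array([
    [100, 200],
    [300, 100],
    [600, 700],
], dtype=float)

# Hypothetical gene lengths in kilobases (placeholder values).
gene_lengths_kb = np.array([2.0, 1.0, 4.0])

# CPM: scale each sample so its counts sum to one million.
cpm = counts / counts.sum(axis=0) * 1e6

# TPM: first divide by gene length (reads per kilobase),
# then scale each sample to one million.
rpk = counts / gene_lengths_kb[:, None]
tpm = rpk / rpk.sum(axis=0) * 1e6

# Both cpm and tpm columns each sum to exactly 1e6, but the
# per-gene values differ because TPM corrects for gene length.
```

So both are library-size normalisations; TPM just adds a within-sample length correction before rescaling.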
Hi! I was wondering if there have been any follow-ups to the questions posted? I have the same questions in mind, especially the second one. I'd be really grateful if you could share any updates/findings. If not, have you tried contacting the CIBERSORTx team? (I did, but haven't heard anything back from them...)
Hi, I'm also wondering the same thing about matrices being too large; the site won't let me upload them. Just to try CIBERSORTx, I reduced the size of my matrices by leaving out some samples, but then it gave an error after a long run: "cannot allocate memory". I contacted the team, but no response so far. If somebody has already tackled this problem, it would be very helpful if you could share your solution.
Hi berry, I have run into the same error before, and then realised it can mean that the quota limit has been exceeded (>1 GB) - basically you have to take the size of the results file into consideration too. What I did was keep reducing the size of the matrices until there was also enough space for my results file. I asked the team about using large matrices as input and they advised running CIBERSORTx via Docker. Hope this helps!
Hi jill.syx, thank you very much for your answer; it's very helpful. I have one last question: how did you keep reducing the size of your files? Is there a way to compress these txt-formatted matrices? Or did you also have to remove some samples? (My single-cell reference matrix is too large even though I reduced it to three 10X samples.)
Hi berry, I presume you're talking about the first module of CIBERSORTx (creating the signature matrix), and that when you say a "sample" you mean a certain cell type rather than a single cell.
Yes, I did remove some samples, but what I removed most were cells within a sample; e.g. if I have 1000 cells within sample A (or cell type labelled A), I remove half of them by random selection so that I'm left with only 500 cells. I also kept this consistent across samples (e.g. if I remove 50% of the cells in sample A, I also remove 50% in samples B, C, etc.). This shouldn't affect the results too much, because based on what I read and understood from the CIBERSORTx paper, the tool does something similar itself: by default it takes only 50% of the cells from a sample to build the signature matrix (also by random selection without replacement; you can change this to any percentage you want under "sampling" in the "single cell input options" tab). If you're afraid this will increase the variability of the results, I suggest doing several repeats to see whether you still get the same deconvolution results in the end, though this is extremely time-consuming...
Another thing I did to reduce the file size was filter out genes with no expression across all cells.
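In case it helps, the two size-reduction steps (dropping all-zero genes, then keeping the same fraction of cells from every cell type by random selection without replacement) can be sketched roughly like this. This is just my own illustrative sketch, assuming a genes-by-cells table whose column headers are the cell-type labels; all names and values below are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy genes-x-cells matrix; duplicate column names are the
# cell-type labels of the single-cell reference.
labels = ["A"] * 4 + ["B"] * 4
ref = pd.DataFrame(
    rng.integers(0, 5, size=(6, 8)),
    index=[f"gene{i}" for i in range(6)],
    columns=labels,
)
ref.iloc[2] = 0  # simulate a gene with no expression in any cell

# 1) Filter out genes with zero counts across all cells.
ref = ref.loc[(ref != 0).any(axis=1)]

# 2) Keep the same fraction of cells (here 50%) from every
#    cell type, sampled without replacement.
frac = 0.5
keep = []
for label in pd.unique(ref.columns):
    positions = np.flatnonzero(ref.columns == label)
    keep.extend(rng.choice(positions, size=int(len(positions) * frac),
                           replace=False))
subsampled = ref.iloc[:, sorted(keep)]
```

After that you can write the reduced table back out as a tab-separated txt file, which is usually much smaller than the original.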
I hope all this makes sense to you! Happy to answer any further questions that you might have!
Thank you very much!