I am trying to do ssGSEA analysis on my dataset (RNA-seq data) using the GVSA package. I have a few questions:
For the input data, should I put in raw counts or normalised counts? At the moment, I am using deseq normalised counts - is that acceptable? The GenePattern ssGSEA module gave the options of "rank", "log" and "log.rank". How will the different normalisation method affect the result?
Second question is slightly long-winded. Please bear with me.
a) I've created my own GMT files for 29 gene sets based on a paper. However, after double checking, The gene sets contained 12 genes that is not available in my dataset. I am unsure how that happened and wanted to make sure that I have not filtered these genes out in my pre-processing steps. After checking, I realised that these genes were not present in my raw counts data (output from featureCounts). My FASTA files are aligned to GRCh37 and that is what I used for featureCounts too. Does anyone have this problem before? The missing genes contained pretty common genes so I am a bit perplexed why they are missing:
"MMP12" "MRC1" "CCL15" "IL4" "IFNA2" "CCL3" "C10orf54" "TRAC" "TRBC1" "TRBC2" "CCL4" "CCL5"
If anyone encountered this before and can tell me what went wrong that will be great.
b) I imagine missing a few genes will make a different to the ssGSEA result and some of my gene sets are pretty small (only have 8 genes). How does GSVA deal with a missing gene in the gene set? So I need to remove the gene from my GMT file before proceeding?
- My last question is on the minimum gene set size. As some of my gene set are very small (<10 genes) and the default is set as 10, will that affect the reliability of the ssGSEA analysis?
FYI I have input what I have at the moment (deseq normalised counts) and no errors were given from GSVA. But I would like to know how reliable is the results given I've had the issues above. Thank you!
gsva_results <- gsva( deseq_norm_counts, my_gmt, method = "ssgsea", # kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, kcdf = "Poisson" on integer counts. kcdf = "Poisson", # Minimum gene set size min.sz = 1, # Maximum gene set size max.sz = 500, # Compute Gaussian-distributed scores mx.diff = TRUE, # Don't print out the progress bar verbose = T)