Question: compare gene count profiles between samples
4.7 years ago by
Abdullah100 wrote:


I have a set of RNA-Seq data from cancer samples, which I analyzed to generate RPKM and TPM (calculated from RPKM) values for all genes. Additionally, I downloaded a large set of public cancer samples (multiple cancer types) to use as comparison data, and generated RPKM and TPM values for them in the same way.
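For readers following along, the RPKM-to-TPM step mentioned above amounts to rescaling each sample so its values sum to one million. A minimal sketch, assuming one sample is held as a gene-to-RPKM dictionary (the function name, gene names, and numbers are made up for illustration):

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm.values())
    if total == 0:
        raise ValueError("all RPKM values are zero; cannot rescale")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


# Hypothetical RPKM values for a single sample (not real data).
sample_rpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

Because every sample ends up with the same total, TPM values are easier to compare across samples than raw RPKM, although this does nothing about batch effects between datasets.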

What I want to do is find the public sample that best matches each of my samples (cancer-type-wise) by comparing RPKM/TPM profiles across a set of N genes of interest (or genes known to be expressed in each cancer type). I read that using RPKM is a bad idea, so I switched to TPM for this task.

For each one of my samples, I take the TPM distribution of the N genes and compare it with the TPM distribution of each of the public samples and get a p-value (e.g., KS test). But this does not seem to be the best idea.
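The comparison described above can be sketched roughly as follows. This is only an illustration of the two-sample KS statistic (the largest gap between the two empirical CDFs), with made-up TPM values; it does not compute a p-value, for which something like `scipy.stats.ks_2samp` would normally be used.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical TPM values for the same four genes, in the same gene order.
my_tpm = [1.0, 2.0, 3.0, 4.0]
shuffled_tpm = [4.0, 3.0, 2.0, 1.0]    # same values, assigned to different genes
shifted_tpm = [10.0, 20.0, 30.0, 40.0]  # every gene ten times higher

d_shuffled = ks_statistic(my_tpm, shuffled_tpm)
d_shifted = ks_statistic(my_tpm, shifted_tpm)
```

One thing this makes visible: the statistic only sees the pooled values, not which gene each value belongs to, so a profile with the same TPMs on entirely different genes still gives a distance of zero.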

Can anyone point me to another type of test that could be useful in my case? Or could one, for example, build a model of the TPM values of all cancer samples of the same type and then compare my samples against each of these models?



Instead of using a KS or other test to find the closest sample, have you tried just computing the correlation and taking the sample with the highest one?

written 4.7 years ago by Devon Ryan (98k)

Thank you, Devon. I thought correlation was too simple, but I will give it a try. Do you suggest doing any kind of normalization or batch-effect correction, or performing the correlation directly on the TPMs?

written 4.7 years ago by Abdullah (100)

The whole thing is one giant batch effect, so you have no hope of getting around that without using something like RUVSeq (assuming you can find some genes expected not to be DE in cancer). Personally, I'd start with a rank-based correlation of the TPM values and see if anything is particularly close. This is sort of a worst-case scenario with human samples and cancer; I'm surprised whoever collected the patient samples didn't try to get neighboring tissue.
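A rank-based (Spearman) correlation of this kind could look roughly like the sketch below. It assumes every profile lists TPMs for the same genes in the same order; the sample names and values are invented, and the helper names are illustrative rather than taken from any library.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def best_match(query, references):
    """Return the reference name with the highest Spearman correlation, plus all scores."""
    scores = {name: spearman(query, profile) for name, profile in references.items()}
    return max(scores, key=scores.get), scores


# Hypothetical TPMs over the same five genes (not real data).
my_sample = [5.0, 120.0, 0.5, 40.0, 300.0]
public_samples = {
    "public_A": [8.0, 200.0, 1.0, 35.0, 500.0],
    "public_B": [400.0, 3.0, 90.0, 10.0, 0.2],
}
closest, scores = best_match(my_sample, public_samples)
```

Working on ranks means the result is unchanged by any monotone rescaling of a sample, which is part of the appeal when the datasets were processed differently. It does not remove gene-specific batch effects.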

As an aside, you could theoretically try to do signal separation of the (presumably heterogeneous) cancer samples and use the results as matching pairs (you may recall that Katarzyna presented a journal club paper related to that). I've always been highly skeptical of those methods applied to RNA-seq data, but perhaps the SVM-based ones are OK at that (as opposed to the ICA-based methods).

written 4.7 years ago by Devon Ryan (98k)

FYI, the correlation did not yield the expected results. On the other hand, I tried RUVSeq and it seems to reduce the variability between samples of the same cancer type quite nicely. I'm now trying to use PCA to see where my sample would fall in the PCA plot of each cancer type (or in a single PCA of all cancer types together).
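One way to make "where my sample falls" concrete is to fit the PCA on the public samples only, project the new sample into that space, and look at which cancer type's centroid it lands closest to. The sketch below is a dependency-free illustration under several assumptions: the inputs are already normalized, log-scale expression values (for example after batch correction), the labels and numbers are made up, and the components are found by simple power iteration rather than a library routine.

```python
import math


def fit_pca(rows, k=2, iters=500):
    """Fit PCA on rows (samples x genes); return gene means and k unit-length components."""
    n, g = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance between genes (g x g matrix).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(g)]
           for i in range(g)]
    components = []
    for _ in range(k):
        # Power iteration from a fixed, asymmetric unit start vector.
        v = [1.0 + 0.1 * i for i in range(g)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(g)) for i in range(g)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(g) for j in range(g))
        # Deflate: remove the found component so the next pass finds the next one.
        for i in range(g):
            for j in range(g):
                cov[i][j] -= eigenvalue * v[i] * v[j]
        components.append(v)
    return means, components


def project(row, means, components):
    """Coordinates of one sample in the fitted PCA space."""
    return [sum((x - m) * c for x, m, c in zip(row, means, comp)) for comp in components]


def nearest_type(query, rows, labels, k=2):
    """Project query into the PCA of rows and return the closest label centroid."""
    means, components = fit_pca(rows, k)
    scores = [project(r, means, components) for r in rows]
    q = project(query, means, components)
    dists = {}
    for label in set(labels):
        members = [s for s, l in zip(scores, labels) if l == label]
        centroid = [sum(c) / len(members) for c in zip(*members)]
        dists[label] = math.dist(q, centroid)
    return min(dists, key=dists.get), dists


# Hypothetical log-scale expression for three genes (not real data).
public_rows = [[8, 1, 4], [9, 2, 4], [8, 2, 5], [1, 8, 4], [2, 9, 5], [1, 9, 4]]
public_labels = ["type_A", "type_A", "type_A", "type_B", "type_B", "type_B"]
my_sample = [9, 1, 5]
label, dists = nearest_type(my_sample, public_rows, public_labels)
```

Fitting on the public samples alone keeps the new sample from influencing the axes it is then placed on. With real data one would more likely use an established PCA implementation and inspect the plot rather than rely on a single nearest centroid.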

I remember Katarzyna's talk, but I don't remember in detail how she addressed this problem.

written 4.6 years ago by Abdullah (100)

I'd have to look up the paper, but it was using some sort of SVM-based method to do signal separation. You might find some relevant methods if you google "SVM signal separation RNAseq".

written 4.6 years ago by Devon Ryan (98k)
4.6 years ago by
United States
Ar1.0k wrote:

A couple of questions before we decide which statistical test would be relevant:

  1. Does the downloaded dataset have conditions similar to your cancer dataset, i.e. are all your samples primary tumors, or do you have some mets? And does the downloaded sample set have the same proportions of primary and met samples as yours?
  2. Is your dataset from before or after treatment? If you downloaded the TCGA samples, you should know that these are treated samples.

I think the KS test is good if you want to check whether the samples come from the same population. Now, if you have a combination of primary and met samples in your group 1, the gene expression profiles will be very different, and it is not ideal to compare them together. That is probably why you are not getting significant p-values.

A few recommendations:

  1. Use a PCA or MDS plot to see which samples cluster together. Check which genes drive the clustering and distinguish the clusters. Beware of batch effects, i.e. samples clustering by batch (your dataset vs. the public one).
  2. Label the samples as primary or met, and then compare the samples using both the KS and Mann-Whitney tests.