Question: compare gene count profiles between samples
20 months ago by
Abdullah90
Germany
Abdullah90 wrote:

Hi,

I have RNA-Seq data from a set of cancer samples, which I analyzed to generate RPKM and TPM (calculated from RPKM) values for all genes. Additionally, I downloaded a large set of public cancer samples (multiple cancer types) to use as comparison data, and generated RPKM and TPM for them in the same way.
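For reference, the RPKM-to-TPM conversion mentioned above amounts to rescaling each sample so that its values sum to one million. A minimal sketch, with a hypothetical function name and made-up numbers; note that the sum has to run over all genes in the sample, not only the N genes of interest:

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm)
    if total <= 0:
        raise ValueError("RPKM values must have a positive sum")
    return [value / total * 1e6 for value in rpkm]


# Toy RPKM values for a single sample (made-up numbers, not real data).
sample_rpkm = [10.0, 30.0, 60.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

Because every sample ends up with the same total, TPM values are easier to compare between samples than raw RPKM, although this does nothing about batch effects.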

What I want to do is find the public sample that best matches each of my samples (in terms of cancer type) by comparing RPKM/TPM profiles across a set of N genes of interest (or genes known to be expressed in each cancer type). I read that using RPKM is a bad idea, so I switched to TPM for this task.

For each of my samples, I take the TPM distribution of the N genes, compare it with the TPM distribution of each public sample, and get a p-value (e.g., from a KS test). But this does not seem to be the best idea.

Can anyone point me to other tests that could be useful in my case? Or could one, for example, build a model of the TPMs of all cancer samples of the same type and then compare my samples against each of these models?

Thanks.

rna-seq • 886 views
modified 19 months ago by Ar710 • written 20 months ago by Abdullah90

Instead of using a KS or other test to find the closest sample, have you tried just computing the correlation and taking the sample with the highest one?

written 20 months ago by Devon Ryan74k

Thank you, Devon. I thought correlation was too simple, but I will give it a try. Do you suggest doing any kind of normalization or batch-effect correction first, or performing the correlation directly on the TPMs?

written 20 months ago by Abdullah90

The whole thing is one giant batch effect, so you have no hope of getting around that without using something like RUVSeq (assuming you can find some genes expected not to be DE in cancer). Personally, I'd start with a rank-based correlation of the TPM values and see if anything is particularly close. This is sort of a worst-case scenario with human samples and cancer; I'm surprised whoever collected the patient samples didn't try to get neighboring tissue.
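A rank-based (Spearman) correlation over the N genes could look roughly like the sketch below. It uses only the standard library; the function names, sample names, and TPM values are all made up for illustration, and in practice something like `scipy.stats.spearmanr` would do the same job.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def closest_sample(query, references):
    """Return the best-correlated reference name and all scores."""
    scores = {name: spearman(query, profile) for name, profile in references.items()}
    return max(scores, key=scores.get), scores


# Toy TPM values over the same five genes, in the same order (made-up data).
my_sample = [5.0, 120.0, 0.5, 40.0, 300.0]
public_samples = {
    "ref_A": [8.0, 90.0, 1.0, 60.0, 250.0],
    "ref_B": [200.0, 3.0, 80.0, 10.0, 1.0],
}
best_match, all_scores = closest_sample(my_sample, public_samples)
```

Working on ranks means the result does not change under any monotonic rescaling of the TPM values (log transforms included), which is part of why it is a reasonable first thing to try here.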

As an aside, you could theoretically try to do signal separation of the (presumably heterogeneous) cancer samples and use the results as matching pairs (you may recall that Katarzyna presented a journal club paper related to that). I've always been highly skeptical of those methods applied to RNAseq data, but perhaps the SVM-based ones are OK at that (as opposed to the ICA-based methods).

written 20 months ago by Devon Ryan74k

FYI, the correlation did not yield the expected results. On the other hand, I tried RUVSeq and it seems to reduce the variability between samples of the same cancer type quite nicely. I'm now trying to use PCA to see where my sample falls in the PCA plot of each cancer type (or in a PCA of all cancer types together).
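The "where does my sample fall" step can be sketched as: fit the principal axes on the reference samples only, then project the new sample onto those axes using the reference gene means. The code below is a toy, pure-Python illustration (power iteration on a handful of genes); the helper names and TPM values are made up, batch correction is not shown, and for real data a numerical library such as scikit-learn's `PCA` (`fit` on the references, `transform` on the new sample) would be the practical choice.

```python
import math


def log_tpm(profile):
    """log2(TPM + 1) keeps a few highly expressed genes from dominating."""
    return [math.log2(value + 1.0) for value in profile]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def principal_axes(centered_rows, n_components=2, n_iter=200):
    """Leading principal axes via power iteration with deflation."""
    rows = [list(row) for row in centered_rows]
    n_genes = len(rows[0])
    axes = []
    for _component in range(n_components):
        # Fixed, non-uniform start vector so the result is reproducible.
        axis = [float(i + 1) for i in range(n_genes)]
        for _step in range(n_iter):
            scores = [dot(row, axis) for row in rows]
            # Multiply by X^T X without forming the gene-by-gene matrix.
            axis = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(n_genes)]
            norm = math.sqrt(dot(axis, axis))
            if norm == 0.0:
                return axes  # no variance left to explain
            axis = [value / norm for value in axis]
        axes.append(axis)
        # Deflate: remove the variance already captured by this axis.
        rows = [[v - dot(row, axis) * a for v, a in zip(row, axis)] for row in rows]
    return axes


def fit_reference(reference_tpm, n_components=2):
    """Gene means and principal axes computed from the reference samples only."""
    logged = [log_tpm(profile) for profile in reference_tpm.values()]
    means = [sum(column) / len(logged) for column in zip(*logged)]
    centered = [[v - m for v, m in zip(row, means)] for row in logged]
    return means, principal_axes(centered, n_components)


def project(profile_tpm, means, axes):
    """Coordinates of one sample in the reference PCA space."""
    centered = [v - m for v, m in zip(log_tpm(profile_tpm), means)]
    return [dot(centered, axis) for axis in axes]


# Toy reference: two samples each of two hypothetical cancer types, three genes.
reference = {
    "A1": [100.0, 5.0, 50.0],
    "A2": [120.0, 4.0, 55.0],
    "B1": [5.0, 100.0, 50.0],
    "B2": [4.0, 120.0, 45.0],
}
means, axes = fit_reference(reference)
reference_scores = {name: project(profile, means, axes) for name, profile in reference.items()}
query_scores = project([110.0, 6.0, 52.0], means, axes)
```

Fitting on the references alone and projecting afterwards keeps the new sample from influencing the axes; the sign of each axis is arbitrary, so only relative positions are meaningful.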

I remember Katarzyna's talk, but I don't remember in detail how she addressed this problem.

written 19 months ago by Abdullah90

I'd have to look up the paper, but it was using some sort of SVM-based method to do signal separation. You might find some relevant methods if you google "SVM signal separation RNAseq".

written 19 months ago by Devon Ryan74k
19 months ago by
Ar710
United States
Ar710 wrote:

A couple of questions before we decide which statistical test would be relevant:

  1. Does the downloaded dataset have similar conditions to your cancer dataset, i.e. are all your samples primary tumors, or are some of them metastatic? And does the downloaded sample set have the same proportions of primary and metastatic samples as yours?
  2. Was your dataset collected before or after treatment? If you downloaded the TCGA samples, then you should know that these are treated samples.

I think the KS test is good if you want to check whether the samples come from the same population. Now, if your group 1 is a mix of primary and metastatic samples, then the gene expression profiles will be very different and it is not ideal to compare them together. That is probably why you are not getting significant p-values.

A few recommendations:

  1. Use a PCA or MDS plot and see which samples cluster together. Check which genes drive the clustering and distinguish the clusters. Beware of batch effects, i.e. samples clustering by batch (your dataset vs. the public one).
  2. Label the samples as primary or metastatic and then compare them using both KS and Mann-Whitney tests.
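One caveat on the KS comparison discussed in this thread: the two-sample KS statistic only looks at the pooled distribution of values, not at which gene carries which value. The toy sketch below (made-up numbers, hypothetical function name) shows a profile with its values shuffled across genes scoring as a perfect match, while a profile with every gene simply doubled does not.

```python
def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    largest_gap = 0.0
    while i < len(xs) and j < len(ys):
        value = min(xs[i], ys[j])
        # Advance both empirical CDFs past the current value.
        while i < len(xs) and xs[i] <= value:
            i += 1
        while j < len(ys) and ys[j] <= value:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(xs) - j / len(ys)))
    return largest_gap


# TPM values for four genes, listed in the same gene order (toy data).
profile = [1.0, 50.0, 200.0, 8.0]
shuffled_genes = [200.0, 8.0, 1.0, 50.0]  # same values, assigned to different genes
doubled = [2.0, 100.0, 400.0, 16.0]       # same gene ordering, every value doubled

ks_shuffled = ks_statistic(profile, shuffled_genes)
ks_doubled = ks_statistic(profile, doubled)
```

Here `ks_shuffled` is 0 even though no gene has the same expression in both profiles, and `ks_doubled` is 0.25 even though the gene ranking is identical. For matching samples gene by gene, a paired measure such as a rank correlation is closer to what is wanted.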
modified 19 months ago • written 19 months ago by Ar710