Question

compare gene count profiles between samples

0


7.9 years ago

Abdullah ▴ 100

Hi,

I have a set of RNA-Seq cancer samples that I analyzed to generate RPKM and TPM values (TPM calculated from RPKM) for all genes. Additionally, I downloaded a large set of public cancer samples (multiple cancer types) to use as comparison data, and generated RPKM and TPM values for them in the same way.
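For readers following along, the RPKM-to-TPM step mentioned above is just a per-sample rescaling so that the values sum to one million. A minimal sketch, assuming each sample is held as a dict of gene name to RPKM (the function name `rpkm_to_tpm` and the gene names are made up for illustration):

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm.values())
    if total == 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


# Hypothetical three-gene sample, for illustration only.
sample_rpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

The conversion has to be done over all genes of a sample, not only over the N genes of interest, otherwise the values are no longer comparable between samples.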

What I want to do is find the public sample that best matches each of my samples (cancer type-wise) by comparing RPKM/TPM profiles across a set of N genes of interest (or genes known to be expressed in each cancer type). I read that using RPKM is a bad idea, so I switched to TPM for this task.

For each of my samples, I take the TPM distribution of the N genes, compare it with the TPM distribution of each public sample, and get a p-value (e.g., from a KS test). But this does not seem to be the best idea.

Can anyone point me to another type of test that would be useful in my case? Or could one, for example, build a TPM model from all cancer samples of the same type and then compare my samples against each of these models?

Thanks.

RNA-Seq • 3.5k views

ADD COMMENT • link updated 7.9 years ago by Ar ★ 1.1k • written 7.9 years ago by Abdullah ▴ 100

0


Instead of using a KS or other test to find the closest sample, have you tried just computing the correlation and taking the sample with the highest one?

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

0


Thank you, Devon. I thought correlation was too simple, but I will give it a try. Do you suggest doing any kind of normalization or batch effect correction, or should I compute the correlation directly on the TPMs?

ADD REPLY • link 7.9 years ago by Abdullah ▴ 100

0


The whole thing is one giant batch effect, so you have no hope of getting around that without using something like RUVSeq (assuming you can find some genes expected not to be DE in cancer). Personally, I'd start with a rank-based correlation of the TPM values and see if anything is particularly close. This is sort of a worst-case scenario with human samples and cancer; I'm surprised whoever collected the patient samples didn't try to get neighboring tissue.
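A rough sketch of the rank-based (Spearman) correlation idea, assuming every profile is a list of TPM values over the same N genes in the same order. The helper names and the sample names `public_1` and `public_2` are hypothetical, and the numbers are toy data:

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def best_match(query, references):
    """Return the reference sample with the highest rank correlation."""
    scores = {name: spearman(query, prof) for name, prof in references.items()}
    return max(scores, key=scores.get), scores


# Toy TPM profiles over the same four genes (illustrative values only).
query = [5.0, 120.0, 0.5, 40.0]
references = {
    "public_1": [7.0, 200.0, 1.0, 30.0],
    "public_2": [90.0, 2.0, 60.0, 1.0],
}
best, scores = best_match(query, references)
```

Because only the ordering of genes within a sample matters, this is insensitive to any monotone rescaling of the TPMs, but it will not remove gene-specific batch effects.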

As an aside, you could theoretically try to do signal separation of the (presumably heterogeneous) cancer samples and use the results as matching pairs (you may recall that Katarzyna presented a journal club paper related to that). I've always been highly skeptical of those methods applied to RNAseq data, but perhaps the SVM-based ones are OK at that (as opposed to the ICA-based methods).

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

1


FYI, the correlation did not yield the expected results. On the other hand, I tried RUVSeq and it seems to reduce the variability between samples of the same cancer type quite nicely. I'm now trying to use PCA to see where my sample falls in the PCA plot of each cancer type (or in a single plot where all cancer types are PCA'd together).
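For anyone wanting to try the same thing, here is a minimal sketch of projecting a new sample into a PCA space fitted on the reference samples only. It assumes a samples x genes matrix of log2(TPM + 1) values that has already been batch-corrected; the data below are simulated, and the labels "A" and "B" stand in for two cancer types:

```python
import numpy as np


def fit_pca(reference, n_components=2):
    """Fit PCA on a samples x genes matrix; return gene means and axes."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centered reference matrix.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, axes):
    """Project samples into the PCA space defined by the reference."""
    return (samples - mean) @ axes.T


# Simulated reference data: two cancer types, five samples each, four genes.
rng = np.random.default_rng(0)
type_a = rng.normal(loc=[8, 1, 1, 8], scale=0.3, size=(5, 4))
type_b = rng.normal(loc=[1, 8, 8, 1], scale=0.3, size=(5, 4))
reference = np.vstack([type_a, type_b])
labels = np.array(["A"] * 5 + ["B"] * 5)

mean, axes = fit_pca(reference)
ref_scores = project(reference, mean, axes)
query_scores = project(np.array([[7.5, 1.2, 0.9, 8.3]]), mean, axes)

# Assign the query to the cancer type with the nearest centroid in PCA space.
centroids = {t: ref_scores[labels == t].mean(axis=0) for t in ("A", "B")}
nearest = min(centroids,
              key=lambda t: np.linalg.norm(query_scores[0] - centroids[t]))
```

Fitting on the reference alone and then projecting keeps the query from influencing the axes. Nearest centroid is only one simple way to read the plot; it is a choice made for this sketch, not something prescribed in the thread.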

I remember Katarzyna's talk, but I don't remember in detail how she addressed this problem.

ADD REPLY • link 7.9 years ago by Abdullah ▴ 100

1


I'd have to look up the paper, but it was using some sort of SVM-based method to do signal separation. You might find some relevant methods if you google "SVM signal separation RNAseq".

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

score 0 · Answer 1 · 2016-06-12

A couple of questions before we decide which statistical test would be relevant:

Does the downloaded dataset have similar conditions to your cancer dataset, i.e. are all your samples primary tumors, or are some of them metastatic (met)? And does the downloaded sample set have the same proportions of primary and met samples as yours?
Was your dataset collected before or after treatment? If you downloaded TCGA samples, check the clinical annotation for their treatment status (most TCGA tumors were collected as untreated primaries) rather than assuming it matches yours.

I think the KS test is fine if you want to check whether the samples come from the same population. However, if your group 1 is a combination of primary and met samples, their gene expression profiles will be very different and it is not ideal to compare them together. That is probably why you are not getting significant p-values.

A few recommendations:

Use a PCA or MDS plot to see which samples cluster together. Check which genes drive the clustering and distinguish the clusters. Beware of batch effects, i.e. samples clustering by batch (your dataset vs. the public one) rather than by biology.
Label the samples as primary or met and then compare the samples within each label using both the KS and Mann-Whitney tests.
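
As a rough illustration of the last recommendation, here is how the two tests could be run with SciPy's `ks_2samp` and `mannwhitneyu`. The wrapper `compare_distributions` is a made-up helper, and the values are toy log-expression numbers for a primary and a met sample, not real data:

```python
from scipy.stats import ks_2samp, mannwhitneyu


def compare_distributions(a, b):
    """Run a two-sample KS test and a two-sided Mann-Whitney U test."""
    ks = ks_2samp(a, b)
    mw = mannwhitneyu(a, b, alternative="two-sided")
    return {
        "ks_stat": ks.statistic,
        "ks_p": ks.pvalue,
        "mw_stat": mw.statistic,
        "mw_p": mw.pvalue,
    }


# Toy expression values over the genes of interest (illustrative only).
primary = [2.1, 3.4, 2.8, 3.0, 2.5, 3.1]
met = [5.9, 6.3, 5.1, 6.8, 5.5, 6.0]
result = compare_distributions(primary, met)
```

Keep in mind that both tests are unpaired: they compare the overall distribution of values and ignore which gene each value belongs to, so two samples can look identical to them while having completely different genes switched on.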