Question

Sample clustering from TCGA dataset

6.5 years ago

bharata1803 ▴ 560

Hello all,

I would like your opinion on a problem. I have around 1000 genes of interest, all of them transcription factors. I want to look at their expression profiles and run a differential expression analysis between cancer and normal samples. I downloaded the whole TCGA BRCA dataset from GDC and got more than 1000 samples: around 300 normal, the rest cancer.

After running the differential expression analysis with DESeq2, only a few genes from my list of genes of interest came out as differentially expressed, which seems strange to me.

Then I tried subsetting the dataset, using only 10 normal and 10 cancer samples chosen at random from the samples I had downloaded. I ran DESeq2 again and the result looked much more normal, with many differentially expressed genes.

My questions are:

1- Why does using many samples give such a strange result (only a few genes are differentially expressed)? Does this mean there are "subtypes" within those 1000+ BRCA samples? My hypothesis is that, because the variance across samples is huge, DESeq2 cannot call differentially expressed genes accurately.
2- If I want to select samples with similar gene expression profiles from these 1000+ samples, what is the best method? I know of k-means clustering and hierarchical clustering.

RNA-Seq deseq2 • 2.3k views

ADD COMMENT • link updated 6.5 years ago by Jean-Karim Heriche 27k • written 6.5 years ago by bharata1803 ▴ 560

score 0 · Answer 1 · 2017-10-26

6.5 years ago

Jean-Karim Heriche 27k

1- If you were to run the analysis multiple times with small random samples, I think you would get different results each time. You don't say what your threshold is for calling a gene significantly differentially expressed. A small sample size means low statistical power, which makes the results unstable: real differences are missed more often, and a larger share of the genes that do pass the threshold tend to be false positives.
2- You need to choose a similarity/distance measure that is not subject to the distance concentration phenomenon (for noisy data, some distance/similarity measures tend towards a constant as the number of dimensions increases), or apply a dimensionality reduction method first. Which clustering algorithm to choose depends on the assumptions you can make about the clusters, e.g. k-means assumes roughly spherical clusters. Hierarchical clustering is usually good for getting an impression of what clusters with what, but it can be difficult to find a good tree-cutting strategy.
Also, if the data come from microarrays, you may want to use limma instead of DESeq2.
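
The distance concentration effect mentioned in point 2 can be seen in a small sketch. It uses uniform random points rather than expression data, and the helper name relative_contrast is made up for this illustration. It measures how much the farthest and nearest neighbours of a query point differ, relative to the nearest one, as the number of dimensions grows.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """Return (max - min) / min of the Euclidean distances from one
    random query point to n_points other random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy data only: uniform noise, not real expression profiles.
contrast_by_dim = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

The contrast shrinks as dimensions are added, which is why all samples can start to look about equally far apart when Euclidean distance is computed over thousands of noisy genes.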

ADD COMMENT • link 6.5 years ago by Jean-Karim Heriche 27k

I use a threshold of |0.65| for calling genes significantly up- or down-regulated. I agree that a larger number of samples gives a stronger statistical result, but in this case I am wondering whether variation in the tumor gene expression profiles keeps the analysis from giving a strong result for some genes. That's why I think clustering the samples first may be more useful, so that I can use a smaller group of samples with the most similar expression profiles. A smaller group also makes the analysis quicker.
The TCGA data are from RNA-seq. The data I downloaded are htseq-count results. Maybe I will try an R clustering package and see what comes out, but first I will probably look at the data with standard methods such as PCA or a heatmap.
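
The concern about small random subsets giving different results each time can be checked with a rough simulation. This is not DESeq2: it uses Gaussian toy data and calls a gene whenever the absolute difference in group means exceeds a cutoff. Treating 0.65 as a cutoff on an estimated mean difference is an assumption, since the thread does not say which quantity the threshold applies to. All other constants are illustrative as well.

```python
import random
import statistics
from itertools import combinations

N_GENES, N_TRUE, N_PER_GROUP = 200, 20, 300
TRUE_SHIFT = 1.0   # assumed true difference for the first N_TRUE genes
CUTOFF = 0.65      # assumed to act on the estimated mean difference
SUBSET_SIZE, N_SUBSETS = 10, 5

rng = random.Random(1)  # fixed seed so the run is repeatable

# normal[g] and tumor[g] hold toy log-scale values for gene g, one per sample.
normal = [[rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
          for _ in range(N_GENES)]
tumor = [[rng.gauss(TRUE_SHIFT if g < N_TRUE else 0.0, 1.0)
          for _ in range(N_PER_GROUP)]
         for g in range(N_GENES)]

def called_genes(normal_idx, tumor_idx):
    """Genes whose absolute difference in group means exceeds CUTOFF."""
    hits = set()
    for g in range(N_GENES):
        diff = (statistics.fmean(tumor[g][i] for i in tumor_idx)
                - statistics.fmean(normal[g][i] for i in normal_idx))
        if abs(diff) > CUTOFF:
            hits.add(g)
    return hits

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0

all_idx = range(N_PER_GROUP)
full_calls = called_genes(all_idx, all_idx)

# Repeat the "pick 10 per group at random" experiment several times.
subset_calls = [
    called_genes(rng.sample(all_idx, SUBSET_SIZE),
                 rng.sample(all_idx, SUBSET_SIZE))
    for _ in range(N_SUBSETS)
]
overlaps = [jaccard(a, b) for a, b in combinations(subset_calls, 2)]

print("genes called with all samples:", len(full_calls))
print("genes called per small subset:", [len(s) for s in subset_calls])
print("mean pairwise Jaccard overlap:", round(statistics.fmean(overlaps), 2))
```

In this toy setup the small subsets typically call more genes than the full data set, and the lists from different subsets overlap only partly, because mean differences estimated from 10 samples per group are noisy.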

ADD REPLY • link 6.5 years ago by bharata1803 ▴ 560

Visualizing the data is always a good idea. Another thing you could do is filter out some genes, for example those that can be considered as not expressed.
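
A minimal sketch of that kind of filter, assuming raw counts are held in a dictionary from gene name to per-sample counts. The function name, the gene names and both thresholds are illustrative assumptions, not values from this thread.

```python
def filter_low_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples.

    counts maps gene name -> list of raw read counts, one per sample.
    Both thresholds are illustrative assumptions.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_count for v in values) >= min_samples
    }

# Hypothetical toy counts for three genes across four samples.
toy_counts = {
    "TF_A": [0, 1, 0, 2],
    "TF_B": [15, 40, 8, 22],
    "TF_C": [0, 0, 12, 0],
}
kept = filter_low_counts(toy_counts)
print(sorted(kept))  # → ['TF_B']
```

Requiring a minimum count in several samples, rather than a minimum total, avoids keeping a gene because of a single outlier sample.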