Sample clustering from TCGA dataset
4.2 years ago
bharata1803 ▴ 530

Hello all,

I would like your opinion on the following problem. I have about 1,000 genes of interest, all of them transcription factors. I want to look at their expression profiles and run a differential expression analysis between cancer and normal samples. I downloaded the full TCGA BRCA dataset from GDC and got more than 1,000 samples: around 300 normal, and the rest cancer.

After running the differential expression analysis with DESeq2, I found only a few differentially expressed genes from my list of genes of interest, which seems odd to me.

I then tried subsetting the dataset, using only 10 normal and 10 cancer samples chosen at random from the samples I had downloaded. When I ran DESeq2 again, the result looked much more normal, with many differentially expressed genes.

My questions are:

  1. Why does using many samples give such an odd result (only a few genes are differentially expressed)? Does this mean there are "subtypes" within those 1000+ BRCA samples? My hypothesis is that, because the variance between samples is huge, DESeq2 cannot call differentially expressed genes accurately.

  2. If I want to choose samples with similar gene expression profiles from these 1000+ samples, what is the best method? I know of k-means clustering and hierarchical clustering.

RNA-Seq deseq2 • 1.6k views
4.2 years ago

1- If you were to run the analysis multiple times with small random samples, I think you would get different results each time. You don't say what threshold you use for calling a gene significantly differentially expressed. A small sample size means low statistical power, so the results are unstable and a larger share of the genes that do pass the threshold are likely to be false positives.
2- You need to choose a similarity/distance measure that is not subject to the distance concentration phenomenon (for noisy data, some distance/similarity measures tend towards a constant as the number of dimensions increases), or use a dimensionality reduction method first. Which clustering algorithm to choose depends on the assumptions you can make about the clusters; for example, k-means assumes the clusters are spherical. Hierarchical clustering is usually good for getting an impression of what clusters with what, but it can be difficult to find a good tree-cutting strategy.
Also if the data comes from microarrays, you may want to use limma instead of DESeq2.
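
To make point 1 concrete, here is a minimal simulation sketch, not a DESeq2 run. It assumes Gaussian log-scale expression values with unit variance, genes with no true difference between groups, and a hypothetical cutoff of 0.65 applied to the estimated mean difference. The function name `count_passing_genes` and all parameter values are made up for illustration.

```python
import random
import statistics

def count_passing_genes(n_per_group, n_genes=500, cutoff=0.65, rng=None):
    """Count null genes whose estimated group difference exceeds the cutoff.

    Every gene is simulated with NO true difference between the groups, so
    any gene that passes the cutoff does so purely through sampling noise.
    """
    rng = rng or random.Random()
    passing = 0
    for _ in range(n_genes):
        normal = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        tumor = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(tumor) - statistics.fmean(normal)
        if abs(diff) > cutoff:
            passing += 1
    return passing

rng = random.Random(42)  # fixed seed so the sketch is repeatable
small_runs = [count_passing_genes(10, rng=rng) for _ in range(5)]
large_run = count_passing_genes(300, rng=rng)

print("null genes passing with 10 vs 10 samples, five random draws:", small_runs)
print("null genes passing with 300 vs 300 samples:", large_run)
```

Under these assumptions the small draws let a noticeable number of null genes through, and the number changes from draw to draw, while the large groups let almost none through. Real count data and DESeq2's shrinkage and testing behave differently in detail, so treat this only as an illustration of the sampling-noise argument.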

  1. I use a threshold of |0.65| for calling genes significantly up- or down-regulated. I agree that a larger number of samples gives a statistically stronger result, but in this case I am wondering whether variation in the tumor gene expression profiles keeps the analysis from giving a strong result for some genes. That is why I think clustering the samples first might be more useful, so that I can use smaller sample groups with the most similar expression profiles. Smaller groups would also make the analysis quicker.

  2. The TCGA data are from RNA-seq. The data I downloaded are htseq-count output. I may try an R clustering package and see what it gives, but first I will probably look at the data with standard methods such as PCA or a heatmap.
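
As a rough sketch of the numbers behind a sample-to-sample heatmap, the snippet below log-transforms raw counts and computes a correlation distance (1 minus Pearson r) for every pair of samples. The toy counts, the sample names, and the helper names `log_transform`, `pearson`, and `correlation_distances` are all hypothetical; in practice an R package would probably handle this step, and the choice of log2(count + 1) and Pearson correlation is only one reasonable assumption.

```python
import math

def log_transform(counts):
    """log2(count + 1), to damp the influence of very highly expressed genes."""
    return [math.log2(c + 1) for c in counts]

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distances(samples):
    """Return {(name_a, name_b): 1 - Pearson r} for every pair of samples."""
    logged = {name: log_transform(c) for name, c in samples.items()}
    names = sorted(logged)
    return {
        (a, b): 1.0 - pearson(logged[a], logged[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical toy counts: one list per sample, same gene order in each.
toy_samples = {
    "normal_1": [10, 200, 35, 0, 800],
    "normal_2": [12, 180, 40, 1, 750],
    "tumor_1": [300, 20, 5, 90, 60],
    "tumor_2": [280, 25, 8, 110, 55],
}

distances = correlation_distances(toy_samples)
for pair, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

Pairs with a small distance have similar profiles. A matrix like this can be drawn as a heatmap or passed to a hierarchical clustering routine to see whether the tumor samples split into groups.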


Visualizing the data is always a good idea. Another thing you could do is filter out some genes, for example those that can be considered as not expressed.
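
A minimal sketch of such a filter, assuming raw counts stored as one list per gene. The cutoffs (`min_count=10`, `min_samples=2`), the gene names, and the function name `filter_expressed` are hypothetical and would need tuning for a dataset with 1000+ samples.

```python
def filter_expressed(counts_by_gene, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples."""
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        if sum(c >= min_count for c in counts) >= min_samples
    }

# Hypothetical toy counts: one list of per-sample read counts per gene.
toy_counts = {
    "TF_A": [0, 1, 0, 2],        # essentially not expressed
    "TF_B": [50, 75, 120, 90],   # expressed in every sample
    "TF_C": [0, 0, 15, 30],      # expressed in two samples
    "TF_D": [3, 12, 4, 1],       # reaches the cutoff in only one sample
}

kept = filter_expressed(toy_counts)
print(sorted(kept))  # ['TF_B', 'TF_C']
```

Dropping genes like `TF_A` before testing reduces the number of tests without removing genes that could plausibly be called differentially expressed.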


Thank you for your suggestion.
