I would like to ask for your opinion. First, let me explain my problem. I have around 1000 genes of interest, all of which are transcription factors. I want to look at their expression profiles and run a differential expression analysis between cancer and normal samples. I downloaded the full TCGA BRCA dataset from GDC and got more than 1000 samples: around 300 are normal and the rest are cancer.
After running the differential expression analysis with DESeq2, I found only a few differentially expressed genes from my list of genes of interest, which seems strange to me.
Then I tried subsetting the dataset. I used only 10 normal and 10 cancer samples, chosen at random from the samples I had downloaded. I ran DESeq2 again, and this time the result looked normal, with many differentially expressed genes.
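The random subsetting described above can be sketched as follows. This is only an illustration in Python, not the code actually used: the sample IDs, the seed values and the `subsample` helper are all assumptions made for the example.

```python
import random

def subsample(sample_ids, n, seed=0):
    """Draw n sample IDs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(sample_ids, n))

# Hypothetical placeholder IDs, not real TCGA barcodes.
normal_ids = [f"normal_{i:03d}" for i in range(300)]
tumor_ids = [f"tumor_{i:03d}" for i in range(800)]

# 10 normal and 10 cancer samples, as in the subset described above.
normal_subset = subsample(normal_ids, 10, seed=1)
tumor_subset = subsample(tumor_ids, 10, seed=2)

print(normal_subset)
print(tumor_subset)
```

Fixing the seed makes one particular draw reproducible. Repeating the draw with several different seeds would show how much the number of differentially expressed genes depends on which 20 samples happen to be picked.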
My questions are:
Why does using many samples give such a strange result (only a few genes are differentially expressed)? Does this mean there are "subtypes" within those 1000+ BRCA samples? My hypothesis is that, because the variance between samples is huge, DESeq2 cannot identify differentially expressed genes accurately.
If I want to choose samples with similar gene expression profiles from these 1000+ samples, what is the best method? I know of k-means clustering and hierarchical clustering; would either of these be appropriate?
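To make the clustering idea concrete, here is a minimal sketch of k-means on a toy samples-by-genes matrix. Everything in it is an assumption for illustration: the `kmeans` helper, the sample names and the values, which are assumed to already be on a log-like scale. It is plain Python with a deterministic farthest-point start so that it is self-contained; for real data a library implementation would be the usual choice.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    # Start from the first point, then keep adding the point that is
    # farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = None
    for _ in range(iters):
        # Assign every sample to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centre to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: rows are samples, columns are genes.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
log_expr = [
    [1.0, 1.2, 0.9],
    [1.1, 1.0, 1.0],
    [0.9, 1.1, 1.1],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [4.9, 5.0, 5.1],
]

labels = kmeans(log_expr, k=2)
groups = {}
for name, lab in zip(samples, labels):
    groups.setdefault(lab, []).append(name)
print(groups)
```

Samples that end up with the same label have similar profiles across the chosen genes, so one cluster could serve as a more homogeneous subset. The number of clusters `k` has to be chosen up front, which is one reason hierarchical clustering is sometimes preferred for exploring the data first.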