I am quite new to RNA-seq data analysis (and to statistics), and I would like to ask for help, suggestions, and personal experience with the GTEx dataset. My apologies in advance if this question is redundant; I read several forum threads but did not find satisfactory explanations or suggestions.
For my work, I am planning to use the publicly available GTEx data from human tissues. I would like to compare RNA-seq data from certain tissues between healthy and diseased samples. My aim is to find differentially expressed genes (DEGs) and to make additional comparisons later on. So far, I have done the following:
- I downloaded the latest GTEx raw read count and TPM matrices.
- From the raw read count matrix, I extracted the columns that match my criteria (selected tissues, sex, disease, etc.).
- As a pilot run, I selected 5 healthy (HE) and 5 diseased (DIS) samples, preferring those with the highest RNA quality (RIN) values, and ran a DESeq2 analysis.
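The selection steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the exact script used: the column names `SAMPID`, `SMTSD` (tissue) and `SMRIN` (RIN) are assumed to follow the GTEx sample attributes file, while the `STATUS` column, the sample IDs, the counts, and the helper names `pick_top_rin` and `subset_counts` are hypothetical.

```python
# Toy annotation table. SAMPID / SMTSD / SMRIN are assumed GTEx-style column
# names; STATUS (HE = healthy, DIS = diseased) is a made-up column.
ANNOTATION = [
    {"SAMPID": "S1", "SMTSD": "Liver", "STATUS": "HE", "SMRIN": "7.1"},
    {"SAMPID": "S2", "SMTSD": "Liver", "STATUS": "HE", "SMRIN": "8.4"},
    {"SAMPID": "S3", "SMTSD": "Liver", "STATUS": "DIS", "SMRIN": "6.9"},
    {"SAMPID": "S4", "SMTSD": "Liver", "STATUS": "DIS", "SMRIN": "7.8"},
    {"SAMPID": "S5", "SMTSD": "Lung", "STATUS": "HE", "SMRIN": "9.0"},
    {"SAMPID": "S6", "SMTSD": "Liver", "STATUS": "HE", "SMRIN": "6.2"},
]

# Toy raw count matrix: first column is the gene ID, then one column per sample.
COUNT_HEADER = ["gene_id", "S1", "S2", "S3", "S4", "S5", "S6"]
COUNT_ROWS = [
    ["geneA", 10, 12, 30, 28, 5, 9],
    ["geneB", 0, 3, 1, 2, 7, 4],
]


def pick_top_rin(rows, tissue, status, n):
    """Return the IDs of the n highest-RIN samples for one tissue and status."""
    matching = [r for r in rows if r["SMTSD"] == tissue and r["STATUS"] == status]
    matching.sort(key=lambda r: float(r["SMRIN"]), reverse=True)
    return [r["SAMPID"] for r in matching[:n]]


def subset_counts(header, count_rows, keep_ids):
    """Keep the gene ID column plus the requested sample columns, in that order."""
    idx = [0] + [header.index(sample_id) for sample_id in keep_ids]
    return [[row[i] for i in idx] for row in [header] + count_rows]


healthy = pick_top_rin(ANNOTATION, "Liver", "HE", 2)
diseased = pick_top_rin(ANNOTATION, "Liver", "DIS", 1)
subset = subset_counts(COUNT_HEADER, COUNT_ROWS, healthy + diseased)

for row in subset:
    print(row)
```

The resulting sub-matrix of raw counts, together with the HE/DIS labels, is what would then be passed to DESeq2.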
As expected, the PCA plot showed quite large inter-sample variation, but surprisingly the HE and DIS samples were also not well separated. I got only a few significant DEGs, and these did not match previously reported gene expression profiles.
We thought that we could find more DEGs by pre-selecting more "similar" samples for each condition. To find them, I generated separate PCA plots for all HE samples and for all DIS samples.
Based on these PCA plots, I selected 10 samples per condition that lay closest to each other on both PC1 and PC2. After running DESeq2 on these 10 vs. 10 samples, the PCA plot again showed large inter-sample variation, also within each condition, and the analysis returned zero significant DEGs.
I also tried the same samples with limma-voom, but again ended up with zero DEGs.
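To make the PCA-based pre-selection explicit, here is a minimal sketch of one way such a selection could be done. It assumes that "closest to each other" is approximated by the distance to the group centroid in the PC1/PC2 plane, which is only one possible reading; the PC scores, the sample IDs, and the function name `closest_to_centroid` are invented for illustration.

```python
import math

# Hypothetical (PC1, PC2) scores for the samples of one condition.
HE_SCORES = {
    "S1": (0.0, 0.0),
    "S2": (1.0, 0.5),
    "S3": (0.0, 2.0),
    "S4": (10.0, 10.0),  # an obvious outlier
    "S5": (1.0, 1.0),
}


def closest_to_centroid(pc_scores, k):
    """Return the k sample IDs nearest to the group centroid in the PC1/PC2 plane."""
    n = len(pc_scores)
    cx = sum(x for x, _ in pc_scores.values()) / n
    cy = sum(y for _, y in pc_scores.values()) / n
    # Rank samples by Euclidean distance to the centroid, nearest first.
    ranked = sorted(
        pc_scores,
        key=lambda s: math.hypot(pc_scores[s][0] - cx, pc_scores[s][1] - cy),
    )
    return ranked[:k]


selected = closest_to_centroid(HE_SCORES, 3)
print(selected)
```

Note that this kind of selection uses only the first two principal components, so samples that look close in the plot can still differ along the remaining components.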
After this experience, I would like to ask the following questions:
- Can we use large public datasets like GTEx for DEG analyses?
- (If yes) How many samples per condition are optimal for finding DEGs while keeping the inter-sample variation manageable?
- Is there a good method, pipeline, or tool for finding DEGs when there are many samples per condition and batch effects?
- Is it a good idea to pre-select samples that are "more similar" (clustering together) based on a two-dimensional PCA plot?
- Does anyone else have similar experience with GTEx?
- Bonus question: is it possible that some samples are mixed up in the GTEx database?
Looking forward to your responses. And thank you in advance for any suggestions and comments.