Would really appreciate any thoughts/opinions on what might be going on with my dataset - thanks in advance :)
I performed bulk RNA-seq on kidney samples from a cohort of 50 patients with a defined disease, confirmed by pathologist review of sample slides. When I visualized the transcriptomes by t-SNE for fun, I found very distinct clustering into two groups, split 60:40 (similar results on PCA and UMAP). I can't identify any differences between the groups: no major differences in age, sex, clinical parameters collected, sequencing batch, or anything else I can think of. Weirdly, the genes that are differentially expressed between the two (DESeq2, adjusted p < 0.05, absolute log2FoldChange > 1) are almost all upregulated in one group (let's call it group A), except for a single poorly annotated transcript that is higher in the other group (group B). Moreover, the upregulated genes are all very closely correlated with each other, with a median Pearson's r of 0.87! Many (most) of these genes are not known to be expressed in the kidney, and most are completely absent from group B.
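To make the two numbers above concrete, here is a minimal Python sketch of the filter (adjusted p < 0.05, absolute log2FoldChange > 1) and of a median pairwise Pearson correlation across the selected genes. It is an illustration on made-up toy values, not the actual pipeline: the gene names, the `results` and `expr` dictionaries, and the helper functions are all hypothetical, and it assumes the DESeq2 results table has already been exported, with NA adjusted p-values represented as `None`.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_de_genes(results, padj_max=0.05, lfc_min=1.0):
    """Keep genes with padj < padj_max and |log2FoldChange| > lfc_min.

    `results` maps gene -> (log2FoldChange, padj); a padj of None
    (an NA in the exported table) never passes the filter.
    """
    return [
        gene
        for gene, (lfc, padj) in results.items()
        if padj is not None and padj < padj_max and abs(lfc) > lfc_min
    ]


def median_pairwise_r(expr, genes):
    """Median Pearson r over all unordered pairs of the given genes."""
    return median(pearson_r(expr[a], expr[b]) for a, b in combinations(genes, 2))


# Toy values only (hypothetical genes, five hypothetical samples).
results = {
    "GENE_A": (3.2, 1e-8),
    "GENE_B": (2.5, 1e-4),
    "GENE_C": (0.4, 0.01),   # significant but fold change too small
    "GENE_D": (-1.8, 0.2),   # large fold change but not significant
    "GENE_E": (2.0, None),   # NA adjusted p-value
    "GENE_F": (2.1, 0.001),
}
expr = {
    "GENE_A": [0, 0, 5, 6, 7],
    "GENE_B": [1, 1, 11, 13, 15],
    "GENE_F": [1, 0, 4, 7, 6],
}

de_genes = select_de_genes(results)
med_r = median_pairwise_r(expr, de_genes)
print(de_genes, round(med_r, 2))
```

The correlation would normally be computed on normalized or variance-stabilized expression values rather than raw counts; which transformation was used for the 0.87 figure is not stated here.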
We checked whether the biopsies in that cluster had sampled any other tissue, which could explain it... but there was nothing clear on the pathology reports, and pathway enrichment doesn't show much. The genes don't really follow an organ-specific pattern; a lot of them are weird neural/endocrine/gonadal homeobox genes and transcription factors. Again, the pattern of weird genes is very closely conserved among all group A samples.
My senses are tingling that something technical explains this, but I can't think of what. The RNA-seq was done on FFPE samples, so maybe this is all just poor-quality RNA... but why would the SAME set of genes be upregulated in poor-quality samples? Has anyone ever seen anything like this, or have any insight into what might be going on?