I would really appreciate a second opinion on something I am attempting to do or perhaps someone can argue against this and direct me towards something more applicable. ( I did dig up this not-very-related post, where this was suggested in the comments Combining Microarray and RNAseq data).
Data: 2 x timeseries RNA-seq datasets. (Cell_line1 and Cell_line2).
Background: I am trying to capture what makes the cells behave the same way across the timeseries and whether there is something drastically different. Basically comparing gene expression trajectories.
Analysis: Analyzed as separate datasets in DESeq2 using LRT to obtain significantly varying genes across the timepoints.
design(Cell_line1) <- ~ condition
Cell_line1_dds <- DESeq(Cell_line1, test="LRT", reduced = ~1)
Result: ~10000 DE genes in Cell_line1, ~5000 DE genes in Cell_line2. ~4000 overlapping DE genes. The first intuitive step to look at differences was to look at non-shared DE genes and characterize these through ORA. However, I noticed that among the 4000 overlapping DE genes there are a few that behave in completely opposite patterns in the cell lines. E.g. in cell_line1 they are up-regulated and in cell-line2 down-regulated.
Question: I wanted to have a more systematic way of identifying among these 4000 shared DE genes the ones that behave similarly between the cell-lines and the ones that do not. So I tried clustering. As these datasets were analyzed separately the best I could come up with in order to compare the expression patterns between them was to subset the VST-transformed and Z-score scaled data-matrices for these 4000 overlapping DE genes, merge these into one matrix, and use for clustering. So the subset of the 4000 genes for Cell_line1 and Cell_line2 were merged into one large matrix. I made a pseudo example of what that looks like, so the first three columns are the scaled VST counts for Cell_line1 and the next three for Cell_line2.
> print(example) Cell_line1_time1 Cell_line1_time2 Cell_line1_time3 Cell_line2_time1 Cell_line2_time2 Cell_line2_time3 gene1 -1.978590 -2.019766 0.9195353 0.8658088 1.0745280 1.0391918 gene2 -1.990042 -1.926219 1.2212178 1.0817830 0.8957532 0.8291394 gene3 -2.052207 -1.944479 0.1934400 0.1990469 1.3369705 1.1660817 gene4 -1.962848 -1.929381 1.1132417 1.0928312 0.9362451 0.9592930 gene5 -2.102807 -1.894731 0.6807321 0.7203357 1.2593390 1.0988770 gene6 -1.967632 -1.946733 1.0209805 0.9119801 1.0010454 1.1305777
The idea behind merging these was that if the patterns are the same they will just mirror each other in the matrix like two cycles of gene expression, but for opposite behaving genes the second half of the matrix will behave differently. I tried various clusterings, k-means seems to be the most interesting (visually. The number of clusters for all clusterings has been a solid guesstimation by looking at the dendogram for hierarchical or the heatmap for k-means. All the other elbow, silhouette and gap methods don't seem to converge on something meaningful, but there definitely is structure in the data). K-means identifies clusters of genes that behave the same way and a single cluster of genes that has completely opposite patterns - exactly what my ideal goal was. But what remains unanswered is if this is an actually valid approach or did I just hack my way through to something cool that is based on invalid assumptions?
If anyone can confirm this OR has other ideas on how to approach this please do share, I'll take any suggestions and criticism on board.
Would a more sound way of tackling this be to cluster them separately and then calculate some similarity between the clusters and have a cutoff what I consider similar, like with Jaccard index (I am just name dropping things I saw on google).