Clustering two different RNA-seq timeseries results to identify shared/opposite gene patterns?
1
0
Entering edit mode
5 weeks ago
kelen ▴ 160

Hi!

I would really appreciate a second opinion on something I am attempting to do or perhaps someone can argue against this and direct me towards something more applicable. ( I did dig up this not-very-related post, where this was suggested in the comments Combining Microarray and RNAseq data).

Data: 2 x timeseries RNA-seq datasets. (Cell_line1 and Cell_line2).

Background: I am trying to capture what makes the cells behave the same way across the timeseries and whether there is something drastically different. Basically comparing gene expression trajectories.

Analysis: Analyzed as separate datasets in DESeq2 using LRT to obtain significantly varying genes across the timepoints.

design(Cell_line1) <- ~ condition

Cell_line1_dds <- DESeq(Cell_line1, test="LRT", reduced = ~1)

Result: ~10000 DE genes in Cell_line1, ~5000 DE genes in Cell_line2. ~4000 overlapping DE genes. The first intuitive step to look at differences was to look at non-shared DE genes and characterize these through ORA. However, I noticed that among the 4000 overlapping DE genes there are a few that behave in completely opposite patterns in the cell lines. E.g. in cell_line1 they are up-regulated and in cell-line2 down-regulated.

Question: I wanted to have a more systematic way of identifying among these 4000 shared DE genes the ones that behave similarly between the cell-lines and the ones that do not. So I tried clustering. As these datasets were analyzed separately the best I could come up with in order to compare the expression patterns between them was to subset the VST-transformed and Z-score scaled data-matrices for these 4000 overlapping DE genes, merge these into one matrix, and use for clustering. So the subset of the 4000 genes for Cell_line1 and Cell_line2 were merged into one large matrix. I made a pseudo example of what that looks like, so the first three columns are the scaled VST counts for Cell_line1 and the next three for Cell_line2.

> print(example)

Cell_line1_time1 Cell_line1_time2 Cell_line1_time3 Cell_line2_time1 Cell_line2_time2 Cell_line2_time3
gene1        -1.978590        -2.019766        0.9195353        0.8658088        1.0745280        1.0391918
gene2        -1.990042        -1.926219        1.2212178        1.0817830        0.8957532        0.8291394
gene3        -2.052207        -1.944479        0.1934400        0.1990469        1.3369705        1.1660817
gene4        -1.962848        -1.929381        1.1132417        1.0928312        0.9362451        0.9592930
gene5        -2.102807        -1.894731        0.6807321        0.7203357        1.2593390        1.0988770
gene6        -1.967632        -1.946733        1.0209805        0.9119801        1.0010454        1.1305777


The idea behind merging these was that if the patterns are the same they will just mirror each other in the matrix like two cycles of gene expression, but for opposite behaving genes the second half of the matrix will behave differently. I tried various clusterings, k-means seems to be the most interesting (visually. The number of clusters for all clusterings has been a solid guesstimation by looking at the dendogram for hierarchical or the heatmap for k-means. All the other elbow, silhouette and gap methods don't seem to converge on something meaningful, but there definitely is structure in the data). K-means identifies clusters of genes that behave the same way and a single cluster of genes that has completely opposite patterns - exactly what my ideal goal was. But what remains unanswered is if this is an actually valid approach or did I just hack my way through to something cool that is based on invalid assumptions?

If anyone can confirm this OR has other ideas on how to approach this please do share, I'll take any suggestions and criticism on board.

Would a more sound way of tackling this be to cluster them separately and then calculate some similarity between the clusters and have a cutoff what I consider similar, like with Jaccard index (I am just name dropping things I saw on google).

Yours truly,

RNA-seq trajectories timeseries clustering • 273 views
1
Entering edit mode
5 weeks ago
swbarnes2 9.8k

Another thing you could do would be to make the time points numerical, and then running DESeq will get you the slope of the line as compared to the logged normalized counts. You can do an interaction to get the difference in slopes between cell1 and cell2. This will get you the genes which change differently over time between the cell lines.

0
Entering edit mode

Thanks for taking the time to respond!

I think I roughly understand what you mean, but does that include making specific assumptions about the time? The time-points are not equally spaced or they are more 'pseudo' than '2h, 4h, 6h'. They are sorted based on the cyclic expression of marker-proteins.

I might attempt what you suggest and see if that gives me anything different for that set of genes. Can I just confirm, that what this approach would essentially do is focus on genes where the direction of change across numerical time-points is significantly different between the Cell-lines and disregard genes that change the same way? I have a feeling it would then also include the few thousand genes in Cell_line1, that are uniquely expressed in those cells and it will be harder to pick out opposite expression patterns from differences because of unique expression. I guess I won't know before I try.