Question

What is the best way to measure similarities / correlations of gene expression time series curves to define commonly behaving regulatory units / groups?

2

8.5 years ago

tfhahn ▴ 50

I am trying to figure out whether there is any relationship between the expression time series curve of a gene and its function. For this purpose I plotted time series for over 5,000 yeast genes. Unfortunately, the commonly used Pearson correlation effectively normalizes each curve (it centers and scales it) before comparing, regardless of whether the expression of a gene changes a lot or not at all. That is why I am now looking for a different kind of similarity measure, one that does not normalize before calculating correlations / similarities but is instead based on the absolute values of each curve. That way I hope to get fewer highly correlated curves among genes that don't seem to have anything to do with one another. I only want curves to come out as highly correlated if not only their relative shapes but especially their absolute values are very similar.

What kind of comparison function should I use for this purpose? I'd prefer R, since I generated the attached plots in R. My plots show the same as those from the publication from which I took the microarray gene expression data. I was using the following study:

"Global control of cell-cycle transcription by coupled CDK and network oscillators" by

David A. Orlando, Charles Y. Lin, Allister Bernard, Jean Y. Wang, Joshua E. S. Socolar, Edwin S. Iversen, Alexander J. Hartemink & Steven B. Haase, doi:10.1038/nature06955 (http://www.nature.com/nature/journal/v453/n7197/edsumm/e080612-19.html)

I was able to replicate their results, although I intentionally did not normalize, because I feel that normalizing is cheating and treats some genes unfairly.

My results look similar to theirs in figure 2 (see http://www.nature.com/nature/journal/v453/n7197/fig_tab/nature06955_F2.html#figure-title)

They identified 6 genes that oscillate particularly strongly with the cell cycle and can therefore be considered cell cycle drivers. Those genes are CLN2, RNR1, SIC1, NIS1, CDC20 and ACE one (see the text below figure 2). In my attached plots, too, the same genes have the highest variance. But there are many genes with putative and unknown functions that seem to cycle with them and could therefore be considered as regulated by the same mechanism.

Now I am looking for a way to measure the similarities / correlations between these 6 cell cycle driver genes and the rest. That way I hope to define regulatory units. But I don't want normalization, because it sets the fluctuation (i.e. the difference between the minimum and maximum of each time series curve) equal for all curves even when it is not. I tried correlation based on normalization, but then I could not find any GO term enrichment for the many seemingly correlated genes, whose trajectories were more than 0.85 correlated despite having completely unrelated functions. I want an absolute correlation, where only trajectories that would almost lie on top of each other based on their absolute values get a high correlation score, and not trajectories that merely share the overall shape of the curve after normalization. I am not sure whether this approach will definitely work better than our many failed attempts to control the aging process to the point where we can effectively and permanently reverse it, so that old age and death no longer threaten us. What is an R function that can calculate such an absolute instead of a normalization-based correlation?
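To make the distinction above concrete, here is a minimal sketch (not the poster's code) comparing Pearson correlation with two scale-aware measures on made-up curves. It is written in plain Python so that it is self-contained; in R, `cor()` gives the Pearson correlation and `dist()` gives the Euclidean distance. Lin's concordance correlation coefficient is an assumption on my part, suggested as one possible "absolute" correlation; it is not mentioned in the thread. The curves `strong` and `weak` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: invariant to shifting and scaling either curve."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Euclidean distance on the raw values: 0 only if the curves coincide."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def concordance(x, y):
    """Lin's concordance correlation coefficient: reaches 1 only when
    the two curves agree in shape, amplitude and mean level."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

# Made-up example: identical shape, but the second curve has a 10x smaller amplitude.
strong = [4.0 * math.sin(2 * math.pi * t / 8) for t in range(16)]
weak = [0.1 * v for v in strong]

print("Pearson:    ", round(pearson(strong, weak), 3))
print("Euclidean:  ", round(euclidean(strong, weak), 3))
print("Concordance:", round(concordance(strong, weak), 3))
```

On these two curves Pearson is 1 up to rounding, whereas the concordance coefficient stays well below 1 and the Euclidean distance is clearly non-zero, which is the behaviour asked for here.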

My hypothesis is that when the yeast is still young the genes belonging to the same regulatory units are very well coordinated and work together. But as the yeast ages this synchronicity is gradually lost. This interferes with the proper functioning of each regulatory unit to the point where it causes aging and death if the genes of a regulatory unit don't work together at all anymore.

I further expect that if we did the same experiment with old yeast cells, far fewer genes would follow their cell cycle leader genes, and with a much smaller magnitude. Here is the link to the microarray dataset, which I have analyzed: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8799.

The red and blue lines in my attached time series plots are CDK mutants, which cannot properly go through the cell cycle anymore. But for us, the important part for determining which genes we'd consider members of the same regulatory unit / group / GO term (Gene Ontology) are the 2 wild-type replicates, which are shown by a light green and a light blue line in my attached time series plots for over 5,100 yeast genes.

Since the magnitude of the cycling can take up more than 50% of the range, it also becomes much clearer why we could not observe any clear linear trends in the other datasets, which have measurements throughout the entire lifespan of the yeast: the gaps between their time points exceed the time for one cell cycle, i.e. 2-4 hours maximum. The time on the X axis of the plots is given in minutes.

At the end of this text I have inserted the link to my time series plots and the Nature article plus supplements. Now I need help in figuring out how to make co-expression and regulatory networks from these time series plots. From visual inspection it seems to me that the time series plots of genes that belong to the same GO terms don't appear to be any more correlated to one another than they are to all the remaining genes. But as far as I understand, this is the basis on which co-expression networks are built. Am I understanding things wrong here?

If you can help with any answers to these questions or explanations or materials I would be very thankful because I somehow need to solve these problems before I can get my degree but I am only very slow in googling things since I am legally blind. But when I know which text is important I can listen to it.

So please reply to me via email at <censored> or via Skype to my Skype ID, which is <censored>. Thank you so very much in advance.

Thomas

R genome networks • 3.7k views

updated 7.9 years ago by Sirus ▴ 820 • written 8.5 years ago by tfhahn ▴ 50

0

Using personal communication such as email or skype is discouraged since everyone in the community can benefit from your question and contribute to the answers in a valuable discussion which can still be interesting and relevant for other people with the same issue. I censored your personal information. It's also not a great idea to share your personal information everywhere, but that's up to you.

8.5 years ago by WouterDeCoster 48k

0

My text-to-speech software, on which I depend to access electronic information because I am legally blind, can read much better in Gmail than in this interface. Here I have to worry about accidentally missing replies.

8.4 years ago by tfhahn ▴ 50

0

My apologies. In that case, feel free to add email and skype again.

8.4 years ago by WouterDeCoster 48k

score 5 · Answer 1 · 2017-02-06

A standard way of measuring distance between time series is dynamic time warping (DTW). It was first applied to gene expression data in this paper. There's a Java program, GeneTxWarper, implementing classical DTW for gene expression data, available here.
Several variants of the algorithm are implemented in the R package dtw.
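
For readers who want to see what classical DTW does before reaching for a package, here is a minimal sketch in plain Python. It is not the code of GeneTxWarper or of the R package dtw, which offer more step patterns and options; the function names and the toy curves `a`, `b` and `c` are hypothetical.

```python
import math

def dtw_distance(x, y):
    """Classical dynamic time warping distance with absolute difference
    as local cost and the three basic steps (match, insert, delete)."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch y
                                     cost[i][j - 1],      # stretch x
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]

def pointwise_distance(x, y):
    """Sum of absolute differences without any warping, for comparison."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Made-up curves: b is a shifted by one time point, c is a with doubled amplitude.
a = [0, 1, 2, 3, 2, 1, 0, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]
c = [2 * v for v in a]

print("a vs b, pointwise:", pointwise_distance(a, b))
print("a vs b, DTW:      ", dtw_distance(a, b))
print("a vs c, DTW:      ", dtw_distance(a, c))
```

Because the local cost is computed on the raw values, the time-shifted copy `b` ends up at DTW distance zero from `a`, while the amplified copy `c` does not. In other words, DTW on unnormalized data forgives delays but still penalizes differences in absolute level.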

score 1 · Answer 2 · 2017-09-10

I think you can take inspiration from single-cell methods that detect differentially expressed genes along pseudo-time. Different techniques have been developed to detect cyclic genes (here) and genes that show significantly differential expression along pseudo-time (such as here and here).

Some people (such as here) use Gaussian processes to detect DE genes. For example, in this paper they fit two Gaussian processes for each gene: one representing a white-noise model (the null hypothesis) and an alternative model represented by a radial basis kernel plus noise. They then select the DE genes according to the likelihood ratio of the two models.
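
The likelihood-ratio idea can be sketched roughly as follows, in plain Python. This is only an illustration under strong assumptions: the kernel hyperparameters (`signal_var`, `lengthscale`, `noise_var`) are fixed by hand here, whereas a real analysis would fit them per gene, and the two test curves are made up. It is not the implementation used in the paper referred to above.

```python
import math

def rbf_kernel(times, variance, lengthscale):
    """Radial basis (squared exponential) covariance matrix."""
    return [[variance * math.exp(-0.5 * ((s - t) / lengthscale) ** 2)
             for t in times] for s in times]

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def log_marginal_likelihood(cov, y):
    """log N(y | 0, cov), computed via the Cholesky factor."""
    n = len(y)
    low = cholesky(cov)
    z = []  # forward substitution: solve low * z = y
    for i in range(n):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((y[i] - s) / low[i][i])
    quad = sum(v * v for v in z)  # equals y^T cov^-1 y
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2 * math.pi)

def log_likelihood_ratio(times, y, signal_var=1.0, lengthscale=2.0, noise_var=0.1):
    """Log-likelihood of (RBF + noise) minus log-likelihood of noise only.
    Hyperparameters are fixed assumptions, not fitted values."""
    n = len(y)
    mean = sum(y) / n
    centered = [v - mean for v in y]  # zero-mean GP, so remove the mean level
    noise = [[noise_var if i == j else 0.0 for j in range(n)] for i in range(n)]
    smooth = rbf_kernel(times, signal_var, lengthscale)
    alt = [[smooth[i][j] + noise[i][j] for j in range(n)] for i in range(n)]
    return log_marginal_likelihood(alt, centered) - log_marginal_likelihood(noise, centered)

# Made-up curves: a smooth oscillation and a small jitter around zero.
times = list(range(20))
cycling = [math.sin(2 * math.pi * t / 10) for t in times]
jitter = [0.3 if t % 2 == 0 else -0.3 for t in times]

print("cycling:", round(log_likelihood_ratio(times, cycling), 2))
print("jitter: ", round(log_likelihood_ratio(times, jitter), 2))
```

With these settings the smooth oscillation gets a positive log-likelihood ratio (the smooth model explains it better than pure noise) and the jitter gets a negative one. Ranking genes by this ratio is the selection step described above.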

There are also some R packages for finding gene fluctuation patterns, such as the clues package (which basically just does clustering, with no p-values) or the EBSeqHMM package, which is based on a hidden Markov model and can classify genes according to their change pattern (but the computation is slow).