Question: How to select top ranking genes with variance from a time point experiment using R or Excel
3.4 years ago by
United States
herman.pappoe.4510 wrote:

Hello everyone,

I have RNA-seq data from human cardiomyocyte samples collected at 5 time points during the development of the cells (Day 0, Day 2, Day 5, Day 15, Day 30), so the model is a directed differentiation system. I am using a file with normalized RPM counts for each transcript ID from a previous transcriptome quantification step (with Cufflinks), and I eventually plan on "grepping" these transcript IDs to the corresponding Gene_IDs. What I essentially have is a matrix of cuff.IDs with expression values in 5 columns, one per time point.

I want to build a gene regulatory network that captures the differentiation process in our cardiomyocyte samples, using genes that are consistently differentially expressed across the differentiation time points. I was thinking of running a differential expression analysis of each time point against Day 0, using Day 0 as the control, and then selecting the genes that remain differentially expressed in all comparisons (Day 0 vs 2, Day 0 vs 5, Day 0 vs 15, Day 0 vs 30). My intention was to rerun DESeq2 in R in this manner.

However, when I mentioned this idea to my PI, I was told that I could instead calculate the covariance among the samples, rank the genes, and select the top few genes using Excel. I have no idea how to approach this in Excel. I am completely inexperienced in bioinformatics, programming, and statistics, and I barely used a PC until 5 months ago. I would appreciate a step-by-step tutorial on how to approach this in Excel for my specific project. I am aware there are many tutorials out there, but none are clear and they are causing me more confusion. For example, when I calculate the covariance between two lists of genes, the result is a single value. What can I do with this covariance value in Excel in order to rank the genes by covariance?

My supervisor instructed me to use R to get these results. However, I am terrible with R. I cannot even figure out which function to use to read the file; read.table is giving me some issues. These are the commands my supervisor advised me to use to obtain the variance from the list:

topVarGenes <- head(order(rowVars(data[,2:6]),decreasing=TRUE),15)

gene_lists <- cbind(data[topVarGenes,], rowVars(data[topVarGenes,2:6]))

write.table(gene_lists,file='topVarGenes.txt',quote=FALSE,sep="\t")

### So, as I understand it, rowVars calculates the covariance and order ranks the genes.

The above is just not working. I think it might have to do with how I loaded the data, but I am so inexperienced in R that I am not certain what the issue is. I am speculating that maybe it should be a data.frame. I would be much obliged if I could get step-by-step R commands to get the results I need.

Also, if I wanted to instead calculate the covariance against Day 0 for all samples, how would I modify the commands?

I know I have asked a lot of questions and I am very grateful in advance to whoever takes the time to respond.

excel time-point rna-seq R variance • 2.2k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by herman.pappoe.4510

Please can you add sample data and the output.

ADD REPLYlink written 3.4 years ago by lmanohara9920

Yes, I can. Which output are you referring to?

ADD REPLYlink written 3.4 years ago by herman.pappoe.4510
3.4 years ago by
United States
herman.pappoe.4510 wrote:

This is what the Excel file looks like. Using the covariance function in Excel for Day 0 and Day 2, for example, I get a single value: I selected the COVARIANCE.S function, highlighted the cells for Day 0 as the first array and the cells for Day 2 as the second array, and clicked done. The result appears as one value in a separate cell. Besides the fact that I get an error saying "Formula Omits Adjacent Cells", I am not clear on how to use COVARIANCE.S in Excel to rank all the transcript IDs. What should I do with the resulting value? Or is there a better approach to rank the "genes" (a.k.a. cuff.IDs/transcripts) by covariance using Excel?

          DAY 00       DAY 02       DAY 05       DAY 15       DAY 30
          0            2            5            15           30
CUFF.ID   0            0            2.297569688  0.876671707  4.140347772
CUFF.ID   2.626527804  0            9.19027875   8.766717072  4.140347772
CUFF.ID   330.9425034  209.1708523  785.7688332  642.6003614  785.6309897
CUFF.ID   799.7777164  440.7528674  1553.922965  1551.708922  2158.156276
CUFF.ID   0            0            0            0            0
CUFF.ID   0            1.067198226  1.531713125  14.02674732  8.280695543
CUFF.ID   1.313263902  0            4.595139375  2.630015122  2.070173886
CUFF.ID   2.626527804  2.134396452  0.765856563  0            0
CUFF.ID   5540.660403  4782.115251  4170.85484   3401.486224  3413.716738
CUFF.ID   23.63875024  34.15034324  13.78541813  23.67013609  19.66665192

 

 

I was also trying to use R to get the same results. My supervisor sent me sample commands to help with the computation. This is what the file looks like once loaded into R:

> head(data)
                   V2         V3          V4           V5          V6
CUFF.1       0.000000   0.000000    2.297570    0.8766717    4.140348
CUFF.10      2.626528   0.000000    9.190279    8.7667171    4.140348
CUFF.10000 330.942503 209.170852  785.768833  642.6003614  785.630990
CUFF.10001 799.777716 440.752867 1553.922965 1551.7089220 2158.156276
CUFF.10002   0.000000   0.000000    0.000000    0.0000000    0.000000
CUFF.10007   0.000000   1.067198    1.531713   14.0267473    8.280696
 

It was loaded with read.table: data <- read.table("file_1")
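A minimal sketch of one way to load a file like this so that the IDs become row names and only numeric columns remain. It assumes a tab-separated text file with no header line; the temporary file below is a hypothetical stand-in for "file_1" built from the first three rows of the table above, and the Day0 to Day30 column labels are my own choice, not something read from the file.

```r
# Hypothetical stand-in for "file_1": three rows from the table above,
# tab-separated, with the transcript ID in the first field and no header line.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "CUFF.1\t0\t0\t2.297569688\t0.876671707\t4.140347772",
  "CUFF.10\t2.626527804\t0\t9.19027875\t8.766717072\t4.140347772",
  "CUFF.10000\t330.9425034\t209.1708523\t785.7688332\t642.6003614\t785.6309897"
), path)

# row.names = 1 turns the first field into row names, leaving five numeric columns.
data <- read.table(path, header = FALSE, sep = "\t", row.names = 1)

# Assumed labels for the five time points (the file itself carries none here).
colnames(data) <- c("Day0", "Day2", "Day5", "Day15", "Day30")

dim(data)   # number of transcripts, number of time points
str(data)   # every column should be reported as num
```

If `str(data)` reports any column as character or factor instead of num, the file probably contains a header line or stray text, and `header = TRUE` or a different `sep` may be needed.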

I installed the matrixStats package to run the following commands, to no avail:

> topVarGenes <- head(order(rowVars(data[,2:6]),decreasing=TRUE),15)
Error in head(order(rowVars(data[, 2:6]), decreasing = TRUE), 15) : 
  error in evaluating the argument 'x' in selecting a method for function 'head': Error in `[.data.frame`(data, , 2:6) : undefined columns selected

> gene_lists <- cbind(data[topVarGenes,], rowVars(data[topVarGenes,2:6]))
Error in `[.data.frame`(data, topVarGenes, ) : 
  object 'topVarGenes' not found

> write.table(gene_lists,file='topVarGenes.txt',quote=FALSE,sep="\t")
Error in is.data.frame(x) : object 'gene_lists' not found

**I should add that my supervisor also reformatted the data file so that the Cuff.ID column would be the row names, or something of the sort.
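The "undefined columns selected" error above is consistent with the IDs being row names: head(data) shows only five columns (V2 to V6), so data[,2:6] asks for a sixth column that does not exist. The other two errors simply follow from the first command failing. A hedged sketch of the same ranking using all five columns is below. It rebuilds the six rows printed by head(data) so it runs on its own, and it uses base R's apply(..., 1, var) in place of matrixStats::rowVars, which expects a numeric matrix and should give the same per-row variances.

```r
# The six rows shown by head(data), rebuilt here so the sketch is self-contained.
data <- data.frame(
  V2 = c(0, 2.626528, 330.942503, 799.777716, 0, 0),
  V3 = c(0, 0, 209.170852, 440.752867, 0, 1.067198),
  V4 = c(2.297570, 9.190279, 785.768833, 1553.922965, 0, 1.531713),
  V5 = c(0.8766717, 8.7667171, 642.6003614, 1551.7089220, 0, 14.0267473),
  V6 = c(4.140348, 4.140348, 785.630990, 2158.156276, 0, 8.280696),
  row.names = c("CUFF.1", "CUFF.10", "CUFF.10000",
                "CUFF.10001", "CUFF.10002", "CUFF.10007")
)

# All five columns are expression values, so no column subsetting is needed.
expr <- as.matrix(data)

# Variance of each transcript across the five time points (one value per row).
gene_var <- apply(expr, 1, var)

# Indices of the highest-variance rows; 15 as in the original command,
# capped at the number of rows available.
n_top <- min(15, nrow(expr))
topVarGenes <- head(order(gene_var, decreasing = TRUE), n_top)

# Selected rows plus their variance as an extra column.
gene_lists <- cbind(data[topVarGenes, ], variance = gene_var[topVarGenes])

# col.names = NA keeps the header aligned with the row-name column.
write.table(gene_lists, file = "topVarGenes.txt",
            quote = FALSE, sep = "\t", col.names = NA)
```

Note that this ranks transcripts by variance across time points, not by covariance. Because the values are not log-transformed, highly expressed transcripts will tend to dominate the top of the list.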

I have 2 important concerns:

1) I do not understand how this function is computing covariance. Is it pairing adjacent columns? 

2) I want to be able to calculate the covariance and rank the genes ALL against DAY0. Hence Day0-2, Day0-5, Day0-15, Day0-30. Can this function do that? Is there a better approach to my idea?
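One possible reading of "ALL against DAY0" is a per-transcript log2 ratio of each later day to Day 0, which gives four numbers per transcript that can then be ranked. The sketch below is only an illustration of that idea, not a covariance and not a statistical test: with a single column per time point it is a descriptive ranking. The pseudocount of 1, the Day0 to Day30 labels, and the "smallest absolute change" score are all my own assumptions. It uses four of the rows from head(data).

```r
# Four rows from head(data), with columns relabelled by day (labels assumed).
expr <- matrix(
  c(  0.000000,   0.000000,    2.297570,    0.8766717,    4.140348,
    330.942503, 209.170852,  785.768833,  642.6003614,  785.630990,
    799.777716, 440.752867, 1553.922965, 1551.7089220, 2158.156276,
      0.000000,   1.067198,    1.531713,   14.0267473,    8.280696),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("CUFF.1", "CUFF.10000", "CUFF.10001", "CUFF.10007"),
                  c("Day0", "Day2", "Day5", "Day15", "Day30"))
)

# Pseudocount so that zeros do not become -Inf under log2; an arbitrary choice.
pseudo <- 1
log_expr <- log2(expr + pseudo)

# Subtract the Day0 column from each later column:
# one log2 ratio per transcript per comparison.
lfc <- log_expr[, -1, drop = FALSE] - log_expr[, "Day0"]
colnames(lfc) <- paste0(colnames(lfc), "_vs_Day0")

# Score each transcript by its smallest absolute change, so a high score
# means it differs from Day0 in all four comparisons.
min_abs_lfc <- apply(abs(lfc), 1, min)
ranked <- lfc[order(min_abs_lfc, decreasing = TRUE), , drop = FALSE]

ranked
```

For a formal differential expression analysis along the lines originally planned, keep in mind that DESeq2 expects raw, un-normalized counts rather than normalized values.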

Thank you very much for your response!

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by herman.pappoe.4510

I think you need to slow down and ask your PI (if he/she knows) to explain what covariance is. If you do a covariance function of Day 0 and Day X, you are basically measuring the relatedness of the days and not the genes, i.e. does Day X depend on Day 0 based on all of your genes.

https://en.wikipedia.org/wiki/Covariance_and_correlation Start with this. Think about how Excel is using these formulae in your application and why you aren't getting the expected result. It's dangerous to use statistics without understanding them, as you will get all sorts of wrong answers.
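To make the distinction concrete, here is a small sketch in R using four of the rows from head(data) above; the Day0 to Day30 column labels are assumed. It shows that a covariance between two day columns collapses to a single number describing the days, whereas a per-row variance gives one number per transcript, which is something that can actually be ranked.

```r
# Four rows from head(data), with columns relabelled by day (labels assumed).
expr <- matrix(
  c(  0.000000,   0.000000,    2.297570,    0.8766717,    4.140348,
      2.626528,   0.000000,    9.190279,    8.7667171,    4.140348,
    330.942503, 209.170852,  785.768833,  642.6003614,  785.630990,
    799.777716, 440.752867, 1553.922965, 1551.7089220, 2158.156276),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("CUFF.1", "CUFF.10", "CUFF.10000", "CUFF.10001"),
                  c("Day0", "Day2", "Day5", "Day15", "Day30"))
)

# Two day columns in, one number out: how Day0 and Day2 co-vary across transcripts.
cov_day0_day2 <- cov(expr[, "Day0"], expr[, "Day2"])

# All pairs of days at once: a 5 x 5 matrix, still nothing per transcript.
cov_between_days <- cov(expr)

# One value per transcript: the variance of each row across the five time points.
var_per_gene <- apply(expr, 1, var)

length(cov_day0_day2)   # a single value
dim(cov_between_days)   # days by days
var_per_gene            # named by transcript ID, so it can be sorted
```

This is the same behaviour as Excel's COVARIANCE.S on two columns: the single value it returns is a property of the pair of days, not of any individual transcript.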

ADD REPLYlink written 3.4 years ago by Alopex0