Question: Performing Anova Test For Ngs Time Series Datasets
gravatar for
5.0 years ago by
bioinforupesh2009.au100 wrote:

Dear all,

May be its very basic question for you all. I would like to perform 2 way ANOVA test for my NGS time series counts data. It is raw data counts by HT-Seq. I have 2 condition and for each corresponding diff. time points and their counts. Suggestion is to perform 2 way anova test. but problem is, i don't know the better way to prepare file for that in order to use R function like aov(). Help me to do so....

My raw counts (without normalization) like..

gene_name counts_value (for each condition and corresponding time points (36 sample in total) )

gene_1  2349
gene_2  3462380
gene_3  0
gene_4  123


  1. How to prepare (format of input) input file in order to use R functions
  2. is there any package (bioconductor or R) to perform this analysis
  3. may i have to normalize this raw counts before performing 2 way ANOVA
ADD COMMENTlink modified 5.0 years ago by Charles Warden6.1k • written 5.0 years ago by bioinforupesh2009.au100
gravatar for Devon Ryan
5.0 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

Given that these are raw counts, you really really don't want to use aov() or any other Anova test. What you want is to use a generalized linear model, which will respect the fact that these are integers rather than floats. Please have a read through the DESeq2 vignette, as that has facilities to directly deal with counts produced by htseq-count and is, in fact, written by the same authors.

ADD COMMENTlink written 5.0 years ago by Devon Ryan88k

Thank you sir for your kind advice but before analyzing differential expression gene analysis, I just wanted to use 2 way ANOVA test in order to investigate and model the relationship between a response variable and predictor variables. Just short of preliminary analysis.

and About DESeq2, indeed i am used to with this amazing package, not for time series analysis but for another standardized experiments (2 conditions and their replicates).

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by bioinforupesh2009.au100

You still don't want an ANOVA then. ANOVA's require continuous variables, which you don't have (you have integer counts). You could read.delim() the count files in, cbind() those to create a matrix (make sure to remove the last 5 lines) and then use glm.nb() from MASS. You could instead use a glmm if you really wanted. Either of those would be better ideas than an ANOVA.

ADD REPLYlink written 5.0 years ago by Devon Ryan88k
gravatar for Charles Warden
5.0 years ago by
Charles Warden6.1k
Duarte, CA
Charles Warden6.1k wrote:

I agree with dpryan79's concern about counts, but I think your strategy should be OK if you are working with normalized expression (like RPKM/FPKM). Once you use a tool like cufflinks, RSEM, etc., you should be OK. I think you can also use DESeq to produce normalized expression values, but that's not typically what I do. This is similar to what I do in the sRAP package:

If your institution has a license, you could use the RSEM-like mRNA quantification tool in Partek and pretty much follow the same workflow as described in this paper (except the pairing variable specifies timepoints instead of tumor-normal pairs):

That said, your ANOVA-based design will be asking what genes consistently change across time points. If you are trying to ask a different sort of question, you may want to take a look at some of the tools available in the timecourse package:

ADD COMMENTlink written 5.0 years ago by Charles Warden6.1k

Thank your sir for your kind suggestion to use diff packages in R. i will surely check. But for your first para i can say, i have raw counts from directly after mapping. Moreover, this is not a RNAseq instead miRNA short reads. So i am not suppose to use RPKM/FPKM normalized expression counts. However, i am still confused that may i have to normalization of this raw counts before ANOVA or Negative Binomial Generalized Linear Model (as suggested by dpryan79).

Thank you once again for your time and advice

ADD REPLYlink written 5.0 years ago by bioinforupesh2009.au100

You don't need to normalize for gene length, but you do need to normalize for the number of reads per library.

For miRNA-Seq data, I would use CPM (count per million reads).

ADD REPLYlink written 5.0 years ago by Charles Warden6.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1921 users visited in the last hour