Question: Making a Model Matrix for Anova-like analysis
0
gravatar for ygowtha
4.8 years ago by
ygowtha20
United States
ygowtha20 wrote:

Hello All

I have 2 mammalian cell lines: A and B. A is subjected to 3 treatment conditions X,Y,Z. B is subjected to treatment conditions X and Y. So my RNA-Seq data has counts from AX, AY, AZ, BX and BY. I want to perform anova analysis to screen for differentially regulated genes. 

I am trying to analyze my data using edgeR by following the users guide. Since I am completely new to this, it is not very intuitive to me. I want to make a model matrix to perform the anova-like comparison test to screen for genes that are different across all 5 conditions. 

My code is below: Commands are in BOLD.

#count data

                  AX1  AX2 AY1  AY2 AZ1 AZ2  BX1  BX2 BY1 BY2
rna22510      0      0      0      0      0      0      0      0      0      0
rna22511      0      0      0      0      0      0      0      0      0      0
rna22512      0      0      0      0      0      0      0      0      0      0
rna22513      0      0      0      0      0      0      0      0      0      0
rna22514      0      0      0      0      0      0      0      0      0      0
rna22515      0      0      0      0      0      0      0      0      0      0

#I normalize the data and calculate different dispersions. But I feel my main issue is trying to define the matrix to suit my needs.

 

> design<-model.matrix(~0+group, data=cds$samples)
> design

#cds is a DGEList object. 

#output

       groupH_05 groupH_10 groupP_00 groupP_05 groupP_10
P_10 1         0         0         0         0         1
P_10 2         0         0         0         0         1
P_05 1         0         0         0         1         0
P_05 2         0         0         0         1         0
P_00 1         0         0         1         0         0
P_00 2         0         0         1         0         0
H_10 1         0         1         0         0         0
H_10 2         0         1         0         0         0
H_05 1         1         0         0         0         0
H_05 2         1         0         0         0         0

> cds<-estimateGLMCommonDisp(cds,design)

> cds<-estimateGLMTrendedDisp(cds,design)
Loading required package: splines
> cds<-estimateGLMTagwiseDisp(cds,design)
> fit2<-glmFit(cds,design)
> lrt.anova_like<-glmLRT(fit2,coef=2:5)

> lrt.anova_like

An object of class "DGELRT"
$coefficients
       groupH_05  groupH_10  groupP_00  groupP_05  groupP_10
gene1 -12.722907 -12.231696 -11.460983 -11.881307 -11.863651
rna1  -10.723691 -11.950632 -11.671753 -11.126594 -11.035141
rna2   -9.496628  -9.295025  -9.801454  -9.722280  -9.247496
rna3   -8.249089  -8.176574  -8.175494  -8.220004  -8.093095
rna4  -10.855702 -10.826411 -10.816899 -11.020586 -10.697726
12745 more rows ...

> top<-topTags(lrt.anova_like)

> top

Coefficient:  groupH_10 groupP_00 groupP_05 groupP_10 
         logFC.groupH_10 logFC.groupP_00 logFC.groupP_05 logFC.groupP_10
rna111         -26.24556       -19.80748       -21.22361       -21.59235
rna366         -26.24556       -16.69825       -17.77939       -17.69264
rna498         -26.24556       -18.59508       -19.06105       -19.82423
rna735         -26.24556       -17.55049       -26.24556       -26.24556
rna1587        -26.24556       -17.97298       -26.24556       -26.24556
rna2000        -26.24556       -14.89324       -26.24556       -26.24556
gene2467       -26.24556       -23.90012       -19.09330       -26.24556
rna2338        -26.24556       -20.38091       -21.21060       -20.19601
rna2962        -26.24556       -21.82490       -21.59319       -22.53018
rna3147        -26.24556       -19.25034       -18.93527       -21.38125
              logCPM        LR PValue FDR
rna111   -0.76528954  9870.499      0   0
rna366    2.06883064 36241.154      0   0
rna498    0.34196368 16723.335      0   0
rna735    0.31486174  3362.217      0   0
rna1587  -0.06429664  3358.159      0   0
rna2000   2.74046265  2598.921      0   0
gene2467 -0.69390328 13430.350      0   0
rna2338  -0.79058544  2893.519      0   0
rna2962   0.06280098  3613.531      0   0
rna3147   0.05144112 10288.125      0   0

In the above output, the P-value of the toptags is zero and even the FDR is zero which makes me think that my output is wrong. 

I would appreciate any help that you could provide me. 

 

Thanks

 

anova edger rna-seq statistics R • 3.3k views
ADD COMMENTlink modified 4.8 years ago by Devon Ryan88k • written 4.8 years ago by ygowtha20

Your paramaterization is mostly useful for looking at contrasts, which you don't seem to be doing. The p-value will be ~0 for anything that's expressed, which is unlikely to be the question you want to answer.

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Devon Ryan88k

Thank you for your reply Devon. The reason I was worried about p-value is because I have noticed really small p-values but not a zero. Why do you say that I am not doing contrasts, I thought glmLRT(fit2, coef=2:5) was supposed to do that.

Could you also suggest a method/changes that I can make to my code to obtain anova like analysis. Basically I want to look for genes that are differentially expressed without specifying the groups beforehand (like mentioned in the users guide). And it looks like it based on making contrasts too. 

I am very new to programming and bioinformatics, so any help would be appreciated. 

ADD REPLYlink written 4.8 years ago by ygowtha20

I missed the glmLRT(fit2, coef=2:5) line, but that's probably not telling you anything of biological interest. Why don't you just say the biological question that you wish to address?

 

Edit: Actually, I think you need an intercept column for glmLRT(fit2, coef=2:5) to make sense. Otherwise, you probably do end up testing expression.

Edit2: In fact, this is mentioned in the edgeR UsersGuide, section 3.2.6. Regarding a 1-way ANOVA-like test, "... this is done by specifying multiple coefficients to glmLRT, when the design matrix includes an intercept term."

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Devon Ryan88k

Sure. So basically I have two chinese hamster ovary cells lines. I am trying to study the transcriptome changes during serum-free adaptation. Typically 10% serum is added to media to mammalian cell cultures and they have to be adapted to serum-free media. So I identified three different serum concentrations where I want to study RNA of the cells - 10%, 5% and 0%. Only one cell line (A) was able to adapt completely, hence the other cell line(B) has only two different serum concentrations (10%, 5%). 

What I was thinking was to first do an ANOVA analysis to see genes that are differentially regulated since doing a pairwaise comparison initially would generate a lot of false positives. After the anova analysis I would like to do pairwise comparison between meaningful conditions. 

ADD REPLYlink written 4.8 years ago by ygowtha20
3
gravatar for Devon Ryan
4.8 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

As I alluded to above, this is due to not having an intercept. If you simply changed your design to design<-model.matrix(~group, data=cds$samples) then the results should be correct.

Having said that, you have to remember that doing a test and then using that as a filter on a subsequent test of the same dataset is often not statistically justifiable. In general, you're better off just testing things that make biological sense, such as group A at 5% vs group B at 5%. I wouldn't be too worried about the false positive rate, that's the whole point behind compensating for multiple tests. The bigger issue, in fact, is generally decreased power (i.e., more false negatives), which is much of the point behind the genefilter paper that I linked to above.

ADD COMMENTlink written 4.8 years ago by Devon Ryan88k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 852 users visited in the last hour