Question: ANOVA and Principal Component Regression
6.2 years ago by
adnanjaved1988 wrote:

I just need your valuable suggestions.

This is how my data frame looks. These are background-subtracted values from 5 samples of microarray data.

  • A is the parent sample.
  • B, C, D and E are treatments. Among the treatments, B is the sample that is resistant to the drugs applied to it.

There are no duplicate miRNAs within the 5 samples, so instead of writing the miRNA names for every sample I wrote them just once. Each of the 5 samples has 2019 rows, one row per miRNA, and the expression value for a given miRNA differs from sample to sample.

                                          A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p         NA 13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p   15.70536 52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p       NA 21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c     9.58696 44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p 18.06865 28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p         NA 10.77471  8.039662        NA       NA

For ANOVA I reshaped my data frame into long format with the melt function from the reshape package:

                                 MiRNAs                Group    value
1                  hsa-miR-199a-3p, hsa-miR-199b-3p     A       NA
2                  hsa-miR-365a-3p, hsa-miR-365b-3p     A 15.70536
3 hsa-miR-3689a-5p, hsa-miR-3689b-5p, hsa-miR-3689e     A       NA
4                   hsa-miR-3689b-3p, hsa-miR-3689c     A  9.58696
5                hsa-miR-4520a-5p, hsa-miR-4520b-5p     A 18.06865
6                  hsa-miR-516b-3p, hsa-miR-516a-3p     A       NA
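
The reshaping step can be sketched in base R, which avoids the dependency on the reshape package. This is a minimal sketch, not the poster's actual code: the three-row table and the shortened row names are placeholders, and only the numeric values are copied from the excerpt above.

```r
# Toy wide table: rows are miRNAs, columns are samples.
# Row names are shortened placeholders, not the real probe names.
wide <- data.frame(
  A = c(NA, 15.70536, NA),
  B = c(13.13892, 52.86558, 21.41597),
  C = c(5.533703, 18.467540, 5.964772),
  row.names = c("miR-199", "miR-365", "miR-3689")
)

# Stack the columns so that each row is one (miRNA, sample) pair.
# as.matrix() + as.vector() reads the values column by column,
# which matches the rep() patterns used for the two label columns.
m <- data.frame(
  MiRNAs = rep(rownames(wide), times = ncol(wide)),
  Group  = factor(rep(colnames(wide), each = nrow(wide))),
  value  = as.vector(as.matrix(wide))
)

print(m)
```

The result has the same three columns (MiRNAs, Group, value) as the melted table shown above, so it can be passed to aov() in the same way.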

This gives 2019 rows for sample A, 2019 rows for sample B, and so on. I then ran:

ANOVA1<-aov(m$value~m$Group) and then TukeyHSD 

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = m$value ~ m$Group)

          diff        lwr       upr     p adj
B-A   73.87304  -88.20262 235.94869 0.7256734
C-A  -25.55832 -196.36413 145.24749 0.9941714
D-A  203.80312   20.26110 387.34514 0.0207431
E-A   41.04993 -159.09661 241.19648 0.9807637
C-B  -99.43136 -258.28853  59.42581 0.4290920
D-B  129.93008  -42.54789 302.40805 0.2398572
E-B  -32.82310 -222.87472 157.22851 0.9899165
D-C  229.36144   48.65517 410.06771 0.0048776
E-C   66.60826 -130.94103 264.15755 0.8892989
E-D -162.75319 -371.41264  45.90627 0.2081150
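
The call sequence that produces output of this shape can be sketched as follows. The data here are simulated (log-normal values with a few NAs) purely so that the example runs; they are not the poster's measurements, and the numbers printed will not match the table above.

```r
set.seed(1)  # simulated values only; not the poster's data

# Hypothetical long-format table: 5 samples x 40 features.
m <- data.frame(
  MiRNAs = rep(paste0("miR-", 1:40), times = 5),
  Group  = factor(rep(c("A", "B", "C", "D", "E"), each = 40)),
  value  = rlnorm(200, meanlog = 3, sdlog = 1)
)
m$value[sample(200, 20)] <- NA  # sprinkle in some missing values

# Passing data = m keeps the term name clean ("Group" rather than "m$Group").
# Rows with NA are dropped under R's default na.action (na.omit).
fit <- aov(value ~ Group, data = m)
print(summary(fit))

# All pairwise differences between group means, with a family-wise
# adjustment of the confidence intervals and p-values.
tuk <- TukeyHSD(fit, "Group")
print(tuk)
```

With five groups there are ten pairwise comparisons, which is why the output above has ten rows.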


My question is: do I need to perform ANOVA as control vs. treatment, or is the way I performed it correct? How can I perform Principal Component Regression for this data?

R • 1.9k views
modified 4.6 years ago by Biostar • written 6.2 years ago by adnanjaved1988

What question are you trying to answer with this data? It's highly unlikely that the Anova you performed will correctly answer any biological question you'd be interested in asking.

written 6.2 years ago by Devon Ryan

My question is: do I need to perform ANOVA as control vs. treatment, or is the way I performed it correct? Or do I need to exclude the control and just check the variability among the treatments to see which are most significant?

A is the control and B, C, D and E are treatments. B is resistant to the drugs used for the treatments.

modified 6.2 years ago by Devon Ryan • written 6.2 years ago by adnanjaved1988

If B, C, D and E are different treatments, which I assume is the case given what you've written, then you can't do an ANOVA (and the one you showed makes absolutely no sense; it's not even testing something coherent). Perhaps you can get limma to estimate dispersions in a group-blind manner and then use that in its linear model... but I expect the results will still be crappy. To be frank, you're largely wasting your time with this dataset.

written 6.2 years ago by Devon Ryan

I got curious... Why do you say that anova makes no sense?

aov(m$value~m$Group) tests whether any of the "Group" means in miRNA value is different from another. Tukey's test says that D vs A and D vs C are different.

(Maybe the model could be improved by nesting the error since there are treatments within miRNA, but I don't see it as non-sensical; also, whether it makes biological sense I don't know).

If, on the other hand, adnanjaved1988 is interested in which miRNAs are different, then yes, there is no way to go about it, as there is no replication within treatments.

modified 6.2 years ago • written 6.2 years ago by dariober

It makes no biological sense and is therefore nonsensical. Further, the background distribution is likely not even remotely Gaussian, and how should one even deal with the NAs in the dataset? (aov will just remove them... but that's not fair in this case, since we need to know why things are NA.) In the unlikely event that they're looking at, say, a dicer knockout or some other knockdowns of various components of the miRNA processing machinery, then asking generally about miRNA changes becomes more interesting. Then, however, the errors would need to be nested (or better yet, simply use a different test: measure the AUC for the miRNA peak on a bioanalyzer on multiple samples and then do statistics on that).

written 6.2 years ago by Devon Ryan

Actually, the main problem is that for each treatment there is just one sample analyzed. Although each sample is "measured" ~2000 times all you can say is that, e.g., sample D is different from C but you can't generalize to saying "Treatment D != C" since the difference might be due to that particular sample prep or the array etc. This strongly limits (invalidates?) the biological relevance of the analysis, I agree.

About NAs, I would be less worried if they are sparse and random (which might be the case?) and non-normality might be curable.
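
Whether the NAs are sparse and random, and whether a transformation helps with non-normality, can be checked with a few lines of base R. This is a rough sketch on the six rows shown in the question (row names replaced by placeholders). The log2 transform is one common choice, not something prescribed in this thread, and it assumes all observed values are positive.

```r
# The six rows excerpted in the question; row names are placeholders.
x <- data.frame(
  A = c(NA, 15.70536, NA, 9.58696, 18.06865, NA),
  B = c(13.13892, 52.86558, 21.41597, 44.56490, 28.06991, 10.77471),
  C = c(5.533703, 18.467540, 5.964772, 10.102051, NA, 8.039662),
  D = c(25.67405, 223.51424, NA, 13.26785, NA, NA),
  E = c(NA, 31.93503, 24.26073, NA, NA, NA),
  row.names = paste0("row", 1:6)
)

# How many values are missing per sample and per miRNA?
na_per_sample <- colSums(is.na(x))
na_per_row    <- rowSums(is.na(x))
print(na_per_sample)
print(na_per_row)

# If NAs cluster in rows with low observed signal, they are probably
# not missing at random (e.g. values below the detection limit).
row_mean <- rowMeans(x, na.rm = TRUE)
print(data.frame(n_missing = na_per_row, mean_observed = row_mean))

# Log transform to reduce the right skew typical of intensity data.
# Only valid for strictly positive values.
log_x <- log2(x)
print(round(log_x, 2))
```

On the full 2019-row table, a histogram or normal Q-Q plot of the transformed values (hist(), qqnorm()) would show whether the transformation is enough.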

@adnanjaved1988, for the record nesting can be specified like aov(m$value~m$Group + Error(m$Group / m$MiRNAs)); but again, be careful about the interpretation.
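
A related way to write a stratified error term is sketched below. It assumes each miRNA is treated as a block that is measured once in every sample, which is a different specification from the Group / MiRNAs form in the comment above, and it uses simulated, complete data so that it runs as is. Which error structure is appropriate for the real data is a judgment call that this sketch does not settle.

```r
set.seed(42)  # simulated values only; not the poster's data

n_mir  <- 30
groups <- c("A", "B", "C", "D", "E")

# One row per (miRNA, sample) pair; expand.grid() returns factors.
m <- expand.grid(MiRNAs = paste0("miR-", seq_len(n_mir)), Group = groups)

# Each miRNA gets its own baseline level (the block effect) plus noise.
mir_effect <- rnorm(n_mir, sd = 2)
m$value <- 5 + mir_effect[as.integer(m$MiRNAs)] + rnorm(nrow(m))

# Error(MiRNAs) puts between-miRNA variation in its own stratum, so the
# Group effect is tested against the within-miRNA residual variation.
fit <- aov(value ~ Group + Error(MiRNAs), data = m)
print(summary(fit))
```

The summary is printed per error stratum; with this layout the Group term appears in the "Within" stratum. As noted above, a significant Group term still only says that these particular samples differ, not that the treatments do.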

written 6.2 years ago by dariober

Excellent point.

written 6.2 years ago by Devon Ryan

Hey Dariober, thanks for your comment.

The main purpose of this study is to see how miRNA expression levels change with the treatments applied. These samples are from patients of the institute where I work, and they want to see miRNA expression levels in exosomes of breast cancer patients.

The array they used had 2019 miRNAs, so they want to see which treatment (combination of drugs) causes differential expression of those miRNAs.

I have a parent cell line that shows the normal expression, and when they applied drugs to the other cell lines the expression in the other groups definitely changed; some showed a significantly large change in expression. So I am doing these tests to see which group differs from which.

My main role is to use specific miRNAs from my data set as biomarkers for cancer identification.

No offense, but I don't know why it makes no sense to Mr Devon Ryan. By the way, I have seen him write this sentence ("It makes no sense") in many posts :D Anyway,

Best, Adnan

written 6.2 years ago by adnanjaved1988

Hey, thanks Dariober :)

Can you suggest how I can improve my model by nesting the error?

written 6.2 years ago by adnanjaved1988
6.2 years ago by
adnanjaved1988 wrote:
To handle NAs, I first removed the rows of my data frame that had 4 or 5 NAs.

For the remaining NAs I used three methods, to see what differences I get between them:

  • Assigning row means (which is OK but not very powerful, because you are not getting any new information).
  • Finding a miRNA that is overall highly correlated with the miRNA having the missing value, and taking a value derived from that miRNA. Example:

miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10
==> replace the missing value by 4, derived from the second miRNA

  • The "K-nearest neighbours" (KNN impute) method for imputation.
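
The first two methods can be sketched in base R using the two example miRNAs above. Fitting a straight line with lm() is one way to "derive" the value from the correlated miRNA; it is an assumption made for this sketch, not necessarily what was done on the real data. KNN imputation is left out because it needs an add-on package.

```r
mir1 <- c(1, 2, 3, NA, 5)   # miRNA with a missing value
mir2 <- c(2, 4, 6, 8, 10)   # highly correlated miRNA, fully observed

# Method 1: fill the gap with the mean of the observed values in the row.
row_mean_fill <- ifelse(is.na(mir1), mean(mir1, na.rm = TRUE), mir1)
print(row_mean_fill)

# Method 2: regress miRNA-1 on miRNA-2 using the complete pairs
# (lm() drops the pair with the NA), then predict the missing entry.
fit  <- lm(mir1 ~ mir2)
miss <- is.na(mir1)
reg_fill <- mir1
reg_fill[miss] <- predict(fit, newdata = data.frame(mir2 = mir2[miss]))
print(reg_fill)
```

In this example the row-mean method fills in 2.75 (the mean of 1, 2, 3 and 5), while the regression method recovers 4 because miRNA-1 is exactly half of miRNA-2.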
written 6.2 years ago by adnanjaved1988