Question

Adding normally distributed noise to a data frame

1

9.4 years ago

adnanjaved1988 ▴ 80

Hey all, I want to generate in silico replicates for my samples. The expression values from 5 samples are stored in a data frame. I wrote a function that adds noise to my data, which seems reasonable, but when I look at the summary of the replicates, the median values are higher than in the original data.

> summary(noise1)
       A                  B                  C
 Min.   :   13.86   Min.   :   12.09   Min.   :   12.37
 1st Qu.:  104.79   1st Qu.:  129.96   1st Qu.:   97.08
 Median :  177.39   Median :  205.46   Median :  177.97
 Mean   :  434.74   Mean   :  631.65   Mean   :  424.75
 3rd Qu.:  246.47   3rd Qu.:  309.99   3rd Qu.:  249.52
 Max.   :29194.97   Max.   :34088.19   Max.   :35463.34
       D                  E
 Min.   :   17.77   Min.   :   18.4
 1st Qu.:  130.78   1st Qu.:  115.9
 Median :  210.35   Median :  191.9
 Mean   :  551.05   Mean   :  346.9
 3rd Qu.:  294.92   3rd Qu.:  265.0
 Max.   :29059.16   Max.   :16107.5

My original data:

> summary(d1)
       A                   B                  C
 Min.   :    3.831   Min.   :    5.33   Min.   :    4.58
 1st Qu.:    9.248   1st Qu.:   22.50   1st Qu.:    9.55
 Median :   19.387   Median :   48.73   Median :   18.49
 Mean   :  306.192   Mean   :  507.19   Mean   :  298.91
 3rd Qu.:   62.902   3rd Qu.:  164.72   3rd Qu.:   68.70
 Max.   :29062.144   Max.   :33955.38   Max.   :35251.71
       D                   E
 Min.   :    5.454   Min.   :    5.454
 1st Qu.:   18.021   1st Qu.:   20.747
 Median :   40.747   Median :   32.089
 Mean   :  419.879   Mean   :  217.255
 3rd Qu.:  138.000   3rd Qu.:   81.630
 Max.   :29010.870   Max.   :15940.595

Function:

n <- data.frame(d1)

addNoise <- function(mtx)
{
  # if (!is.matrix(mtx)) mtx <- matrix(mtx, byrow = TRUE, nrow = 1)
  # Draw one uniform value between 5 and 250 per cell and add it to the data
  random.stuff <- matrix(runif(prod(dim(mtx)), min = 5, max = 250), nrow = dim(mtx)[1])
  random.stuff + mtx
}

noise1 <- addNoise(mtx = n)

Second Question

If I want to add noise by first taking the mean of each row and then adding normally distributed noise within a reasonably small interval, can anybody suggest how I can do that?

Best

Adnan Javed

R • 13k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by adnanjaved1988 ▴ 80

0

Look at your data before you do this. It's very skewed. Wouldn't it be more appropriate to multiply the geometric mean of your data by a lognormal variable?
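A minimal sketch of that suggestion, under stated assumptions: `expr` is a hypothetical toy matrix (genes in rows, samples in columns), `make_replicates()` is an illustrative helper, and the log-scale standard deviation `sdlog = 0.1` is an arbitrary choice, not a recommendation. Because the noise is multiplicative, replicates stay positive and the spread scales with the size of each value.

```r
set.seed(1)

# Hypothetical skewed expression matrix: 1000 genes x 5 samples
expr <- matrix(rlnorm(1000 * 5, meanlog = 4, sdlog = 1.5), ncol = 5,
               dimnames = list(NULL, LETTERS[1:5]))

# Geometric mean of each row (gene); requires strictly positive values
geo_mean <- exp(rowMeans(log(expr)))

# Illustrative helper: geometric mean times lognormal noise, one column per replicate.
# meanlog = 0 gives a multiplicative factor whose median is 1.
make_replicates <- function(center, n_rep = 3, sdlog = 0.1) {
  noise <- matrix(rlnorm(length(center) * n_rep, meanlog = 0, sdlog = sdlog),
                  ncol = n_rep)
  center * noise  # vector recycling multiplies row i by center[i]
}

reps <- make_replicates(geo_mean)
summary(reps)
```

Whether a single `sdlog` for all genes is realistic depends on the data; it is only a starting point.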

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by russhh 5.7k

Ram · Answer 1 · 2014-12-08

3

9.4 years ago

Devon Ryan 104k

If you want normally distributed noise, then use rnorm() rather than runif(), which is noise from a uniform distribution.
Just convert to a matrix and apply() a function like:
```
function(x) {
  m <- mean(x)
  rnorm(some_number, m, 0.1*m)
}
```
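The result will be some_number replicates per row, each drawn from a normal distribution centred on that row's mean. Note that apply() over rows returns one column per input row, so transpose the result with t() if you want genes in rows again.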
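As a rough worked example of the snippet above: the matrix `d1` below is made-up toy data (4 genes by 5 samples), `n_rep` stands in for `some_number`, and the 0.1 multiplier is just the illustrative value from the answer.

```r
set.seed(42)

# Hypothetical data: 4 genes (rows) x 5 samples (columns)
d1 <- matrix(c(  10,   12,   11,    9,   13,
                200,  180,  220,  210,  190,
               5000, 5200, 4800, 5100, 4900,
                 50,   55,   45,   52,   48),
             nrow = 4, byrow = TRUE)

n_rep <- 3  # stands in for some_number

# For each row, draw n_rep values around the row mean with sd = 10% of the mean
noisy <- apply(d1, 1, function(x) {
  m <- mean(x)
  rnorm(n_rep, mean = m, sd = 0.1 * m)
})

# apply() returns one column per input row; transpose back to genes x replicates
replicates <- t(noisy)
dim(replicates)
```

The replicates scatter around the row means rather than matching them exactly, and large rows receive proportionally larger noise.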

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Devon Ryan 104k

0

Hey Devon, thanks for your reply, but if I use rnorm() then I can't set a range (min and max) for the noise. The minimum value I want to add is 5 and the maximum is 250.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by adnanjaved1988 ▴ 80

1

If you want to have a minimum and a maximum, then it's not 'normal' noise, because a bell curve has no minimum or maximum. If you add values between 5 and 250, then of course your mean and median will increase: you've added only positive values.
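To make the size of that shift concrete, here is a small sketch on hypothetical skewed data (`x` is simulated, not the poster's data). Uniform noise on [5, 250] has expectation (5 + 250) / 2 = 127.5, which is in line with the roughly 124 to 131 increase in the column means between the two summaries in the question.

```r
set.seed(7)

# Hypothetical skewed data standing in for one sample
x <- rlnorm(10000, meanlog = 3.5, sdlog = 1.5)

# Same noise as in the question: uniform between 5 and 250
noisy <- x + runif(length(x), min = 5, max = 250)

# Expected shift of the mean is (5 + 250) / 2 = 127.5
mean(noisy) - mean(x)

# Every value grew by at least 5, so the minimum and the median rise as well
c(min = min(noisy) - min(x), median = median(noisy) - median(x))
```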

So, what do you want to do? It sounds like no error exists and everything is working. Devon's answer is correct.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by karl.stamm 4.1k

0

Hey, what I want to do is generate in silico replicates for my data set by adding normally distributed noise. If I go with Devon's option, the max value in each sample is exceeded; in the original samples the max value is around 35000.

But using that method I get:

> summary(noise1)
       A                  B                  C                  D
 Min.   :   10.21   Min.   :   10.69   Min.   :   10.32   Min.   :   10.25
 1st Qu.:   29.42   1st Qu.:   41.70   1st Qu.:   28.95   1st Qu.:   37.59
 Median :   53.09   Median :   80.84   Median :   51.01   Median :   72.72
 Mean   :  654.68   Mean   :  861.66   Mean   :  643.63   Mean   :  769.64
 3rd Qu.:  177.72   3rd Qu.:  283.56   3rd Qu.:  186.00   3rd Qu.:  259.88
 Max.   :48738.73   Max.   :60878.37   Max.   :58708.16   Max.   :47691.94
       E
 Min.   :   11.30
 1st Qu.:   40.13
 Median :   65.26
 Mean   :  578.18
 3rd Qu.:  200.18
 Max.   :41616.14
> summary(d1)
       A                   B                  C                  D
 Min.   :    3.831   Min.   :    5.33   Min.   :    4.58   Min.   :    5.454
 1st Qu.:    9.248   1st Qu.:   22.50   1st Qu.:    9.55   1st Qu.:   18.021
 Median :   19.387   Median :   48.73   Median :   18.49   Median :   40.747
 Mean   :  306.192   Mean   :  507.19   Mean   :  298.91   Mean   :  419.879
 3rd Qu.:   62.902   3rd Qu.:  164.72   3rd Qu.:   68.70   3rd Qu.:  138.000
 Max.   :29062.144   Max.   :33955.38   Max.   :35251.71   Max.   :29010.870
       E
 Min.   :    5.454
 1st Qu.:   20.747
 Median :   32.089
 Mean   :  217.255
 3rd Qu.:   81.630
 Max.   :15940.595

Is it possible to generate in silico replicates whose max and min values stay within the range of the original data?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by adnanjaved1988 ▴ 80

1

With Devon's solution, the width of the normal noise (the standard deviation of the bell curve) was set to ten percent of the mean. That's probably a reasonable scale, so your small values don't move too much, and it explains why the largest values fluctuated the most.

If you want to cap the total, just divide the final results by a factor. If you multiply column A by about 0.6 (29/49, the original max over the new max), your resulting max will be back at the original point, but the rest of the data will get a little squashed. Again, it comes down to what you want to do. Simulating in silico data can mean anything, and depending on your goals there are many methods. The noise you add here won't match the kinds of noise created by the experiment anyway.
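The rescaling idea can be sketched as follows, using simulated stand-ins (`original` and `noisy_rep` are hypothetical vectors, and the 10% noise level is again only illustrative). It also shows the side effect mentioned above: every other value is shrunk by the same factor.

```r
set.seed(3)

# Hypothetical original column and a noisy replicate of it
original  <- rlnorm(1000, meanlog = 4, sdlog = 1.5)
noisy_rep <- rnorm(length(original), mean = original, sd = 0.1 * original)

# Factor that brings the replicate's maximum back to the original maximum
scale_factor <- max(original) / max(noisy_rep)
capped <- noisy_rep * scale_factor

c(original = max(original), replicate = max(noisy_rep), capped = max(capped))

# All values are multiplied by the same factor, so the median moves too
median(capped) / median(noisy_rep)
```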

Is it really so bad that column B's max has doubled?

Another way I have made fake replicates is by going back to the raw data and only using half of it. You can compute coverage of NGS reads on a subset of the data over and over, and get realistic randomness regarding the small values' sampling error, but the major values won't change much.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by karl.stamm 4.1k

0

karl.stamm has replied with everything that I would have said. One thing that I'll add is that the "0.1 times mean" standard deviation was just an example, I wouldn't literally do that! A better solution would be to first determine if there's a mean-dispersion relationship (if this is starting to sound like voom(), then there's a reason for that) and, if so, use it as the input in rnorm(). If not, just use whatever constant fits as a scalar. Of course sometimes the max/min will exceed the bounds of the original data...that's a feature, not a bug.
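One rough way to sketch the "mean-dispersion relationship" idea in base R is shown below. This is not voom() and not a substitute for it; it simply fits a straight line to log(sd) against log(mean) across rows of a simulated matrix (`expr` is hypothetical) and feeds the fitted standard deviation into rnorm(). It also assumes the across-sample spread is a usable proxy for replicate noise, which will overstate the noise when the samples differ biologically.

```r
set.seed(11)

# Hypothetical genes x samples matrix whose spread grows with the mean
mu   <- rlnorm(500, meanlog = 4, sdlog = 1.5)
expr <- mu * matrix(rlnorm(500 * 5, meanlog = 0, sdlog = 0.2), ncol = 5)

row_mean <- rowMeans(expr)
row_sd   <- apply(expr, 1, sd)

# Simple log-log linear trend of sd on mean (an assumption; the real trend may be curved)
fit    <- lm(log(row_sd) ~ log(row_mean))
sd_hat <- exp(fitted(fit))

# Draw replicates around each row mean using the trend-based sd
n_rep <- 3
reps <- matrix(rnorm(length(row_mean) * n_rep, mean = row_mean, sd = sd_hat),
               ncol = n_rep)

coef(fit)
```

If the fitted slope is close to zero, a single constant standard deviation is probably enough, as the reply suggests.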

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

score 0 · Answer 2 · 2014-12-09

Thanks, both of you. This helps a lot. ;)

@Karl, this is microarray data, and yes, before this clarification I was a bit confused about why, after adding normally distributed noise, some samples exceed the max/min bounds of the original data. Yes Devon, you are absolutely right, it's a feature, not a bug. I agree. :)

Ram · Answer 3 · 2014-12-09

0

9.4 years ago

adnanjaved1988 ▴ 80

One more question: I want to generate in silico replicates of my data. Should I use normalized data or raw data for this purpose (raw data means before quantile normalization), and then normalize the replicates after generating them?

What do you suggest? If I generate replicates from normalized data (using the normalizeQuantiles() function), then all replicates would have the same median, first and third quartiles, and mean. If I instead start from the raw data and normalize afterwards, I would be doing the same thing: at the end I will have the same first and third quartiles and mean after normalization.

My question is: would it make any difference to the values if I use normalized samples for generating replicates? Would it have any influence on the values I am generating?

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by adnanjaved1988 ▴ 80

0

Sorry for the delay, I was out of town. In the grand scheme of things it doesn't much matter if you generate pre- or post-normalization samples. If you generate the replicates from pre-normalized data then you'll need to normalize them along with the real samples. Realistically, just do whichever seems simpler, since the results should be essentially the same.
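For intuition on why the per-sample summaries coincide after quantile normalization, here is a bare-bones base-R sketch. `quantile_normalize()` is a hypothetical helper written for this illustration; it is not limma's normalizeQuantiles(), which may treat ties and missing values differently. The matrix `raw` is simulated.

```r
set.seed(5)

# Hypothetical raw matrix: 200 genes x 4 samples on different scales
raw <- sapply(c(1, 1.5, 2, 3), function(s) s * rlnorm(200, meanlog = 4, sdlog = 1))

quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank of each value within its column
  sorted <- apply(m, 2, sort)                         # each column sorted ascending
  target <- rowMeans(sorted)                          # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])       # replace each value by the target at its rank
  dimnames(out) <- dimnames(m)
  out
}

normalized <- quantile_normalize(raw)

# Every column now holds the same set of values, only in a different order
summary(normalized)
```

Since every column ends up with an identical distribution, real and simulated samples have to be normalized together for them to share it.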

BTW, I don't think any of us have asked you exactly why you want to do all of this...though we probably should have. Just keep in mind that generating fake samples like this is often inappropriate, so use with caution!

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

Ram · Answer 4 · 2014-12-10

Devon, you are right, generating fake samples is of no use. I know. But I don't have replicates for my data. After generating in silico replicates I want to perform:

an ANOVA test again for my samples (original and replicates)

hierarchical clustering, assessing the significance of the clustering

support vector machine analysis, including only one replicate for each sample.

Then in future I can use those R scripts for the same samples once they have replicates.