Excel: using quartiles & averageifs to combine probe values without outliers
2
0
Entering edit mode
4.2 years ago
RNAseqer ▴ 260

Hello all, I am looking for an excel wizard to help me with a messy task: I would like to take the average of values in one column for all those cells sharing an ID in another column BUT with the added stipulation that outliers are excluded based on the use of quartiles.

Here is an example of the spreadsheet:

A       B           C      D     
ABCA5   213353_at   5      5
ABCB1   209993_at   4      4
ABHD5   213805_at   5      38.33333333
ABHD5   213935_at   10     38.33333333
ABHD5   218739_at   100    38.33333333
ACSM3   205942_s_at 1      1000.5
ACSM3   210377_at   2000   1000.5

As you can see Column A has a geneID, column B a probe name (which is irrelevant to this task) and C has the 'expression value, which I have given false values here to make the outliers problem a bit more visually extreme. In column D I have filled in the average values using the following excel code:

=AVERAGEIFS(C:C,A:A,A2 )

I placed that in the first cell of column D then dragged it down to fill in all the appropriate values. What I would like to do now is amend this excel function so that it performs this task but takes advantage of the QUARTILE function to only use values in C to calculate an average if:

  1. The ID in column A for that value matches
  2. It falls within an upper and lower bound defined by 1.5 times the interquartile range.

I'm just having trouble managing all the nested if() functions and conditionals to do this and thought maybe someone with greater excel experience could offer a solution. Many thanks for any help you can provide!

excel outliers quartile averageifs matching • 1.1k views
ADD COMMENT
2
Entering edit mode
4.2 years ago
RNAseqer ▴ 260

Yes... I HAVE had run ins with that scoundrel Clippy before, and am well aware of the problems with excel. Taking your gentle hint I have gone and tried to solve the problem in R. Here is what I've come up with:

Adapting some R code from the links at the bottom I have done the following:

> toy
      A           B    C    D
1 ABCA5   213353_at    5    6
2 ABCB1   209993_at    4    4
3 ABHD5   213805_at    5    6
4 ABHD5   213935_at   10   11
5 ABHD5   213666_at   12   12
6 ABHD5   218739_at 1000   75
7 ACSM3 205942_s_at    1 1000
8 ACSM3   210377_at 2000 1500

Is my data

myFunction3 = function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    mean(y, na.rm = TRUE)
}

Is my function, called by:

newtable <- aggregate(toy[, -c(1,2)],
                      by = list(A = toy$A),          #Here A could/would be the column header for the probe labels
                      FUN = myFunction3,
                      na.rm = TRUE)
newtable

Which results in:

> newtable   
      A      C           D
1 ABCA5    5.0    6.000000
2 ABCB1    4.0    4.000000
3 ABHD5    9.0    9.666667
4 ACSM3 1000.5 1250.000000

Do you see any errors or anything I have missed? If so please let me know. One limitation I'd I might need to overcome is that I'm going to have a fair number of genes with only 3 probes I think, so the outlier criterion I'm using here might need to be changed in the cases where there are say expression values like: 10, 11, 320. Also, in the case of column C for ACSM# there are only two values 1 and 2000 which are obviously wildly different and maybe need to be excluded? But this approach does not detect that potential issue.

From stack overflow: https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset

From Biostars (Is this appropriate linking within the site?)

using R aggregate command/combining affy probes into gene IDs

ADD COMMENT
0
Entering edit mode

For the sake of completeness, in case anyone else digs up this post and finds it useful, here are some additional methods I wound up trying besides the use of quartiles shown above:

Taking the Standard deviation and dropping values more than 2* the Standard deviation from the mean (this method may be problematic since the outlier itself affects the mean):

myFunction4 = function(x, na.rm = TRUE, ...) {

    rawMean <- mean(x, na.rm = TRUE)
    sds <- sd(x, na.rm = TRUE)

    lower_cutoff = rawMean - (2*sds)    
    upper_cutoff = rawMean + (2*sds)

    q <- x
    q[x < lower_cutoff ] <- NA
    q[x > upper_cutoff] <- NA

    #return a mean for all values that are not outliers
    mean(q, na.rm = TRUE)   

}

The MAD method (Median Absolute Deviation):

myFunction5 = function(x, na.rm = TRUE, ...) {
    medianX <- median(x, na.rm = FALSE)
    #print (medianX)


    MAD <- median(abs(x-medianX))
    #print(MAD)

    Mi <- (.6745 *(x-medianX))/MAD
    #print(Mi)

    q <- x
    q[abs(Mi) > 3.5] <- NA

    #return a mean for all values that are not outliers
    mean(q, na.rm = TRUE)   

}

Just taking the median of the values:

myFunction6 = function(x, na.rm = TRUE, ...) {
    median(x, na.rm = FALSE)

}

All of course implemented with the same block of code except for the changing of the function called by FUN=

newtable2 <- aggregate(toy[, -c(1,2)],
                      by = list(A = toy$A),          #Here A could/would be the column header
                      FUN = myFunction6,
                      na.rm = TRUE)

newtable2
ADD REPLY
1
Entering edit mode
4.2 years ago

Hello all, I am looking for an excel wizard to help me with a messy task:

You know what Excel does to gene names, right? Give it a try with Mar3 and Sep6.

The response I'd like to give is not quite safe for work, so I'll just advise you to google "bioinformatics clippy excel", and take note of the first two entries under Images.

ADD COMMENT

Login before adding your answer.

Traffic: 1457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6