Question

Remove duplicate genes with lower significance in microarray data analysis

0

4.7 years ago

Gene_MMP8 ▴ 240

I have performed a microarray data analysis using limma in R and I have a list of DEGs. Some gene symbols are repeated, and for each one I want to keep only the entry with the highest significance (lowest adjusted p-value). How can I do that in R?

I want to keep only unique genes, choosing the most significant entry whenever a symbol is duplicated. Below is the code I am trying; tT is the DEG table. This is only half of the code: I am trying to loop through all the symbols and, when I find a repetition, compare that row with its duplicates.

for(i in tT$SYMBOL){ if(length(which(tT$SYMBOL==i))>1){
index=tT[which(tT$SYMBOL[-c(i),7]==i),]  }

Really need some help. Thanks

R limma microarray RNA-Seq • 3.3k views

ADD COMMENT • link updated 4.7 years ago by Chirag Parsania ★ 2.0k • written 4.7 years ago by Gene_MMP8 ▴ 240

0

You're working on:

limma
differential expression
microarray

Yet the only tag used is R. Why is that?

ADD REPLY • link 4.7 years ago by Ram 43k

0

In other words we want "group by min", see related StackOverflow post:

Extract row corresponding to minimum value of a variable by group

ADD REPLY • link 4.7 years ago by zx8754 11k
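
For a limma-style table, "group by min" can be sketched in base R roughly as follows. The toy `tT` and its column names (`SYMBOL`, `adj.P.Val`) are assumptions made for illustration; substitute the names used in your own table.

```r
# Toy DEG table; symbols and values are made up for illustration
tT <- data.frame(
  SYMBOL    = c("TP53", "TP53", "MMP8", "BRCA1", "BRCA1"),
  adj.P.Val = c(0.04, 0.001, 0.02, 0.3, 0.05),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the minimum adj.P.Val within that row's SYMBOL group
group_min <- ave(tT$adj.P.Val, tT$SYMBOL, FUN = min)

# Keep the rows whose adj.P.Val equals their group minimum
best <- tT[tT$adj.P.Val == group_min, ]
print(best)
```

Note that this keeps every row that ties for the minimum, so a symbol can still appear more than once if two probes share the same adjusted p-value.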

0

Hi banerjeeshayantan,

I have the same problem. Did you fix it, and if so, how? Thanks

ADD REPLY • link 3.4 years ago by cagdas ▴ 10

score 2 · Answer 1 · 2019-08-02

2

4.7 years ago

AB ▴ 360

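Order your data frame by gene symbol and p-value, then drop the duplicated symbols. The first row kept for each symbol is then the most significant one.

# Sort by symbol, then by adjusted p-value (adj.P.Val is limma's topTable() name; adjust if yours differs)
tT <- tT[order(tT$SYMBOL, tT$adj.P.Val), ]
# Keep the first (most significant) row for each symbol
new_tT <- tT[!duplicated(tT$SYMBOL), ]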

ADD COMMENT • link 4.7 years ago by AB ▴ 360
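
The sort-then-deduplicate idea can be wrapped in a small helper and checked on a toy table, as sketched below. The function name `dedup_by_symbol`, its arguments, and the toy data are hypothetical. Dropping probes without a symbol is an extra assumption that the answer does not make, so remove that step if you want to keep them.

```r
# Hypothetical helper: keep the most significant row per gene symbol.
# Column names are arguments because they differ between pipelines.
dedup_by_symbol <- function(df, symbol_col = "SYMBOL", p_col = "adj.P.Val") {
  # Assumption: drop probes with no symbol annotation (NA or empty string)
  has_symbol <- !is.na(df[[symbol_col]]) & df[[symbol_col]] != ""
  df <- df[has_symbol, , drop = FALSE]
  # Sort by symbol, then ascending p-value, so the best row comes first per symbol
  df <- df[order(df[[symbol_col]], df[[p_col]]), , drop = FALSE]
  # duplicated() flags every occurrence after the first; discard those
  df[!duplicated(df[[symbol_col]]), , drop = FALSE]
}

# Toy input with repeated, missing and empty symbols (values are made up)
toy <- data.frame(
  SYMBOL    = c("TP53", NA, "MMP8", "TP53", "", "MMP8"),
  adj.P.Val = c(0.04, 0.01, 0.02, 0.001, 0.5, 0.2),
  stringsAsFactors = FALSE
)
result <- dedup_by_symbol(toy)
print(result)
```

Because `order()` sorts ascending, the smallest p-value lands first within each symbol, which is why `!duplicated()` keeps the right row.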

score 0 · Answer 2 · 2019-08-04

0

4.7 years ago

Chirag Parsania ★ 2.0k

See the toy example below.

library(tidyverse)

## toy expression data with duplicated values in column 1
set.seed(32323)
expr_data <- tibble(gene_id = sample(LETTERS[1:5] , 10 , replace = TRUE) , expr =  rnorm(10 ,mean = 10) ) %>% arrange(gene_id)

expr_data
#> # A tibble: 10 x 2
#>    gene_id  expr
#>    <chr>   <dbl>
#>  1 A        9.39
#>  2 B        9.43
#>  3 C        9.52
#>  4 C        9.80
#>  5 C       11.8 
#>  6 D        9.08
#>  7 D        8.76
#>  8 D        9.59
#>  9 E       11.4 
#> 10 E        9.40

## C, D and E are duplicated in column 1. 

## for ids duplicated in column 1, keep the observation with the highest value in column 2

expr_data %>% 
        group_by(gene_id) %>%  ## group by id column 
        dplyr::arrange(desc(expr)) %>% ## sort all rows high to low (arrange ignores groups by default)
        slice(1) ## get first row from each group
#> # A tibble: 5 x 2
#> # Groups:   gene_id [5]
#>   gene_id  expr
#>   <chr>   <dbl>
#> 1 A        9.39
#> 2 B        9.43
#> 3 C       11.8 
#> 4 D        9.59
#> 5 E       11.4

Created on 2019-08-04 by the reprex package (v0.3.0)

ADD COMMENT • link 4.7 years ago by Chirag Parsania ★ 2.0k
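
The toy example keeps the highest value per group, whereas the original question asks for the lowest adjusted p-value per symbol. A minimal sketch of that variant follows, assuming dplyr >= 1.0.0 (for `slice_min()`) and limma's default `topTable()` column names. The data are made up.

```r
library(dplyr)

# Toy DEG table; column names follow limma's topTable() output, values are made up
tT <- tibble(
  SYMBOL    = c("TP53", "TP53", "MMP8", "BRCA1", "BRCA1"),
  logFC     = c(1.2, 1.5, -2.0, 0.4, 0.9),
  adj.P.Val = c(0.04, 0.001, 0.02, 0.3, 0.05)
)

best <- tT %>%
  group_by(SYMBOL) %>%                                  # one group per gene symbol
  slice_min(adj.P.Val, n = 1, with_ties = FALSE) %>%    # lowest adj.P.Val, exactly one row
  ungroup()
print(best)
```

Setting `with_ties = FALSE` guarantees a single row per symbol even when two probes share the same adjusted p-value.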

0

I think a better logic would be group by followed by max instead of sort + slice(1). What do you think?

ADD REPLY • link 4.7 years ago by Ram 43k

1

Yes, true. More readable and less code. Thanks :)

Edit:

However, with max, if there is a tie, all matching rows will be returned. See the example:

iris  %>% as_tibble() %>% group_by(Species) %>% filter(Petal.Width == max(Petal.Width))
# A tibble: 5 x 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5           3.5          1.6         0.6 setosa    
2          5.9         3.2          4.8         1.8 versicolor
3          6.3         3.3          6           2.5 virginica 
4          7.2         3.6          6.1         2.5 virginica 
5          6.7         3.3          5.7         2.5 virginica

ADD REPLY • link 4.7 years ago by Chirag Parsania ★ 2.0k
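
When ties matter, one option is to make the choice deterministic with a secondary sort key. The sketch below breaks ties on adjusted p-value by the larger absolute log fold change. That tie-break rule, the toy data, and the column names are assumptions for illustration and were not proposed in the thread.

```r
# Toy DEG table in which the two TP53 probes tie on adj.P.Val (values are made up)
tT <- data.frame(
  SYMBOL    = c("TP53", "TP53", "MMP8", "MMP8"),
  logFC     = c(0.8, -1.9, 2.1, 0.3),
  adj.P.Val = c(0.01, 0.01, 0.03, 0.20),
  stringsAsFactors = FALSE
)

# Sort by symbol, then adj.P.Val ascending, then |logFC| descending as the tie-breaker
ord <- order(tT$SYMBOL, tT$adj.P.Val, -abs(tT$logFC))
sorted <- tT[ord, ]

# The first row per symbol is now the most significant, with ties resolved by effect size
best <- sorted[!duplicated(sorted$SYMBOL), ]
print(best)
```

Without the third key, which of the tied rows survives would depend on their original order in the table.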

0

Fair enough. Thanks for pointing out the use case for slice(1) :-)

ADD REPLY • link 4.7 years ago by Ram 43k