Question

Problem collapsing probes for the same gene in microarray gene expression data

5.7 years ago

modarzi ▴ 170

Hi, this may already have been asked by someone else, but I have the following microarray gene expression data in a data frame called 'myExprdat':

    ID          Gene symbol       Sample1    Sample2    Sample3    Sample4
1   1007_s_at   MIR4640///DDR1    108.38     321.8      66.72      19.43
2   1053_at     RFC2              121.1      148.06     306.55     242.19
3   117_at      HSPA6             107.63     59.71      163.14     24.42
4   121_at      CYP2E1            8.51       4.72       4.79       10.78
5   1255_g_at   GUCA1A            4.23       4.26       4.26       4.26
6   1294_at     MIR5193///UBA7    131.6      82.71      191.34     70.52
7   1316_at     THRA              9          8.17       8.06       7.94
8   1320_at     PTPN21            6.45       6.63       6.77       6.87
9   1405_i_at   HSPA6             1379.57    215.27     191.34     108.38
10  1431_at     CYP2E1            5.94       6.11       6.11       6.06

To combine rows that share a 'Gene symbol' by taking their mean value, I used the following code:

myExprdat_aggregate <- aggregate(myExprdat[, -c(1,2)],
          by = list(Gene = myExprdat$`Gene symbol`),
          FUN = mean,
          na.rm = TRUE)

I get a data frame with 8 rows (one per gene symbol) and 5 columns (Gene and Sample1 to Sample4), but I don't know why all cells of 'myExprdat_aggregate' are NA.

I would appreciate any comments.

aggregate Microarray Collapsing Probes • 1.6k views

ADD COMMENT • link 5.7 years ago by modarzi ▴ 170

score 2 · Accepted Answer · 2019-11-16

5.7 years ago

Kevin Blighe 89k

Hey, what is the output of str(myExprdat)? My guess is that some of your 'numerical' columns are not encoded numerically. Note that, for by, you could just use myExprdat[2].

ADD COMMENT • link 5.7 years ago by Kevin Blighe 89k

I ran str(myExprdat) and got the following result:

'data.frame':   10 obs. of  10 variables:
 $ ID         : Factor w/ 54613 levels "1007_s_at","1053_at",..: 1 2 3 4 5 6 7 8 9 10
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Gene symbol: chr  "MIR4640///DDR1" "RFC2" "HSPA6" "CYP2E1" ...
 $ Sample1  : Factor w/ 1348 levels "1.62","1.82",..: 46 117 43 1234 823 158 1289 1072 180 978
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample2  : Factor w/ 1347 levels "1.64","1.83",..: 710 218 1054 840 825 1251 1226 1074 450 1062
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample3  : Factor w/ 1349 levels "1.61","1.8","10.06",..: 1126 680 278 845 828 367 1228 1079 367 1064
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample4  : Factor w/ 1356 levels "1.62","1.83",..: 206 526 518 13 830 1180 1178 1089 49 1071

I don't understand what you mean by your last comment:

"Note, that, for by, you could just use myExprdat[2]"

ADD REPLY • link 5.7 years ago by modarzi ▴ 170

Hey, well, there is your problem: your numerical values are encoded as factors. You will have to go back a few steps to find out which step causes these numerical values to be treated as factors.
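As a minimal sketch of why factor columns produce NA here, the toy vector below (values borrowed from Sample1, not the real data) shows that mean() refuses a factor, and that calling as.numeric() directly on a factor returns the internal level codes and not the values. Converting through character is the usual way around this.

```r
# Toy factor column, similar to what str(myExprdat) reports for Sample1
x <- factor(c("108.38", "121.1", "107.63"))

# mean() on a factor returns NA (with a warning), so aggregate() yields NA too
m_factor <- suppressWarnings(mean(x, na.rm = TRUE))

# as.numeric() on a factor gives the integer level codes, not the values
codes <- as.numeric(x)

# Going through character recovers the real numbers
vals <- as.numeric(as.character(x))
m_vals <- mean(vals)
```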

For the second comment, I mean that you just have to do:

myExprdat_aggregate <- aggregate(
  myExprdat[, -c(1,2)],
  by = myExprdat[2],
  FUN = mean,
  na.rm = TRUE)
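To illustrate the point about by, here is a small sketch on toy data (the column names ID, Gene and S1 are made up for the example). A one-column data frame is itself a named list, so aggregate() can take it directly and reuse its column name for the grouping column.

```r
# Toy data with hypothetical column names
d <- data.frame(ID   = c("p1", "p2", "p3"),
                Gene = c("A", "B", "A"),
                S1   = c(1, 2, 3),
                stringsAsFactors = FALSE)

# Explicit named list, as in the original question
r1 <- aggregate(d[, "S1", drop = FALSE], by = list(Gene = d$Gene), FUN = mean)

# One-column data frame: its column name 'Gene' is reused automatically
r2 <- aggregate(d[, "S1", drop = FALSE], by = d[2], FUN = mean)
```

Both calls should give the same two-row result, one row per gene.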

ADD REPLY • link 5.7 years ago by Kevin Blighe 89k

Dear Dr. Blighe, I went back and found the reason the columns were factors. Now, when I run str(myExprdat), I get the following result:

'data.frame':   10 obs. of  10 variables:
 $ ID         : chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Gene symbol: chr  "MIR4640///DDR1" "RFC2" "HSPA6" "PAX8" ...
 $ Sample1  : chr  "108.38" "121.1" "107.63" "8.51" ...
 $ Sample2  : chr  "321.8" "148.06" "59.71" "4.72" ...
 $ Sample3 : chr  "66.72" "306.55" "163.14" "4.79" ...
 $ Sample4 : chr  "142.02" "242.19" "24.42" "10.78" ...

I then ran aggregate() but again got a data frame full of NA.

I really don't know why I got that result.

ADD REPLY • link 5.7 years ago by modarzi ▴ 170

Similar issue here... now, however, your numerical values are encoded as characters. You will have to encode them as numeric values. Here, I will reproduce your problem and then solve it:

h
  col1 col2    a      b
1    a    j 87.8     56
2    b    k 5.55    453
3    c    f  7.6 545.45

str(h)
'data.frame':   3 obs. of  4 variables:
 $ col1: Factor w/ 3 levels "a","b","c": 1 2 3
 $ col2: Factor w/ 3 levels "f","j","k": 2 3 1
 $ a   : chr  "87.8" "5.55" "7.6"
 $ b   : chr  "56" "453" "545.45"

Now, convert the relevant columns to numerical values:

h <- data.frame(h[,1:2], apply(h[,3:ncol(h)], 2, as.numeric))

h
  col1 col2     a      b
1    a    j 87.80  56.00
2    b    k  5.55 453.00
3    c    f  7.60 545.45

str(h)
'data.frame':   3 obs. of  4 variables:
 $ col1: Factor w/ 3 levels "a","b","c": 1 2 3
 $ col2: Factor w/ 3 levels "f","j","k": 2 3 1
 $ a   : num  87.8 5.55 7.6
 $ b   : num  56 453 545
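The same fix can be sketched on a few rows rebuilt from the question's table (only two sample columns are reproduced here, stored as text to mimic the problem). This version uses lapply() in place of apply(), an alternative to the code above that keeps the data frame structure and also works when there is a single sample column.

```r
# A few rows from the question, with expression values stored as character
myExprdat <- data.frame(
  ID = c("1007_s_at", "1053_at", "117_at", "121_at", "1405_i_at", "1431_at"),
  `Gene symbol` = c("MIR4640///DDR1", "RFC2", "HSPA6", "CYP2E1", "HSPA6", "CYP2E1"),
  Sample1 = c("108.38", "121.1", "107.63", "8.51", "1379.57", "5.94"),
  Sample2 = c("321.8", "148.06", "59.71", "4.72", "215.27", "6.11"),
  check.names = FALSE, stringsAsFactors = FALSE)

# Convert every sample column (all but the first two) to numeric;
# as.character() first makes this safe for factor columns as well
myExprdat[-c(1, 2)] <- lapply(myExprdat[-c(1, 2)],
                              function(x) as.numeric(as.character(x)))

# Collapse probes to one row per gene symbol by averaging
myExprdat_aggregate <- aggregate(myExprdat[-c(1, 2)],
                                 by = list(Gene = myExprdat$`Gene symbol`),
                                 FUN = mean, na.rm = TRUE)
```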

ADD REPLY • link 5.7 years ago by Kevin Blighe 89k

Thanks for your comment. I have run that code and now have a matrix that aggregates probes with the same gene symbol. However, the first row of the result has no gene name in the 'Gene' column. In other words, as the data frame below shows, the first row has a value for each sample but the gene name is blank:

    Gene       Sample1       Sample2      Sample3     Sample4    
1                108.38        321.8       66.72      19.43 
2   A1BG         121.1         148.06      306.55     242.19    
3   A1BG-AS1     107.63        59.71       163.14     24.42 
4   A1CF         8.51          4.72       4.79       10.78

I would appreciate your comments.

ADD REPLY • link 5.6 years ago by modarzi ▴ 170

I do not know the source of that particular problem. Please review all steps of your code, checking both input and output, in order to understand why there may be no gene name there.
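One possible cause worth checking, offered here as an assumption and not something confirmed in this thread, is that some probes have an empty gene symbol. aggregate() treats the empty string as its own group, and that group sorts first. The toy data below (made-up probe IDs and values) reproduces the symptom and shows one way to drop such probes before aggregating.

```r
# Toy data: the second probe has no gene symbol (hypothetical IDs and values)
d <- data.frame(ID      = c("p1", "p2", "p3"),
                Gene    = c("A1BG", "", "A1BG"),
                Sample1 = c(1, 10, 3),
                stringsAsFactors = FALSE)

# Empty symbols form their own group, which appears as the first row
with_blank <- aggregate(d["Sample1"], by = d["Gene"], FUN = mean)

# Keep only probes whose symbol is neither NA nor empty/whitespace
keep <- !is.na(d$Gene) & nzchar(trimws(d$Gene))
no_blank <- aggregate(d[keep, "Sample1", drop = FALSE],
                      by = d[keep, "Gene", drop = FALSE], FUN = mean)
```

Running table(myExprdat$`Gene symbol` == "") on the real data would show whether this applies.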

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

Thanks, Dr. Blighe. I have one more question:

As you can see in the 'Gene symbol' column of my microarray dataset, some probes have two gene names, e.g. 'MIR4640///DDR1' or 'MIR5193///UBA7'.

What should I do with these gene symbols? Can I remove the second part of the name (the part after '///')? That is, can I keep just 'MIR4640' or 'MIR5193'?

I would appreciate your comments.

ADD REPLY • link 5.6 years ago by modarzi ▴ 170
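The thread does not settle which choice is appropriate for '///' symbols, so the following is only a sketch of the mechanics in base R, using the two symbols quoted above. It shows three options: keeping the first listed symbol, dropping probes annotated with more than one gene, and listing every symbol per probe. Which one fits is an analysis decision and is not implied by the code.

```r
symbols <- c("MIR4640///DDR1", "RFC2", "MIR5193///UBA7")

# Option 1: keep only the first symbol listed for each probe
first_symbol <- trimws(sub("///.*$", "", symbols))

# Option 2: flag probes annotated with more than one gene and drop them
multi_mapped <- grepl("///", symbols, fixed = TRUE)
single_only  <- symbols[!multi_mapped]

# Option 3: list every symbol attached to each probe
all_symbols <- lapply(strsplit(symbols, "///", fixed = TRUE), trimws)
```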