Question: Problem collapsing probes for the same gene in microarray gene expression data
modarzi90 wrote:

Hi, this question may already have been asked by someone else, but I have the microarray gene expression data below as a data frame called 'myExprdat':

    ID          Gene symbol      Sample1   Sample2   Sample3   Sample4
1   1007_s_at   MIR4640///DDR1   108.38    321.8     66.72     19.43
2   1053_at     RFC2             121.1     148.06    306.55    242.19
3   117_at      HSPA6            107.63    59.71     163.14    24.42
4   121_at      CYP2E1           8.51      4.72      4.79      10.78
5   1255_g_at   GUCA1A           4.23      4.26      4.26      4.26
6   1294_at     MIR5193///UBA7   131.6     82.71     191.34    70.52
7   1316_at     THRA             9         8.17      8.06      7.94
8   1320_at     PTPN21           6.45      6.63      6.77      6.87
9   1405_i_at   HSPA6            1379.57   215.27    191.34    108.38
10  1431_at     CYP2E1           5.94      6.11      6.11      6.06

To combine the rows that share a 'Gene symbol' by taking their mean value, I used the code below:

myExprdat_aggregate <- aggregate(myExprdat[, -c(1,2)],
          by = list(Gene = myExprdat$`Gene symbol`),
          FUN = mean,
          na.rm = TRUE)

I get a data frame with 8 rows (one per gene symbol) and 5 columns (Gene and Sample1 to Sample4), but I don't know why every cell of 'myExprdat_aggregate' is NA.

I would appreciate any comments.

modified 28 days ago • written 28 days ago by modarzi90
Kevin Blighe wrote:

Hey, what is the output of str(myExprdat)? My guess is that some of your 'numerical' columns are not encoded numerically. Note that, for by, you could just use myExprdat[2].

modified 27 days ago • written 28 days ago by Kevin Blighe

I ran str(myExprdat) and got the results below:

'data.frame':   10 obs. of  10 variables:
 $ ID         : Factor w/ 54613 levels "1007_s_at","1053_at",..: 1 2 3 4 5 6 7 8 9 10
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Gene symbol: chr  "MIR4640///DDR1" "RFC2" "HSPA6" "CYP2E1" ...
 $ Sample1  : Factor w/ 1348 levels "1.62","1.82",..: 46 117 43 1234 823 158 1289 1072 180 978
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample2  : Factor w/ 1347 levels "1.64","1.83",..: 710 218 1054 840 825 1251 1226 1074 450 1062
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample3  : Factor w/ 1349 levels "1.61","1.8","10.06",..: 1126 680 278 845 828 367 1228 1079 367 1064
  ..- attr(*, "names")= chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Sample4  : Factor w/ 1356 levels "1.62","1.83",..: 206 526 518 13 830 1180 1178 1089 49 1071

I can't understand what you mean by your last comment:

"Note, that, for by, you could just use myExprdat[2]"

modified 28 days ago • written 28 days ago by modarzi90

Hey, well, there is your problem: your numerical values are encoded as factors. You will have to go back a few steps to find out which step is resulting in these numerical values being regarded as factors.

For the second comment, I mean that you just have to do:

myExprdat_aggregate <- aggregate(
  myExprdat[, -c(1,2)],
  by = myExprdat[2],
  FUN = mean,
  na.rm = TRUE)

written 28 days ago by Kevin Blighe
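As a minimal sketch of why factor-encoded values lead to NA, the toy vector below (hypothetical, not the poster's data) shows that mean() returns NA for a factor, that as.numeric() on a factor returns level codes rather than the stored values, and that converting through character recovers the numbers. A common cause of such columns is reading a file with stringsAsFactors = TRUE, which was the default before R 4.0.0; whether that applies here is an assumption.

```r
# Hypothetical toy vector: numbers stored as a factor
x <- factor(c("108.38", "121.1", "107.63"))

# mean() on a factor returns NA (with a warning), so aggregate() yields NA too
m_factor <- suppressWarnings(mean(x))

# as.numeric() on a factor gives the internal level codes, not the values
codes <- as.numeric(x)

# Going through character recovers the actual numbers
vals <- as.numeric(as.character(x))
m_vals <- mean(vals)
```

The same check on the real data would be sapply(myExprdat, class), which lists the class of every column at a glance.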

Dear Dr. Blighe, I went back and found why the values were factors. I ran str(myExprdat) again and got the result below:

'data.frame':   10 obs. of  10 variables:
 $ ID         : chr  "1007_s_at" "1053_at" "117_at" "121_at" ...
 $ Gene symbol: chr  "MIR4640///DDR1" "RFC2" "HSPA6" "PAX8" ...
 $ Sample1  : chr  "108.38" "121.1" "107.63" "8.51" ...
 $ Sample2  : chr  "321.8" "148.06" "59.71" "4.72" ...
 $ Sample3 : chr  "66.72" "306.55" "163.14" "4.79" ...
 $ Sample4 : chr  "142.02" "242.19" "24.42" "10.78" ...

I then ran aggregate() again, but I still got a data frame full of NA.

I really don't know why I got that result.

modified 28 days ago • written 28 days ago by modarzi90

Similar issue here... now, however, your numerical values are encoded as characters. You will have to encode them as numeric values. Here, I will reproduce your problem and then solve it:

h
  col1 col2    a      b
1    a    j 87.8     56
2    b    k 5.55    453
3    c    f  7.6 545.45

str(h)
'data.frame':   3 obs. of  4 variables:
 $ col1: Factor w/ 3 levels "a","b","c": 1 2 3
 $ col2: Factor w/ 3 levels "f","j","k": 2 3 1
 $ a   : chr  "87.8" "5.55" "7.6"
 $ b   : chr  "56" "453" "545.45"

Now, convert the relevant columns to numerical values:

h <- data.frame(h[,1:2], apply(h[,3:ncol(h)], 2, as.numeric))

h
  col1 col2     a      b
1    a    j 87.80  56.00
2    b    k  5.55 453.00
3    c    f  7.60 545.45

str(h)
'data.frame':   3 obs. of  4 variables:
 $ col1: Factor w/ 3 levels "a","b","c": 1 2 3
 $ col2: Factor w/ 3 levels "f","j","k": 2 3 1
 $ a   : num  87.8 5.55 7.6
 $ b   : num  56 453 545

written 28 days ago by Kevin Blighe
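The same conversion can be carried over to a data frame shaped like myExprdat. The sketch below uses a small hypothetical stand-in (three probes, two samples, values taken from the table in the question) and assumes the first two columns are named ID and Gene symbol. It uses lapply() instead of apply() so the columns are converted in place; that is a stylistic choice, not a correction.

```r
# Hypothetical stand-in for myExprdat: expression values stored as character
myExprdat <- data.frame(
  ID = c("117_at", "1405_i_at", "1053_at"),
  `Gene symbol` = c("HSPA6", "HSPA6", "RFC2"),
  Sample1 = c("107.63", "1379.57", "121.1"),
  Sample2 = c("59.71", "215.27", "148.06"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert every sample column (everything except the two annotation columns)
sample_cols <- setdiff(names(myExprdat), c("ID", "Gene symbol"))
myExprdat[sample_cols] <- lapply(myExprdat[sample_cols], as.numeric)

# Average the probes that map to the same gene symbol
myExprdat_aggregate <- aggregate(
  myExprdat[sample_cols],
  by = myExprdat["Gene symbol"],
  FUN = mean,
  na.rm = TRUE
)
```

Running str(myExprdat) after the conversion should show num for each sample column; if any value could not be parsed, as.numeric() turns it into NA with a warning, which is worth checking before aggregating.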

Thanks for your comment. I have run that code and now I have a matrix which aggregates probes with the same gene symbol. However, the first row of the result has no gene name in the "Gene" column. In other words, as the data frame below shows, the first row has values for each sample but no gene name:

    Gene       Sample1       Sample2      Sample3     Sample4    
1                108.38        321.8       66.72      19.43 
2   A1BG         121.1         148.06      306.55     242.19    
3   A1BG-AS1     107.63        59.71       163.14     24.42 
4   A1CF         8.51          4.72       4.79       10.78

I would appreciate it if you could share your comments with me.

modified 24 days ago • written 24 days ago by modarzi90

I do not know the source of that particular problem. Please review all steps of your code, checking both input and output, in order to understand why there may be no gene name there.

written 24 days ago by Kevin Blighe
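One possible explanation, not confirmed anywhere in the thread, is that some probes have an empty string as their gene symbol. aggregate() would then treat the empty string as a group of its own, and it would sort ahead of A1BG. The sketch below uses hypothetical toy data to show how such probes could be counted and dropped before aggregating.

```r
# Hypothetical toy data: one probe has no gene symbol (empty string)
myExprdat <- data.frame(
  ID = c("p1", "p2", "p3"),
  `Gene symbol` = c("", "A1BG", "A1BG"),
  Sample1 = c(108.38, 121.1, 107.63),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Count probes whose symbol is missing or blank
sym <- myExprdat[["Gene symbol"]]
blank <- is.na(sym) | trimws(sym) == ""
n_blank <- sum(blank)

# Keep only annotated probes so no unnamed group reaches aggregate()
annotated <- myExprdat[!blank, ]
```

If n_blank is zero on the real data, the cause lies elsewhere and the earlier steps need to be reviewed as suggested above.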

Thanks, Dr. Blighe. I have one more question:

As you can see in the 'Gene symbol' column of my microarray dataset, some probes have two gene names, e.g. 'MIR4640///DDR1' or 'MIR5193///UBA7'.

What should I do with these gene symbols? Can I remove the second part of the name (the part after '///')? I mean, can I keep just 'MIR4640' or 'MIR5193'?

I would appreciate it if you could share your comments with me.

modified 24 days ago • written 24 days ago by modarzi90
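The thread leaves this question unanswered. Mechanically, the string handling being asked about can be sketched as follows, using a hypothetical vector in the same format as the question. Whether keeping only the first symbol is appropriate for a given probe is an analysis decision that this sketch does not settle.

```r
# Hypothetical vector of symbols in the format shown in the question
sym <- c("MIR4640///DDR1", "RFC2", "MIR5193///UBA7")

# Flag probes annotated with more than one symbol
multi <- grepl("///", sym, fixed = TRUE)

# Option A: keep only the first listed symbol
first_sym <- sub("///.*$", "", sym)

# Option B: split into all listed symbols for inspection
all_syms <- strsplit(sym, "///", fixed = TRUE)
```

The multi flag also makes it possible to set multi-symbol probes aside (sym[!multi]) rather than renaming them, if that suits the analysis better.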