Question: For each row retain the cell with maximum value in R
5 weeks ago by haneenih7 (KAUST):

Hello,

I am trying to write R code that gets, for each gene, the GO label with the highest confidence score (the number that comes after the "|" symbol).

For each gene ID (each row) there are many GO labels (columns), up to 400. The GO term with the highest confidence score can be in any column.

See example:

GeneID         GO_01          GO_02           GO_03          GO_04
exi2A01G0001540.1      GO:0005575|0.853        GO:0005622|0.705        GO:0005623|0.846        GO:0005634|0.531
exi2A01G0001560.1      GO:0005575|0.324        GO:0044699|0.319        GO:0044464|0.324        GO:0005623|0.524
exi9A01G0045270.1      GO:0003674|0.356        GO:0005575|0.679        GO:0005622|0.539

I think it should be possible to retain only the GO label that has the highest confidence score.

For example, the results would look like this:

GeneID      GO-term
exi2A01G0001540.1   GO:0005575|0.853
exi2A01G0001560.1   GO:0005623|0.524
exi9A01G0045270.1   GO:0005575|0.679

I started with this R code:

GO_1 <- read.table("proteinGO-term_0.3.txt", header=TRUE, sep="\t", fill=TRUE)
# use the gene ID as row names:
GO_2 <- GO_1[,-1]
rownames(GO_2) <- GO_1[,1]
# I tried this, but it doesn't do what I want:
test <- apply(GO_2,1,function(x) which(x==max(x)))

Thanks !!!

modified 5 weeks ago by ATpoint • written 5 weeks ago by haneenih7
5 weeks ago by cpad0112 (India):
> cbind(test[,1],(t(apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\w*:[0-9]*\\|","",x)),decreasing = T)]))))[,1:2]
     [,1]                [,2]              
[1,] "exi2A01G0001540.1" "GO:0005575|0.853"
[2,] "exi2A01G0001560.1" "GO:0005623|0.524"
[3,] "exi9A01G0045270.1" "GO:0005575|0.679"
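
The one-liner above can be unpacked into smaller steps. In the attempt from the question, max(x) compares the whole strings alphabetically, so it never looks at the number after the "|". The following is a minimal sketch, not the answerer's exact code: it strips everything up to the "|", converts the rest to a number, and keeps the cell with the largest score. The helper name best_go is hypothetical, and the data frame is typed in by hand from the example in the question.

```r
# Example rows copied from the question; stringsAsFactors = FALSE keeps IDs as text
test <- data.frame(
  GeneID = c("exi2A01G0001540.1", "exi2A01G0001560.1", "exi9A01G0045270.1"),
  GO_01 = c("GO:0005575|0.853", "GO:0005575|0.324", "GO:0003674|0.356"),
  GO_02 = c("GO:0005622|0.705", "GO:0044699|0.319", "GO:0005575|0.679"),
  GO_03 = c("GO:0005623|0.846", "GO:0044464|0.324", "GO:0005622|0.539"),
  GO_04 = c("GO:0005634|0.531", "GO:0005623|0.524", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the cell whose score (text after "|") is largest
best_go <- function(cells) {
  cells <- cells[!is.na(cells)]                  # ignore empty cells
  if (length(cells) == 0) return(NA_character_)  # gene with no GO term at all
  scores <- as.numeric(sub(".*\\|", "", cells))  # "GO:0005575|0.853" -> 0.853
  cells[which.max(scores)]                       # first cell with the top score
}

result <- data.frame(
  GeneID = test$GeneID,
  GO_term = unname(apply(test[, -1], 1, best_go)),
  stringsAsFactors = FALSE
)
print(result)
```

On ties, which.max keeps the first of the tied cells, so the leftmost column wins.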
written 5 weeks ago by cpad0112

This code doesn't output the gene ID. It only retains the GO terms.

The output looks like:

"6937"  "GO:0005575|0.868" 
"6938"  "GO:0005575|0.876"
"6939"  "GO:0005575|0.399"
"6941"  "GO:0005575|0.345"
written 5 weeks ago by haneenih7
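
One plausible explanation for numbers appearing in place of gene IDs, which the thread itself does not confirm: if the GeneID column was read as a factor (the default for read.table before R 4.0.0), cbind uses the factor's integer codes instead of its labels. A small sketch with made-up IDs (geneA, geneB and geneC are placeholders, not data from the thread):

```r
# Made-up gene IDs stored as a factor, as read.table did by default before R 4.0.0
ids <- factor(c("geneB", "geneA", "geneC"))
terms <- c("GO:0005575|0.868", "GO:0005575|0.876", "GO:0005575|0.399")

codes_out <- cbind(ids, terms)                 # factor contributes its integer codes
labels_out <- cbind(as.character(ids), terms)  # converting first keeps the labels

print(codes_out)
print(labels_out)
```

If this is the cause, reading the file with stringsAsFactors = FALSE or wrapping the ID column in as.character() should bring the gene IDs back.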

Please post the example data. The code works on the data furnished in the OP. If you are concerned about column names, you can do this:

> cbind("GeneID"=test[,1],"GO-term"=apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\D+\\d+\\D","",x)),decreasing = T)][1]))

     GeneID              GO-term           
[1,] "exi2A01G0001540.1" "GO:0005575|0.853"
[2,] "exi2A01G0001560.1" "GO:0005623|0.524"
[3,] "exi9A01G0045270.1" "GO:0005575|0.679"
modified 5 weeks ago • written 5 weeks ago by cpad0112

I made the first column the row names, and it's working now. Thanks a lot!

written 5 weeks ago by haneenih7

Glad that it worked. But it is not supposed to work that way given the data you posted in OP.

written 5 weeks ago by cpad0112

Yeah, you are right: head(test[,1]) showed me the gene IDs, but for some reason running the whole line doesn't show them. Anyway, the good thing is that it now works when the first column is used as row names.

However, there is another problem. When a gene has only one GO term, the result for that gene is left empty. It should retain that single GO label for the corresponding gene, right?

modified 5 weeks ago • written 5 weeks ago by haneenih7

Sure, it should do so. It also depends on how you are reading the input (txt) file into R. I have created an example file in which some genes have a GO term in only one column and all other GO columns are empty. Follow the code below:

$ cat GO_test.txt 
GeneID  GO_01   GO_02   GO_03   GO_04
gene1   GO:0005575|0.853    GO:0005622|0.705    GO:0005623|0.846    GO:0005634|0.531
gene2   GO:0005575|0.324    GO:0044699|0.319    GO:0044464|0.324    GO:0005623|0.524
gene3   GO:0003674|0.356    GO:0005575|0.679    GO:0005622|0.539
gene4           GO:0005622|0.539
gene5       GO:0005575|0.679
gene6   GO:0005575|0.679

Gene 4 has an entry only in the third GO column; all other columns are empty. Gene 5 has an entry only in the second GO column, and gene 6 only in the first. Here is the R code:

> test=read.csv("GO_test.txt", header = T, sep = "\t", strip.white = T, na.strings = "")
> test
  GeneID            GO_01            GO_02            GO_03            GO_04
1  gene1 GO:0005575|0.853 GO:0005622|0.705 GO:0005623|0.846 GO:0005634|0.531
2  gene2 GO:0005575|0.324 GO:0044699|0.319 GO:0044464|0.324 GO:0005623|0.524
3  gene3 GO:0003674|0.356 GO:0005575|0.679 GO:0005622|0.539             <NA>
4  gene4             <NA>             <NA> GO:0005622|0.539             <NA>
5  gene5             <NA> GO:0005575|0.679             <NA>             <NA>
6  gene6 GO:0005575|0.679             <NA>             <NA>             <NA>
> cbind("GeneID"=test[,1],"GO-term"=apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\D+\\d+\\D","",x)),decreasing = T)][1]))
     GeneID  GO-term           
[1,] "gene1" "GO:0005575|0.853"
[2,] "gene2" "GO:0005623|0.524"
[3,] "gene3" "GO:0005575|0.679"
[4,] "gene4" "GO:0005622|0.539"
[5,] "gene5" "GO:0005575|0.679"
[6,] "gene6" "GO:0005575|0.679"
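
Since the real table can have up to 400 GO columns, a matrix-based variant may also be worth trying. This is only a sketch under stated assumptions: the input is tab-separated, empty cells are read as NA (na.strings = ""), and the object names go, cells, scores and best are made up for illustration. It turns all cells into a numeric score matrix and picks the top column per row with max.col.

```r
# A few tab-separated rows mirroring GO_test.txt, read from text instead of a file
lines <- c(
  "GeneID\tGO_01\tGO_02\tGO_03\tGO_04",
  "gene1\tGO:0005575|0.853\tGO:0005622|0.705\tGO:0005623|0.846\tGO:0005634|0.531",
  "gene3\tGO:0003674|0.356\tGO:0005575|0.679\tGO:0005622|0.539\t",
  "gene4\t\t\tGO:0005622|0.539\t",
  "gene6\tGO:0005575|0.679\t\t\t"
)
go <- read.delim(text = lines, na.strings = "", stringsAsFactors = FALSE)

cells <- as.matrix(go[, -1])                     # character matrix of GO cells
scores <- matrix(as.numeric(sub(".*\\|", "", cells)), nrow = nrow(cells))
scores[is.na(scores)] <- -Inf                    # empty cells can never win

best_col <- max.col(scores, ties.method = "first")      # top-scoring column per row
best <- cells[cbind(seq_len(nrow(cells)), best_col)]    # pick that cell in each row

out <- data.frame(GeneID = go$GeneID, GO_term = best, stringsAsFactors = FALSE)
print(out)
```

A gene with a single GO term keeps that term, because every other cell in its row scores -Inf.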
written 5 weeks ago by cpad0112

The code works perfectly. The problem was with the input file.

Thank you very much! I really appreciate your help :)

written 5 weeks ago by haneenih7

I moved @cpad0112's comment to an answer. Please accept it (green checkmark) to provide closure to this thread.

written 5 weeks ago by genomax

No problem. Keep visiting and contributing to Biostars.

written 5 weeks ago by cpad0112