Question

Merging two datasets related to proteins and calculating the molecular weight

8.2 years ago

hakimelakhrass ▴ 80

Hello!

I have been trying to solve this problem for the better part of three days now. Here is a way to create sample data and then I will explain the problem:

set.seed(1)

id <- c(231466,231466,231466,158635979,158635979,158635979)
type <- c('Phosphoserine', 'Phosphoserine', 'Glycosylation', 'Disulfide bond',
          'Nitrated tyrosine', 'Nitrated tyrosine')
start <- floor(runif(6, min=1, max=300))
end <- start

PTMs <- data.frame(id,type,start,end)
PTMs$end[4] <- 300


id2 <- c(231466,231466,231466,158635979,158635979,158635979, 114326480, 114326480, 114326480)
start2 <- floor(runif(9, min=1, max=100))
end2 <- floor(runif(9, min=100, max=300))
molec <- floor(runif(9, min=1000, max=10000))
proteins <- data.frame(id2,start2,end2,molec)

proteins$start2[1] <- 1
proteins$start2[4] <- 1
proteins$start2[7] <- 1
proteins$end2[1] <- 300
proteins$end2[7] <- 270


PTM <- unique(type)
MOLECCHANGE <- floor(runif(4, min=2, max=100))
ptmchange <- data.frame(PTM, MOLECCHANGE)

So basically what I need to do is merge PTMs with proteins. proteins is a dataset of proteins, subregions, gene ID, sequence, molecular weight, and start/end locations. I left out the sequences because they are irrelevant for this merge. PTMs holds the gene ID, the type of PTM, and its start/end. ptmchange is a table of how each PTM will affect the molecular weight. This data doesn't represent reality; it is just a sample.

So when merging proteins with PTMs I have to check that the gene ID matches and that the PTM falls within the start/end range of that region. If both of those are true, I'd like to make another column called predicted_molecweight and update it from the ptmchange table for that respective PTM.

I have been trying everything but haven't been able to get it. proteins is usually much larger than PTMs. Also, sometimes the same PTM occurs multiple times in the same protein. I wish I could just use the ID, but the location must be known to get an accurate molecular weight for the subregions. Also, all PTMs occur at only one location except disulfide bonds. This has been killing me for days! Any help will be appreciated. Thanks for taking the time to read this.

This is my current attempt but it has many flaws. I was going to add columns for all possible PTMs (about 15) in PTMs, then add a 0 or 1 if it exists in that section of the protein. The problem, of course, is that if the same PTM occurs twice in a region it would only be flagged once, so I would somehow have to get a count. It just doesn't work.

# One 0/1 flag column per distinct PTM type, initialised to 0
ptm_types <- as.character(unique(PTMs$type))
proteins[ptm_types] <- 0

for (i in seq_len(nrow(proteins))) {
  # PTMs on the same ID whose start/end lie inside region i
  hit <- PTMs$id == proteins$id2[i] &
    PTMs$start >= proteins$start2[i] &
    PTMs$end <= proteins$end2[i]
  # Flag each type that is present; repeats of one type still only give 1
  for (p in ptm_types) {
    proteins[i, p] <- as.integer(p %in% PTMs$type[hit])
  }
}

R data protein • 2.1k views

ADD COMMENT • link 8.2 years ago by hakimelakhrass ▴ 80

You're doing an awful lot of different things at the same time. To be effective in problem solving we have to break it down into smaller tasks. Find where the problem really lies: is this a coding/variables problem, or an algorithmic problem? R can be funny with variable scoping, so don't try to do too much in one variable. Use more separate lists and be sure you can follow what each one is doing.
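
One way to split the task into smaller steps, as a minimal sketch: first pair each region with every PTM on the same ID, then filter those pairs by position. The toy values below are made up for illustration (they are not the random sample from the question), and the names `cand` and `hits` are hypothetical; the column names follow the question.

```r
# Toy data with made-up values, using the question's column names.
PTMs <- data.frame(
  id    = c(231466, 231466, 231466, 158635979),
  type  = c("Phosphoserine", "Phosphoserine", "Glycosylation", "Disulfide bond"),
  start = c(80, 112, 172, 50),
  end   = c(80, 112, 172, 300),
  stringsAsFactors = FALSE
)
proteins <- data.frame(
  id2    = c(231466, 231466, 158635979),
  start2 = c(1, 100, 1),
  end2   = c(300, 200, 250),
  molec  = c(9000, 4000, 7000)
)

# Step 1: pair every region with every PTM that shares its ID.
cand <- merge(proteins, PTMs, by.x = "id2", by.y = "id")

# Step 2: keep only pairs where the whole PTM span lies inside the region.
hits <- cand[cand$start >= cand$start2 & cand$end <= cand$end2, ]
```

Each row of `hits` is one PTM found inside one region, so a PTM type that occurs twice in a region simply appears twice. The intermediate `cand` table grows with the number of PTMs per ID, which may matter if proteins is very large.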

ADD REPLY • link 8.2 years ago by karl.stamm 4.1k

Thanks for the response! I know, I know. I have been trying to break it down, but the problem is that I need to know whether the gene ID matches AND whether the PTM is in the location at the same time. That is why I feel like I need to do it all at once.

Originally, as in the attempt above, I was going to encode each PTM as a 0 or 1, but that doesn't work because I need the count, since the same PTM can occur multiple times in the same region.
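
To make the counting idea concrete, here is a rough sketch, assuming the region/PTM matches have already been collected into a long table with one row per PTM occurrence. The tables `hits` and `regions`, the `region` key, and the mass-change values are all hypothetical placeholders, not real data.

```r
# Hypothetical matches: one row per PTM occurrence inside a region.
hits <- data.frame(
  region = c(1, 1, 1, 2, 2),
  type   = c("Phosphoserine", "Phosphoserine", "Glycosylation",
             "Phosphoserine", "Glycosylation"),
  stringsAsFactors = FALSE
)
regions <- data.frame(region = 1:3, molec = c(9000, 4000, 7000))
# Placeholder mass changes, made up for this example.
ptmchange <- data.frame(PTM = c("Phosphoserine", "Glycosylation"),
                        MOLECCHANGE = c(80, 162),
                        stringsAsFactors = FALSE)

# Count of each PTM type per region (repeats are counted, not reduced to 0/1).
counts <- table(hits$region, hits$type)

# Look up the mass change for every occurrence, then sum per region.
hits$change <- ptmchange$MOLECCHANGE[match(hits$type, ptmchange$PTM)]
shift <- aggregate(change ~ region, data = hits, FUN = sum)

# Left join so regions without any PTM keep their original weight.
out <- merge(regions, shift, by = "region", all.x = TRUE)
out$change[is.na(out$change)] <- 0
out$predicted_molecweight <- out$molec + out$change
```

Because every occurrence is its own row, summing the looked-up changes handles repeated PTMs without needing a separate column per PTM type; `counts` is only there to inspect how many of each type landed in each region.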

ADD REPLY • link 8.2 years ago by hakimelakhrass ▴ 80