Merging two datasets related to proteins and calculating the molecular weight
6.9 years ago

Hello!

I have been trying to solve this problem for the better part of three days now. Here is a way to create sample data and then I will explain the problem:

set.seed(1)

id <- c(231466,231466,231466,158635979,158635979,158635979)
type <- c('Phosphoserine', 'Phosphoserine', 'Glycosylation', 'Disulfide bond',
          'Nitrated tyrosine', 'Nitrated tyrosine')
start <- floor(runif(6, min=1, max=300))
end <- start

PTMs <- data.frame(id,type,start,end)
PTMs$end[4] <- 300


id2 <- c(231466,231466,231466,158635979,158635979,158635979, 114326480, 114326480, 114326480)
start2 <- floor(runif(9, min=1, max=100))
end2 <- floor(runif(9, min=100, max=300))
molec <- floor(runif(9, min=1000, max=10000))
proteins <- data.frame(id2,start2,end2,molec)

proteins$start2[1] <- 1
proteins$start2[4] <- 1
proteins$start2[7] <- 1
proteins$end2[1] <- 300
proteins$end2[7] <- 270


PTM <- unique(type)
MOLECCHANGE <- floor(runif(4, min=2, max=100))
ptmchange <- data.frame(PTM, MOLECCHANGE)

So basically what I need to do is merge PTMs with proteins. proteins is a dataset of proteins, subregions, gene ID, sequence, molecular weight, and start/end locations. I left out the sequences because they are irrelevant for this merge. PTMs is gene ID, type of PTM, and start/end. ptmchange is a table of how each PTM will affect the molecular weight. This data doesn't represent reality; it is just a sample.

So in the merge of proteins to PTMs, I have to check whether the gene ID matches and whether the PTM falls within the start/end range of that protein region. If both of those are true, I'd like to make another column called predicted_molecweight and update it from the ptmchange table for that respective PTM.

I have been trying everything but haven't been able to get it. proteins is usually much larger than PTMs. Also, sometimes the same PTM occurs multiple times in the same protein. I wish I could just use the ID, but the location must be known to get an accurate molecular weight for the subregions. Also, all PTMs occur at only one location except disulfide bonds. This has been killing me for days! Any help will be appreciated. Thanks for taking the time to read this.
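For reference, the two checks described above (same gene ID, PTM inside the region) could be sketched in base R as a merge on the ID followed by a range filter. This is only a sketch under stated assumptions: a PTM counts only if its whole start..end span lies inside the region, the predicted weight is taken to be molec plus the sum of the matching MOLECCHANGE values, and the region and ptm_shift columns are hypothetical helpers that are not part of the original data.

```r
# Rebuild the sample data from the question (made-up values)
set.seed(1)
PTMs <- data.frame(
  id    = c(231466, 231466, 231466, 158635979, 158635979, 158635979),
  type  = c("Phosphoserine", "Phosphoserine", "Glycosylation",
            "Disulfide bond", "Nitrated tyrosine", "Nitrated tyrosine"),
  start = floor(runif(6, min = 1, max = 300))
)
PTMs$end <- PTMs$start
PTMs$end[4] <- 300

proteins <- data.frame(
  id2    = c(rep(231466, 3), rep(158635979, 3), rep(114326480, 3)),
  start2 = floor(runif(9, min = 1, max = 100)),
  end2   = floor(runif(9, min = 100, max = 300)),
  molec  = floor(runif(9, min = 1000, max = 10000))
)
proteins$start2[c(1, 4, 7)] <- 1
proteins$end2[1] <- 300
proteins$end2[7] <- 270

ptmchange <- data.frame(PTM = unique(PTMs$type),
                        MOLECCHANGE = floor(runif(4, min = 2, max = 100)))

# Hypothetical row key so each protein region can be identified after merging
proteins$region <- seq_len(nrow(proteins))

# Step 1: pair every region with every PTM that shares its gene ID
pairs <- merge(proteins, PTMs, by.x = "id2", by.y = "id")

# Step 2: keep only PTMs lying entirely inside the region
# (assumption: a disulfide bond counts only if both ends are inside)
pairs <- pairs[pairs$start >= pairs$start2 & pairs$end <= pairs$end2, ]

# Step 3: look up the weight change for each remaining PTM
pairs <- merge(pairs, ptmchange, by.x = "type", by.y = "PTM")

# Step 4: sum the changes per region; repeated PTMs each contribute once per row
shift <- tapply(pairs$MOLECCHANGE, pairs$region, sum)
proteins$ptm_shift <- as.vector(shift[as.character(proteins$region)])
proteins$ptm_shift[is.na(proteins$ptm_shift)] <- 0  # regions with no PTMs

# Assumed definition: original weight plus the summed PTM changes
proteins$predicted_molecweight <- proteins$molec + proteins$ptm_shift
```

Because every PTM/region pair is its own row after the merge, a PTM that occurs twice in a region is summed twice, so no separate 0/1 flag columns are needed. Regions whose gene ID has no PTMs at all (114326480 in the sample) keep their original weight.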

This is my current attempt, but it has many flaws. I was going to add columns for all possible PTMs (about 15) in PTMs, then set each to 0 or 1 depending on whether that PTM exists in that section of the protein. The problem, of course, is that if two of the same PTM occur, the second would not be counted, so I would somehow have to get a count. It just doesn't work.

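# Add one flag column per PTM type, initialised to 0
proteins <- cbind(proteins, setNames(lapply(PTM, function(x) 0), PTM))

for (i in seq_len(nrow(proteins))) {
  # PTMs on the same gene ID whose start position lies inside this region
  hit <- PTMs$id == proteins$id2[i] &
    PTMs$start >= proteins$start2[i] &
    PTMs$start <= proteins$end2[i]
  for (p in PTM) {
    # 1 if this PTM type occurs in the region at least once, otherwise 0;
    # repeated occurrences are not counted, which is the flaw described above
    proteins[i, p] <- as.integer(any(PTMs$type[hit] == p))
  }
}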

You're doing an awful lot of different things at the same time. To be effective in problem solving, we have to break it down into smaller tasks. Find where the problem really lies. Is this a coding/variables problem, or an algorithmic problem? R can be funny with variable scoping, so don't try to do too much in one variable. Use more separate lists and be sure you can follow what it's doing.


Thanks for the response! I know, I know. I have been trying to break it down, but the problem is that I need to know whether the gene ID matches AND whether the PTM is within the location at the same time. That is why I feel like I need to do it all at once.

Originally, as in my attempt above, I was going to code each PTM as a 0/1 flag, but that doesn't work because I would need the count, since the same PTM can occur multiple times in the same region.
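To illustrate the counting issue, here is a minimal sketch of how a per-region count of each PTM type could be built with table() instead of 0/1 flags. The data below is a tiny hand-made example with hypothetical values (not the random sample from the question), and the region key and the change vector are assumptions made for illustration only.

```r
# Tiny hand-made example (hypothetical values)
proteins <- data.frame(id2 = c(1, 1, 2),
                       start2 = c(1, 50, 1),
                       end2 = c(100, 100, 80))
PTMs <- data.frame(id = c(1, 1, 1, 2),
                   type = c("Phosphoserine", "Phosphoserine",
                            "Glycosylation", "Phosphoserine"),
                   start = c(10, 60, 70, 90),
                   end = c(10, 60, 70, 90))

# Hypothetical row key identifying each protein region
proteins$region <- seq_len(nrow(proteins))

# Pair regions with PTMs on the same ID, then keep PTMs inside the region
pairs <- merge(proteins, PTMs, by.x = "id2", by.y = "id")
pairs <- pairs[pairs$start >= pairs$start2 & pairs$end <= pairs$end2, ]

# Count PTMs per region and type; the factor levels keep regions with no hits
counts <- table(factor(pairs$region, levels = proteins$region), pairs$type)
print(counts)

# Made-up weight change per PTM type, standing in for the ptmchange table
change <- c(Glycosylation = 160, Phosphoserine = 80)

# Total change per region = counts (regions x types) times change per type
cm <- unclass(counts)
shift <- as.vector(cm %*% change[colnames(cm)])
```

Here region 1 contains two phosphoserines and one glycosylation, so both phosphoserines contribute to its total, while region 3 gets a count of zero because its only PTM lies outside the region.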
