Question: Match column row wise and do calculations
0
gravatar for archana.bioinfo87
26 days ago by
archana.bioinfo87160 wrote:

Hi, I have a file like this. I have to put some criteria like if chromosome matched, then check start position matched; if matched then check for end position matched. If it ends position not matched then put value 1.1 else put value 1

Chromosome Start End   
    1   990595  990685 
    1   990595  990882
    1   1459777 1460585
    1   1563779 1563827
    2   1564102 1564215
    2   1564108 1564219

I am looking for output like this

 Chromosome Start End  
 1  990595  990685  1.1
1   990595  990882  1.2
1   1459777 1460585 2
1   1563779 1563827 3
2   1564102 1564215 1
2   1564108 1564219 2
R • 193 views
ADD COMMENTlink modified 26 days ago by Sam2.4k • written 26 days ago by archana.bioinfo87160
1

You probably need:

GenomicRanges::reduce()
GenomicRanges::findOverlaps()

Or from data.table package:

data.table::foverlaps()
ADD REPLYlink written 25 days ago by zx87547.8k

What have you tried? Please add some context (on the biological problem you're trying to solve) so this pure R question can qualify as a bioinformatics question.

ADD REPLYlink written 26 days ago by RamRS22k

Ok I will modify the question for better understanding.

ADD REPLYlink written 26 days ago by archana.bioinfo87160

This looks like an XY problem. What are you doing this exercise for?

ADD REPLYlink written 26 days ago by RamRS22k

I have to find the overlap region if it has been found then add new column like serial number 1.1, 1.2 etc

ADD REPLYlink written 26 days ago by archana.bioinfo87160
1

That is a verbatim description of the problem, not the reason why this problem needs solving (See what an XY problem means). There is also another factor you have not mentioned: the numbering resets with the start of a new chromosome. This is quite the convoluted logic and I don't know why you're doing it, so I want to make sure this problem needs solving before starting to solve it.

ADD REPLYlink modified 26 days ago • written 26 days ago by RamRS22k
0
gravatar for Sam
26 days ago by
Sam2.4k
New York
Sam2.4k wrote:

In R, assume your data is stored in a data.frame call dat

dat$Group <- 1
counter<-0
prev.chr <- NA
prev.loc <- 0
sub.count <- 0.1
for(i in 1:nrow(dat)){
    if ( is.na(prev.chr) | prev.chr != dat$Chromosome[i]){
        counter <- 0
        prev.chr <- dat$Chromosome[i]
        prev.loc <- dat$Start[i]
        sub.count <- 0.1
    }
    else if(prev.loc == dat$Start[i]){
        dat$Group[i] <- dat$Group[i]+counter+sub.count
        sub.count<- sub.count+0.1
    }else{
        prev.loc <- dat$Start[i]
        counter<- counter+1
        sub.count <- 0.1
        dat$Group[i] <- dat$Group[i] + counter
    }
}

(There are definitely better ways to go about it, the above codes are just as a reference)

ADD COMMENTlink modified 25 days ago • written 26 days ago by Sam2.4k
1

Your code is incomplete. Also, your first comparison is NA vs a value. See what happens when you do that:

> if(NA != 1) print('Hello')
Error in if (NA != 1) print("Hello") :
  missing value where TRUE/FALSE needed

On a side note, this sort of programming is like using python or java naively, where language specific features are discarded to "make something work". I'm sure there are better ways in R to partition a dataset and order within partitions in a custom manner, using a loop seems to be really inefficient in a language that vectorizes operations and uses the apply family to great effect.

On second thought, I guess OP's problem cannot really be solved without forcing R to process a data.frame row-by-row, which is why I suspect this is an XY problem.


EDIT: Here's a possible sequence of operations to get to the result:


Step-1: group by chromosome and start, and number groups

INPUT:

Chromosome Start    End   
1          990595   990685 
1          990595   990882
1          1459777  1460585
1          1563779  1563827
2          1564102  1564215
2          1564108  1564219

OUTPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

Step-2: group by chromosome gp.1; and number groups, while counting group membership

INPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

Step-3: Determine final group

If count is 1 for a group, gp is gp.1. If count > 1, gp = paste(gp.1,gp.2,sep=".")

INPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count    final.gp
1          990595   990685    1       1       2           1.1
1          990595   990882    1       2       2           1.2
1          1459777  1460585   2       1       1           2
1          1563779  1563827   3       1       1           3
2          1564102  1564215   1       1       1           1
2          1564108  1564219   2       1       1           2

Each of the three steps above can be vectorized, avoiding loops entirely.

ADD REPLYlink modified 25 days ago • written 25 days ago by RamRS22k

Opps, I thought the message didn't sent (accidentally click back), didn't realize it was sent out. Sorry about that. Will now try to complete it.

ADD REPLYlink modified 25 days ago • written 25 days ago by Sam2.4k

Thanks, a lot Sam. It's working. I got one warning message which I can ignore. You made my day!!!!!!!.

Thanks again.

ADD REPLYlink written 25 days ago by archana.bioinfo87160

Could you please explain to us the reason for this exercise? This was too niche to be a commonplace operation and it would be useful to know when we might need such an operation in the future.

ADD REPLYlink written 25 days ago by RamRS22k

Thanks, Sam for your help but unfortunately when I am trying your code I got the error message like this I stored all output value in d4.

Warning message: In dat$Group[i] <- dat$Group + counter + sub.count : number of items to replace is not a multiple of replacement length

d4 [1] 0.1 2.0 3.0 4.0 0.2 5.0 6.0 7.0

Any help is much appreciated.

ADD REPLYlink written 25 days ago by archana.bioinfo87160

There's probably a typo, and it needs to be dat$Group[i] <- dat$Group[i] + .... Please understand code before copy-pasting it, lest you end up corrupting your data some day.

ADD REPLYlink written 25 days ago by RamRS22k

Thanks Ram. I will take care.

ADD REPLYlink written 25 days ago by archana.bioinfo87160

Yes, there's a typo. Have now updated the code

ADD REPLYlink written 25 days ago by Sam2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 803 users visited in the last hour