Question

Match column row wise and do calculations

0

Entering edit mode

4.8 years ago

archana.bioinfo87 ▴ 210

Hi, I have a file like this. I have to put some criteria like if chromosome matched, then check start position matched; if matched then check for end position matched. If it ends position not matched then put value 1.1 else put value 1

Chromosome Start End   
    1   990595  990685 
    1   990595  990882
    1   1459777 1460585
    1   1563779 1563827
    2   1564102 1564215
    2   1564108 1564219

I am looking for output like this

 Chromosome Start End  
 1  990595  990685  1.1
1   990595  990882  1.2
1   1459777 1460585 2
1   1563779 1563827 3
2   1564102 1564215 1
2   1564108 1564219 2

R • 1.5k views

ADD COMMENT • link updated 4.8 years ago by Sam ★ 4.7k • written 4.8 years ago by archana.bioinfo87 ▴ 210

1

Entering edit mode

You probably need:

GenomicRanges::reduce()
GenomicRanges::findOverlaps()

Or from data.table package:

data.table::foverlaps()

ADD REPLY • link 4.8 years ago by zx8754 11k

0

Entering edit mode

What have you tried? Please add some context (on the biological problem you're trying to solve) so this pure R question can qualify as a bioinformatics question.

ADD REPLY • link 4.8 years ago by Ram 43k

0

Entering edit mode

Ok I will modify the question for better understanding.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

This looks like an XY problem. What are you doing this exercise for?

ADD REPLY • link 4.8 years ago by Ram 43k

0

Entering edit mode

I have to find the overlap region if it has been found then add new column like serial number 1.1, 1.2 etc

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

1

Entering edit mode

That is a verbatim description of the problem, not the reason why this problem needs solving (See what an XY problem means). There is also another factor you have not mentioned: the numbering resets with the start of a new chromosome. This is quite the convoluted logic and I don't know why you're doing it, so I want to make sure this problem needs solving before starting to solve it.

ADD REPLY • link 4.8 years ago by Ram 43k

score 0 · Answer 1 · 2019-06-25

0

Entering edit mode

4.8 years ago

Sam ★ 4.7k

In R, assume your data is stored in a data.frame call dat

dat$Group <- 1
counter<-0
prev.chr <- NA
prev.loc <- 0
sub.count <- 0.1
for(i in 1:nrow(dat)){
    if ( is.na(prev.chr) | prev.chr != dat$Chromosome[i]){
        counter <- 0
        prev.chr <- dat$Chromosome[i]
        prev.loc <- dat$Start[i]
        sub.count <- 0.1
    }
    else if(prev.loc == dat$Start[i]){
        dat$Group[i] <- dat$Group[i]+counter+sub.count
        sub.count<- sub.count+0.1
    }else{
        prev.loc <- dat$Start[i]
        counter<- counter+1
        sub.count <- 0.1
        dat$Group[i] <- dat$Group[i] + counter
    }
}

(There are definitely better ways to go about it, the above codes are just as a reference)

ADD COMMENT • link 4.8 years ago by Sam ★ 4.7k

1

Entering edit mode

Your code is incomplete. Also, your first comparison is NA vs a value. See what happens when you do that:

> if(NA != 1) print('Hello')
Error in if (NA != 1) print("Hello") :
  missing value where TRUE/FALSE needed

On a side note, this sort of programming is like using python or java naively, where language specific features are discarded to "make something work". I'm sure there are better ways in R to partition a dataset and order within partitions in a custom manner, using a loop seems to be really inefficient in a language that vectorizes operations and uses the apply family to great effect.

On second thought, I guess OP's problem cannot really be solved without forcing R to process a data.frame row-by-row, which is why I suspect this is an XY problem.

EDIT: Here's a possible sequence of operations to get to the result:

Step-1: group by chromosome and start, and number groups

INPUT:

Chromosome Start    End   
1          990595   990685 
1          990595   990882
1          1459777  1460585
1          1563779  1563827
2          1564102  1564215
2          1564108  1564219

OUTPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

Step-2: group by chromosome gp.1; and number groups, while counting group membership

INPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

Step-3: Determine final group

If count is 1 for a group, gp is gp.1. If count > 1, gp = paste(gp.1,gp.2,sep=".")

INPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count    final.gp
1          990595   990685    1       1       2           1.1
1          990595   990882    1       2       2           1.2
1          1459777  1460585   2       1       1           2
1          1563779  1563827   3       1       1           3
2          1564102  1564215   1       1       1           1
2          1564108  1564219   2       1       1           2

Each of the three steps above can be vectorized, avoiding loops entirely.

ADD REPLY • link 4.8 years ago by Ram 43k

0

Entering edit mode

Opps, I thought the message didn't sent (accidentally click back), didn't realize it was sent out. Sorry about that. Will now try to complete it.

ADD REPLY • link 4.8 years ago by Sam ★ 4.7k

0

Entering edit mode

Thanks, a lot Sam. It's working. I got one warning message which I can ignore. You made my day!!!!!!!.

Thanks again.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

Could you please explain to us the reason for this exercise? This was too niche to be a commonplace operation and it would be useful to know when we might need such an operation in the future.

ADD REPLY • link 4.8 years ago by Ram 43k

0

Entering edit mode

Thanks, Sam for your help but unfortunately when I am trying your code I got the error message like this I stored all output value in d4.

Warning message: In dat$Group[i] <- dat$Group + counter + sub.count : number of items to replace is not a multiple of replacement length

d4 [1] 0.1 2.0 3.0 4.0 0.2 5.0 6.0 7.0

Any help is much appreciated.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

There's probably a typo, and it needs to be dat$Group[i] <- dat$Group[i] + .... Please understand code before copy-pasting it, lest you end up corrupting your data some day.

ADD REPLY • link 4.8 years ago by Ram 43k

0

Entering edit mode

Thanks Ram. I will take care.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

Yes, there's a typo. Have now updated the code

ADD REPLY • link 4.8 years ago by Sam ★ 4.7k