Question: Match column row wise and do calculations
0
26 days ago by
archana.bioinfo87160 wrote:

Hi, I have a file like this. I have to put some criteria like if chromosome matched, then check start position matched; if matched then check for end position matched. If it ends position not matched then put value 1.1 else put value 1

``````Chromosome Start End
1   990595  990685
1   990595  990882
1   1459777 1460585
1   1563779 1563827
2   1564102 1564215
2   1564108 1564219
``````

I am looking for output like this

`````` Chromosome Start End
1  990595  990685  1.1
1   990595  990882  1.2
1   1459777 1460585 2
1   1563779 1563827 3
2   1564102 1564215 1
2   1564108 1564219 2
``````
R • 193 views
modified 26 days ago by Sam2.4k • written 26 days ago by archana.bioinfo87160
1

You probably need:

``````GenomicRanges::reduce()
GenomicRanges::findOverlaps()
``````

Or from data.table package:

``````data.table::foverlaps()
``````

What have you tried? Please add some context (on the biological problem you're trying to solve) so this pure R question can qualify as a bioinformatics question.

Ok I will modify the question for better understanding.

This looks like an XY problem. What are you doing this exercise for?

I have to find the overlap region if it has been found then add new column like serial number 1.1, 1.2 etc

1

That is a verbatim description of the problem, not the reason why this problem needs solving (See what an XY problem means). There is also another factor you have not mentioned: the numbering resets with the start of a new chromosome. This is quite the convoluted logic and I don't know why you're doing it, so I want to make sure this problem needs solving before starting to solve it.

0
26 days ago by
Sam2.4k
New York
Sam2.4k wrote:

In R, assume your data is stored in a data.frame call dat

``````dat\$Group <- 1
counter<-0
prev.chr <- NA
prev.loc <- 0
sub.count <- 0.1
for(i in 1:nrow(dat)){
if ( is.na(prev.chr) | prev.chr != dat\$Chromosome[i]){
counter <- 0
prev.chr <- dat\$Chromosome[i]
prev.loc <- dat\$Start[i]
sub.count <- 0.1
}
else if(prev.loc == dat\$Start[i]){
dat\$Group[i] <- dat\$Group[i]+counter+sub.count
sub.count<- sub.count+0.1
}else{
prev.loc <- dat\$Start[i]
counter<- counter+1
sub.count <- 0.1
dat\$Group[i] <- dat\$Group[i] + counter
}
}
``````

(There are definitely better ways to go about it, the above codes are just as a reference)

1

Your code is incomplete. Also, your first comparison is `NA` vs a value. See what happens when you do that:

``````> if(NA != 1) print('Hello')
Error in if (NA != 1) print("Hello") :
missing value where TRUE/FALSE needed
``````

On a side note, this sort of programming is like using python or java naively, where language specific features are discarded to "make something work". I'm sure there are better ways in R to partition a dataset and order within partitions in a custom manner, using a loop seems to be really inefficient in a language that vectorizes operations and uses the `apply` family to great effect.

On second thought, I guess OP's problem cannot really be solved without forcing R to process a data.frame row-by-row, which is why I suspect this is an XY problem.

### EDIT: Here's a possible sequence of operations to get to the result:

#### Step-1: group by chromosome and start, and number groups

INPUT:

``````Chromosome Start    End
1          990595   990685
1          990595   990882
1          1459777  1460585
1          1563779  1563827
2          1564102  1564215
2          1564108  1564219
``````

OUTPUT:

``````Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2
``````

#### Step-2: group by chromosome gp.1; and number groups, while counting group membership

INPUT:

``````Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2
``````

OUTPUT:

``````Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1
``````

#### Step-3: Determine final group

If count is 1 for a group, gp is gp.1. If count > 1, gp = paste(gp.1,gp.2,sep=".")

INPUT:

``````Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1
``````

OUTPUT:

``````Chromosome Start    End       gp.1    gp.2    gp.count    final.gp
1          990595   990685    1       1       2           1.1
1          990595   990882    1       2       2           1.2
1          1459777  1460585   2       1       1           2
1          1563779  1563827   3       1       1           3
2          1564102  1564215   1       1       1           1
2          1564108  1564219   2       1       1           2
``````

Each of the three steps above can be vectorized, avoiding loops entirely.

Opps, I thought the message didn't sent (accidentally click back), didn't realize it was sent out. Sorry about that. Will now try to complete it.

Thanks, a lot Sam. It's working. I got one warning message which I can ignore. You made my day!!!!!!!.

Thanks again.

Could you please explain to us the reason for this exercise? This was too niche to be a commonplace operation and it would be useful to know when we might need such an operation in the future.

Thanks, Sam for your help but unfortunately when I am trying your code I got the error message like this I stored all output value in d4.

Warning message: In dat\$Group[i] <- dat\$Group + counter + sub.count : number of items to replace is not a multiple of replacement length

d4 [1] 0.1 2.0 3.0 4.0 0.2 5.0 6.0 7.0

Any help is much appreciated.

There's probably a typo, and it needs to be `dat\$Group[i] <- dat\$Group[i] + ...`. Please understand code before copy-pasting it, lest you end up corrupting your data some day.

Thanks Ram. I will take care.