Unusual behavior of grepl in R.
8.4 years ago
unique379 ▴ 90

Dear all,

I need an explanation of why the grepl function behaves unusually. Or is it just my misinterpretation?

I have a character vector of names whose rows I need to extract from a matrix/data frame by row.names. To do this, I tried a few approaches using subset and grepl.

> head(miRNAs) ### list of character
[1] "hsa-mir-200b"  "hsa-mir-200a"  "hsa-mir-429"   "hsa-mir-1256" 
[5] "hsa-mir-101-1" "hsa-mir-1262"
> length(miRNAs) ## length
[1] 129
> list=as.character(paste(miRNAs, collapse="|"))
> # with subset and grepl
> extract1=subset(expMatrix, grepl(list, row.names(expMatrix)))
> nrow(extract1)
[1] 150   ## found greater length than actual list, why?
> # without subset
> extract2=expMatrix[grepl(list, row.names(expMatrix)),]
> nrow(extract2)
[1] 150 ## Same here; found greater length than actual list, why?
> # with only subset
> extract3=subset(expMatrix,row.names(expMatrix) %in% miRNAs)
> nrow(extract3)
[1] 129 ## its perfect
> ## without subset and grepl
> extract4=expMatrix[miRNAs, ]
> nrow(extract4) ## its perfect too
[1] 129

So here I have two questions:

  1. Why does grepl behave oddly, with or without subset?
  2. Which approach is more suitable for extracting rows by a character vector of names from a matrix/data frame: extract3 or extract4?

Thanks

8.4 years ago

It's quite likely that one of the names in your list is a substring of other row names. For example, a name like hsa-mir-200 used as a pattern would also match the row names hsa-mir-200a and hsa-mir-200b with grepl, but not with %in% or direct subsetting.
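
To illustrate the difference between a regular-expression match and an exact comparison, here is a minimal sketch. The row names and the `wanted` vector are made-up toy data, not the poster's real matrix.

```r
# Hypothetical toy data (assumption): four row names, two wanted names.
row_names <- c("hsa-mir-200", "hsa-mir-200a", "hsa-mir-200b", "hsa-mir-429")
wanted    <- c("hsa-mir-200", "hsa-mir-429")

# Same construction as in the question: one alternation pattern.
pattern <- paste(wanted, collapse = "|")

# grepl() reports TRUE whenever the pattern occurs anywhere in the string,
# so "hsa-mir-200" also hits "hsa-mir-200a" and "hsa-mir-200b".
substring_hits <- row_names[grepl(pattern, row_names)]

# %in% compares whole strings only.
exact_hits <- row_names[row_names %in% wanted]

print(length(substring_hits))
print(length(exact_hits))
```

In this toy case grepl selects all four row names while %in% selects only the two that were asked for, which is the same kind of surplus seen in the question.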


Actually, it is not the case of hsa-mir-200 and hsa-mir-200a. The extra rows are as follows:

21 miRNAs found only in the set of 150:
hsa-mir-3121
hsa-mir-1278
hsa-mir-936
hsa-mir-3163
hsa-mir-3164
hsa-mir-3166
hsa-mir-3173
hsa-mir-3174
hsa-mir-3176
hsa-mir-3177
hsa-mir-3178
hsa-mir-3182
hsa-mir-3187
hsa-mir-1270-1
hsa-mir-1270-2
hsa-mir-3136
hsa-mir-3138
hsa-mir-1271
hsa-mir-1275
hsa-mir-939
hsa-mir-500b ## it could be the same case

However, I noticed that the extra rows above include names such as hsa-mir-3164 and hsa-mir-3173, which resemble miRNAs like hsa-mir-3172. If this happens because the pattern matches only part of the string rather than the whole name, is there an argument I can enable in grepl, like grep -w in Linux (force PATTERN to match only whole words)?


If hsa-mir-31 (among a couple of others) were in your original list, then you could get something like this, because the pattern hsa-mir-31 also matches longer names such as hsa-mir-3121. There's no point in using grep in R for whole-word matches, which is why the option isn't there (though you can always use ^ and $ to anchor the pattern to the start and end of the string).
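
If you still want to go through grepl, a minimal sketch of the anchoring idea looks like this. The data are made-up toy names, and note one detail that is easy to miss: with an alternation, the alternatives need to be wrapped in parentheses so that ^ and $ apply to every name and not only to the first and last one.

```r
# Hypothetical toy data (assumption), chosen to mimic the surplus rows above.
row_names <- c("hsa-mir-31", "hsa-mir-3121", "hsa-mir-3163", "hsa-mir-936")
wanted    <- c("hsa-mir-31", "hsa-mir-936")

# Anchor the whole alternation: ^(name1|name2)$
anchored <- paste0("^(", paste(wanted, collapse = "|"), ")$")
whole_hits <- row_names[grepl(anchored, row_names)]
print(whole_hits)

# A word boundary (\\b) is closer to grep -w, but "-" is not a word
# character, so it would still accept a name like "hsa-mir-31-1".
boundary_hits <- grepl("\\bhsa-mir-31\\b", c("hsa-mir-31-1", "hsa-mir-3121"))
print(boundary_hits)
```

This assumes the names contain no regular-expression metacharacters such as "." or "+"; if they do, they would need escaping first. For plain whole-name lookups, %in% or direct indexing avoids the problem entirely. Of those two, %in% keeps the row order of the matrix and silently skips names that are absent, while indexing by name follows the order of the name vector and, for a matrix, stops with a "subscript out of bounds" error when a name is missing.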


Thanks Ryan for your clue... it's done :))
