Question: Unusual behavior of grepl in R.
unique379 (Spain) wrote, 3.5 years ago:

Dear all,

I need an explanation of why the grepl function behaves unusually. Or is it just my wrong interpretation?

I have a list of names (a character vector) that I need to extract from a matrix/data frame by row.names. To do this, I tried a few approaches with subset and grepl.

> head(miRNAs)  ## character vector of miRNA names
[1] "hsa-mir-200b"  "hsa-mir-200a"  "hsa-mir-429"   "hsa-mir-1256" 
[5] "hsa-mir-101-1" "hsa-mir-1262" 
> length(miRNAs)
[1] 129
> ## join all names into one regular expression: "name1|name2|..."
> pattern=paste(miRNAs, collapse="|")
> # with subset and grepl
> extract1=subset(expMatrix, grepl(pattern, row.names(expMatrix)))
> nrow(extract1)
[1] 150   ## more rows than the list has entries; why?
> # with grepl, without subset
> extract2=expMatrix[grepl(pattern, row.names(expMatrix)),]
> nrow(extract2)
[1] 150   ## same here: more rows than the list has entries; why?
> # with subset and %in%
> extract3=subset(expMatrix,row.names(expMatrix) %in% miRNAs)
> nrow(extract3)
[1] 129   ## as expected
> # without subset and grepl: index directly by row name
> extract4=expMatrix[miRNAs, ]
> nrow(extract4)
[1] 129   ## as expected too

 

So I have two questions:

1) Why does grepl behave oddly, with or without subset?

2) Which approach is better for extracting a list of names from a matrix/data frame: extract3 or extract4?

Thanks

 

 

Devon Ryan (Freiburg, Germany) wrote, 3.5 years ago:

It's quite likely that a name in your list, something like hsa-mir-200, occurs inside longer row names such as hsa-mir-200a and hsa-mir-200b. grepl matches substrings, so those rows are returned as well, whereas %in% and direct subsetting compare whole names.
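
To illustrate the substring point, here is a minimal sketch on made-up toy data. The small matrix and the two-name list are assumptions for illustration only, not the poster's data:

```r
# Hypothetical toy data: a two-name list and a four-row matrix
miRNAs <- c("hsa-mir-31", "hsa-mir-200a")
expMatrix <- matrix(
  seq_len(8), nrow = 4,
  dimnames = list(
    c("hsa-mir-31", "hsa-mir-3121", "hsa-mir-200a", "hsa-mir-429"),
    c("s1", "s2")
  )
)

# One regular expression with all names as alternatives
pattern <- paste(miRNAs, collapse = "|")

# grepl() is TRUE wherever the pattern occurs anywhere inside a row name,
# so "hsa-mir-3121" is picked up because it contains "hsa-mir-31"
by_grepl <- rownames(expMatrix)[grepl(pattern, rownames(expMatrix))]

# %in% compares whole strings, so only the two listed names are kept
by_in <- rownames(expMatrix)[rownames(expMatrix) %in% miRNAs]

print(by_grepl)
print(by_in)
```

On this toy input, the grepl route returns three rows and the %in% route returns two, which mirrors the 150 versus 129 discrepancy in the question.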


Actually, it is not the hsa-mir-200 / hsa-mir-200a case. The extra rows are as follows:

21 miRNAs found only in the set of 150:
hsa-mir-3121
hsa-mir-1278
hsa-mir-936
hsa-mir-3163
hsa-mir-3164
hsa-mir-3166
hsa-mir-3173
hsa-mir-3174
hsa-mir-3176
hsa-mir-3177
hsa-mir-3178
hsa-mir-3182
hsa-mir-3187
hsa-mir-1270-1
hsa-mir-1270-2
hsa-mir-3136
hsa-mir-3138
hsa-mir-1271
hsa-mir-1275
hsa-mir-939
hsa-mir-500b ## this one could be the same case

However, I noticed that the list above contains miRNAs such as hsa-mir-3164 and hsa-mir-3173, which resemble hsa-mir-3172. If that is the cause, and the pattern is matching only part of a name rather than the whole name, is there an argument I can enable in grepl? Something like grep -w on Linux (force PATTERN to match only whole words).

 

Reply written 3.5 years ago by unique379

If hsa-mir-31 (among a couple of others) were in your original list, then you would get something like this. There's no point in using grep in R for whole-word matches, which is why the option isn't there (though you can always use ^ and $ to anchor the pattern to the start and end of the string).
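
A minimal sketch of the ^ and $ approach, again on made-up names (the vectors below are hypothetical). One detail to watch when the pattern is built with paste(..., collapse="|"): the alternatives need to be wrapped in parentheses, otherwise ^ applies only to the first name and $ only to the last.

```r
# Hypothetical names for illustration
miRNAs   <- c("hsa-mir-31", "hsa-mir-200a")
rowNames <- c("hsa-mir-31", "hsa-mir-3121", "hsa-mir-200a", "hsa-mir-429")

# Wrap the alternatives in a group so the anchors cover every name:
# "^(hsa-mir-31|hsa-mir-200a)$"
anchored <- paste0("^(", paste(miRNAs, collapse = "|"), ")$")
exact <- rowNames[grepl(anchored, rowNames)]

# Without the group, ^ binds only to the first alternative and $ only
# to the last, so "hsa-mir-3121" still slips through
ungrouped <- paste0("^", paste(miRNAs, collapse = "|"), "$")
leaky <- rowNames[grepl(ungrouped, rowNames)]

# Word boundaries (\\b) are not a substitute here: "-" is a non-word
# character, so "hsa-mir-101" still matches inside "hsa-mir-101-1"
boundary_hit <- grepl("\\bhsa-mir-101\\b", "hsa-mir-101-1", perl = TRUE)

print(exact)
print(leaky)
print(boundary_hit)
```

This assumes the names contain no regular expression metacharacters beyond the hyphen, which is literal outside a bracket expression. For plain whole-name lookup, %in% or direct indexing by row name, as in extract3 and extract4, gives the same result without building a pattern.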

Reply written 3.5 years ago by Devon Ryan

Thanks, Ryan, for the clue... it's done :))

Reply written 3.5 years ago by unique379