Extracting a specific string pattern from a list of objects
2
0
2.7 years ago
dodausp ▴ 150

Hi,

Here is a recurring problem I face from time to time, especially because I would rather have object names that resemble my file names than create a list with different names for those files. It is just easier for me to track down a file when troubleshooting. So, my question is: how can I extract a pattern from my object rather than a specific string of characters? For example:

vcfs <- list.files()

vcfs
[1] "OV-TCGA-05-1456-01.vcf"   "OV-TCGA-05-4578-01.vcf"   "OV-TCGA-08-5666-01.vcf"   "LUSC-TCGA-10-5684-01.vcf" "LUAD-TCGA-02-6574-01.vcf"


So, as you can see, the first part of each file defines the cohort type (OV, LUSC, LUAD) and the rest after "TCGA" is unique to each one of them. I would like to (1) remove the hyphens, (2) keep the cohort name, and (3) keep the 6 digits coming after "TCGA". So it should look like this:

"OV051456", "OV054578", "OV085666", "LUSC105684", "LUAD026574"


Now, I always struggle using those symbols (*, ., ?, \", "") to extract a string from a character object. So, if any of you could also recommend a good tutorial on those, I would truly appreciate it. And sorry for the simple question. I am not a hardcore bioinformatician. And I love how this community is always so engaging and helpful.

So, thanks a lot in advance!

Cheers,

Douglas

R string subsetting data TCGA SNP • 699 views
0

I also find regex confusing, and often refer to this site to help me:

https://regexr.com/

It has good information, cheatsheets, guides, and a live editor in which you can play around with your expressions.
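
As a quick illustration of the symbols mentioned in the question, here is a minimal sketch in R. The toy strings in `x` are made up for this example and are not from the thread; the point is only to show how `.`, an escaped dot, `?` and `*` behave inside a pattern.

```r
# Hypothetical toy strings, not taken from the thread
x <- c("a.vcf", "abvcf", "a-vcf", "vcf")

# "." matches any single character, so all three 5-character strings match
any_char <- grepl("^a.vcf$", x)

# "\\." (an escaped dot) matches a literal dot only
literal_dot <- grepl("^a\\.vcf$", x)

# "?" makes the preceding item optional (zero or one occurrence)
optional_a <- grepl("^a?vcf$", x)

# "*" repeats the preceding item zero or more times; ".*" means "anything"
repeated <- grepl("^a.*vcf$", x)
```

Note that R needs the backslash doubled inside a string literal, which is why the escaped dot is written as "\\." rather than "\.".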

3
2.7 years ago
Russ ▴ 470

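There's probably a nifty regex one-liner that will accomplish the task more efficiently, but my strategy is simple and works:

# `vcfs` is the character vector of file names from the question
vcf1 <- gsub("-TCGA-", "", vcfs)               # drop the "-TCGA-" separator
vcf2 <- gsub("-[0-9][0-9]\\.vcf$", "", vcf1)   # drop the trailing two-digit code and ".vcf"
vcf3 <- gsub("-", "", vcf2)                    # drop the remaining hyphens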

> vcf3
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
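
For reference, one possible single-call version uses capture groups and backreferences in `sub()`. This is only a sketch: it assumes every file name follows the cohort-TCGA-2 digits-4 digits-2 digits layout shown in the question, and the file names are retyped here so the snippet runs on its own.

```r
# File names copied from the question so the example is self-contained
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

# Three capture groups: cohort letters, two digits, four digits.
# The trailing "-NN.vcf" is matched but not captured, so it is dropped.
pattern <- "^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$"

# "\\1\\2\\3" pastes the three captured pieces back together
ids <- sub(pattern, "\\1\\2\\3", vcfs)
ids
```

One thing to keep in mind: `sub()` returns a string unchanged when the pattern does not match, so a file name with a different layout would pass through silently.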

1

That was really helpful! It worked just fine for all the files I had, with no errors. Also, I liked the way you laid it out. It made it very easy for me to understand each step of the routine.

Thanks a lot, Russ!

2
2.7 years ago
jweile ▴ 20

Having run into this kind of problem several times as well, I have since written a little helper function for these types of situations:

#' Extract regex groups (local)
#'
#' Locally excise regular expression groups from string vectors.
#' I.e. only extract the first occurrence of each group within each string.
#'
#' @param x A vector of strings from which to extract the groups.
#' @param re The regular expression defining the groups
#' @return A \code{matrix} containing the group contents,
#'      with one row for each element of x and one column for each group.
#' @keywords regular expression groups
#' @export
extract.groups <- function(x, re) {
  matches <- regexpr(re,x,perl=TRUE)
  start <- attr(matches,"capture.start")
  end <- start + attr(matches,"capture.length") - 1
  do.call(cbind,lapply(1:ncol(start), function(i) {
    sapply(1:nrow(start),function(j){
      if (start[j,i] > -1) substr(x[[j]],start[j,i],end[j,i]) else NA
    })
  }))
}


For your specific problem, it can be used as follows:

> groups <- extract.groups(vcfs,"^(\\w+)-TCGA-(\\d{2})-(\\d{4})-01\\.vcf$")
> groups
     [,1]   [,2] [,3]
[1,] "OV"   "05" "1456"
[2,] "OV"   "05" "4578"
[3,] "OV"   "08" "5666"
[4,] "LUSC" "10" "5684"
[5,] "LUAD" "02" "6574"
> output <- apply(groups,1,paste,collapse="")
> output
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
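
If you would rather not carry a helper function around, base R's `regexec()` and `regmatches()` can produce a similar group matrix. The following is a rough sketch, not a drop-in replacement for the function above: it assumes every file name matches the pattern (non-matching names would yield empty vectors and break the `rbind`), and the file names are retyped so the snippet runs on its own.

```r
# File names copied from the question so the example is self-contained
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

pattern <- "^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into one character vector per file name
m <- regmatches(vcfs, regexec(pattern, vcfs))

# Drop the first element (the full match) and stack the groups row by row
groups <- do.call(rbind, lapply(m, function(g) g[-1]))

# Paste each row together to get the final identifiers
ids <- apply(groups, 1, paste, collapse = "")
ids
```

Keeping the groups in a matrix is handy when the cohort or the numeric parts are needed separately later on, for example to split the files by cohort.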

1

I tried your routine and it worked nicely as well. And really nice that you also went all the way to explain what the function does. And I was glad to know that I am not alone with this "challenging" issue.

This community is ace! Thanks a lot, jweile!