Question: Extracting a specific string pattern from a list of objects
19 months ago, dodausp wrote:


Here is a recurring problem I face from time to time, especially because I would rather have object names that resemble my file names than create a list with different names for those files. It is just easier for me to track down a file when troubleshooting. So, my question is: how can I extract a pattern from my objects rather than a specific string of characters? For example:

vcfs <- list.files()

[1] "OV-TCGA-05-1456-01.vcf"   "OV-TCGA-05-4578-01.vcf"   "OV-TCGA-08-5666-01.vcf"   "LUSC-TCGA-10-5684-01.vcf" "LUAD-TCGA-02-6574-01.vcf"

So, as you can see, the first part of each file defines the cohort type (OV, LUSC, LUAD) and the rest after "TCGA" is unique to each one of them. I would like to (1) remove the hyphens, (2) keep the cohort name, and (3) keep the 6 digits coming after "TCGA". So it should look like this:

"OV051456", "OV054578", "OV085666", "LUSC105684", "LUAD026574"

Now, I always struggle with those symbols (*, ., ?, \", "") when extracting a string from a character object. So, if any of you could also recommend a good tutorial on them, I would truly appreciate it. And sorry for the simple question. I am not a hardcore bioinformatician. And I love how this community is always so engaging and helpful.

So, thanks a lot in advance!



tcga snp subsetting data string R • 487 views

I also find regex confusing, and often refer to this site to help me:

It has good information, cheatsheets, guides, and a live editor in which you can play around with your expressions.
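
For readers who find the symbols themselves confusing, a few base R calls show what the most common ones do. This is only a small sketch on one of the file names from the question; the variable names are made up for illustration.

```r
x <- "OV-TCGA-05-1456-01.vcf"

# "." matches ANY single character, so an unescaped dot is looser than it looks
dot_any     <- grepl("01.vcf", "01Xvcf")     # TRUE
# "\\." matches a literal dot only
dot_literal <- grepl("01\\.vcf", "01Xvcf")   # FALSE

# ".*" means "any character, zero or more times" (greedy)
cohort <- sub("-.*", "", x)                  # drops everything from the first hyphen on

# "[0-9]+" means "one or more digits"; sub() replaces the first match only
first_digits <- sub("[0-9]+", "#", x)

# "?" makes the preceding group optional; "^" anchors at the start of the string
optional <- grepl("^LU(SC|AD)?-", c("LUSC-TCGA", "LU-TCGA", "OV-TCGA"))
```

In R string literals the backslash itself has to be escaped, which is why a literal dot is written "\\." rather than "\.".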

Comment written 19 months ago by Russ

19 months ago, Russ (Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada) wrote:

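There's probably a nifty regex one-liner that will accomplish the task more efficiently, but my strategy is simple and works:

# 1. drop the "-TCGA-" separator
vcf1 <- gsub("-TCGA-", "", vcfs)
# 2. drop the trailing two-digit code and the ".vcf" extension
vcf2 <- gsub("-[0-9][0-9]\\.vcf$", "", vcf1)
# 3. drop the remaining hyphen
vcf3 <- gsub("-", "", vcf2)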

> vcf3
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
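
The three steps can also be collapsed into a single sub() call that uses capture groups and backreferences. This is a minimal sketch, assuming every file name follows the COHORT-TCGA-NN-NNNN-NN.vcf layout shown in the question; the name `ids` is just illustrative.

```r
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

# Group 1: cohort letters, group 2: two digits, group 3: four digits.
# "\\1\\2\\3" pastes the three captured groups back together.
ids <- sub("^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$",
           "\\1\\2\\3", vcfs)
ids
```

Note that sub() returns a string unchanged when the pattern does not match, so a file name with a different layout would pass through unmodified rather than raise an error.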

That was really helpful! It worked just fine for all the files I had, with no errors. Also, I liked the way you laid it out. It made it very easy for me to understand each step of the routine.

Thanks a lot, Russ!

Reply written 19 months ago by dodausp

19 months ago, jweile wrote:

Having run into this kind of problem several times as well, I have since written a little helper function for this type of situation:

#' Extract regex groups (local)
#' Locally excise regular expression groups from string vectors.
#' I.e. only extract the first occurrence of each group within each string.
#' @param x A vector of strings from which to extract the groups.
#' @param re The regular expression defining the groups
#' @return A \code{matrix} containing the group contents,
#'      with one row for each element of x and one column for each group.
#' @keywords regular expression groups
#' @export
extract.groups <- function(x, re) {
    # perl=TRUE makes regexpr() report the position of each capture group
    matches <- regexpr(re, x, perl=TRUE)
    start <- attr(matches, "capture.start")
    end <- start + attr(matches, "capture.length") - 1
    # build one column per group (i), one row per input string (j)
    do.call(cbind, lapply(1:ncol(start), function(i) {
        sapply(1:nrow(start), function(j) {
            # a start of -1 means the string did not match
            if (start[j,i] > -1) substr(x[[j]], start[j,i], end[j,i]) else NA
        })
    }))
}

For your specific problem, it can be used as follows:

> groups <- extract.groups(vcfs,"^(\\w+)-TCGA-(\\d{2})-(\\d{4})-01\\.vcf$")
> groups
     [,1]   [,2] [,3]  
[1,] "OV"   "05" "1456"
[2,] "OV"   "05" "4578"
[3,] "OV"   "08" "5666"
[4,] "LUSC" "10" "5684"
[5,] "LUAD" "02" "6574"
> output <- apply(groups,1,paste,collapse="")
> output
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
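
A similar result can be had without a helper function by combining base R's regexec() and regmatches(). This is only a sketch under the same file-name assumption; `pattern`, `ids`, and the extra "notes.txt" entry are hypothetical additions to show what happens when a name does not match.

```r
vcfs <- c("OV-TCGA-05-1456-01.vcf", "LUSC-TCGA-10-5684-01.vcf", "notes.txt")

pattern <- "^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$"

# regexec() records the full match plus each capture group;
# regmatches() turns those positions into a list of character vectors
m <- regmatches(vcfs, regexec(pattern, vcfs))

# Drop the first element (the full match) and paste the groups together;
# non-matching names yield an empty vector, which is mapped to NA here
ids <- vapply(m, function(g) {
    if (length(g) > 0) paste(g[-1], collapse = "") else NA_character_
}, character(1))
ids
```

Returning NA for non-matching names makes stray files easy to spot with is.na() before the identifiers are used as object names.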

I tried your routine and it worked nicely as well. It is also really nice that you went all the way to explain what the function does. And I was glad to know that I am not alone in this "challenging" issue.

This community is ace! Thanks a lot, jweile!

Reply written 19 months ago by dodausp